23 Oct, 2015

1 commit

  • commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream.

    SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I have found a Linux bug, and I have constructed a test case for
    it. I can reliably reproduce the problem on the Fedora 22 (4.0.4) kernel
    on x86_64. I construct a transparent huge page; when the parent and child
    process access the same huge page area through MAP_SHARED and MAP_PRIVATE
    mappings, a huge page copy-on-write failure can occur, and the kernel then
    unmaps the corresponding mmap area in the child. But the child's mmap area
    has the VM_MAYSHARE attribute, and the child process munmapping this area
    can trigger the VM_BUG_ON in set_vma_resv_flags()
    (vma->vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.
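    For illustration only (not part of the patch), here is a minimal userspace
    sketch of the mapping pattern described above: the same hugetlbfs file
    mapped both MAP_SHARED and MAP_PRIVATE across a fork. The mount point and
    file name are assumptions, and the child's write only exercises the failing
    COW path when the huge page pool is nearly exhausted; it illustrates the
    scenario rather than being a guaranteed reproducer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define HPS 0x200000UL /* assume 2MB huge pages */

    int main(void)
    {
        /* assumed hugetlbfs mount point */
        int fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);
        char *shared, *private;

        if (fd < 0 || ftruncate(fd, HPS) < 0) {
            perror("setup");
            return 1;
        }

        /* one VMA maps the file shared, another maps the same range private */
        shared = mmap(NULL, HPS, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        private = mmap(NULL, HPS, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (shared == MAP_FAILED || private == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        shared[0] = 's'; /* fault in the shared huge page */
        if (fork() == 0) {
            /* the child's write through the private mapping is the COW
               that can fail when the huge page pool is exhausted */
            private[0] = 'p';
            _exit(0);
        }
        wait(NULL);
        return 0;
    }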

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

04 Aug, 2015

1 commit

  • commit 641844f5616d7c6597309f560838f996466d7aac upstream.

    Currently the initial value of order in dissolve_free_huge_page is 64 or
    32, which leads to the following static checker warning:

    mm/hugetlb.c:1203 dissolve_free_huge_pages()
    warn: potential right shift more than type allows '9,18,64'

    This risks an infinite loop, because 1 << order (== 0) is used as the
    increment in a for loop like this:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
    ...

    So this patch fixes it by using a global minimum_order calculated at boot
    time.

    text data bss dec hex filename
    28313 469 84236 113018 1b97a mm/hugetlb.o
    28256 473 84236 112965 1b945 mm/hugetlb.o (patched)
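    As an illustration of why the fix works, a userspace model of the loop
    shape is below (toy code, not the kernel function; the order value is an
    assumption for 2MB huge pages with a 4KB base page size):

    #include <stdio.h>

    /* With the old uninitialized default the shift equalled the word size,
       making 1 << order undefined (often 0) so pfn never advanced.  A
       minimum order computed once at boot keeps the step strictly positive. */
    static unsigned int minimum_order = 9; /* assumed: 2MB / 4KB pages */

    static void dissolve_range_model(unsigned long start_pfn, unsigned long end_pfn)
    {
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn += 1UL << minimum_order)
            printf("would check pfn %lu\n", pfn);
    }

    int main(void)
    {
        dissolve_range_model(0, 4096); /* scans in 512-page steps */
        return 0;
    }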

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Reported-by: Dan Carpenter
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

16 Apr, 2015

5 commits

  • Now that we have easy access to hugepages' activeness, the existing
    helpers for getting this information can be cleaned up.

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • We are not safe against isolate_huge_page() being called concurrently on
    the same hugepage, which can leave the victim hugepage in an invalid
    state and result in a BUG_ON().

    The root of the problem is that we don't have any (easily accessible)
    information on struct page about a hugepage's activeness. Note that
    hugepages' activeness means just being linked to
    hstate->hugepage_activelist, which is not the same as normal pages'
    activeness represented by PageActive flag.

    Normal pages are isolated by isolate_lru_page() which prechecks PageLRU
    before isolation, so let's do the same for hugetlb with a new
    page_huge_active().

    set/clear_page_huge_active() should be called under hugetlb_lock.
    hugetlb_cow() and hugetlb_no_page() don't do this, which is justified
    because in these functions set_page_huge_active() is called right after
    the hugepage is allocated, before any other thread can try to isolate it.
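    A self-contained model of the precheck-under-lock pattern described above
    (a pthread mutex stands in for hugetlb_lock; the names are illustrative,
    not the kernel API):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Only pages marked active may be isolated, and the test-and-clear runs
       under one lock, so two concurrent isolators cannot both succeed. */
    static pthread_mutex_t hugetlb_lock_model = PTHREAD_MUTEX_INITIALIZER;

    struct hugepage_model {
        bool active; /* stands in for page_huge_active() */
    };

    static bool isolate_huge_page_model(struct hugepage_model *page)
    {
        bool ret = false;

        pthread_mutex_lock(&hugetlb_lock_model);
        if (page->active) {
            page->active = false; /* leaves the active list */
            ret = true;
        }
        pthread_mutex_unlock(&hugetlb_lock_model);
        return ret;
    }

    int main(void)
    {
        struct hugepage_model page = { .active = true };

        printf("first isolate: %d\n", isolate_huge_page_model(&page));
        printf("second isolate: %d\n", isolate_huge_page_model(&page));
        return 0;
    }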

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
    [fengguang.wu@intel.com: set_page_huge_active() can be static]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Make 'min_size=' an option when mounting a hugetlbfs filesystem. This
    option takes the same kind of value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less than or equal to size, otherwise the mount will fail. If min_size
    is specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.
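    For illustration, requesting both options from C might look like the
    sketch below; the mount point and sizes are assumptions, and the same
    option string can be passed to mount(8).

    #include <stdio.h>
    #include <sys/mount.h>

    /* Sketch: a 1G filesystem with 512M of huge pages reserved up front.
       If the reservation cannot be satisfied, mount() fails.  Requires root
       and an existing /mnt/huge directory (hypothetical path). */
    int main(void)
    {
        if (mount("none", "/mnt/huge", "hugetlbfs", 0,
                  "size=1G,min_size=512M") < 0) {
            perror("mount hugetlbfs");
            return 1;
        }
        return 0;
    }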

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The routines that perform subpool maximum size accounting,
    hugepage_subpool_get/put_pages(), are modified to also perform minimum
    size accounting. When a delta value is passed to these routines, they
    calculate how global reservations must be adjusted to maintain the
    subpool minimum size. The routines now return this global reserve count
    adjustment, which is then passed to the global accounting routine
    hugetlb_acct_memory().
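    A simplified, illustrative model of the 'get' side of that accounting
    (not the kernel code): the global adjustment returned is only the part of
    the request not already covered by the subpool's standing reservation.

    #include <stdio.h>

    struct subpool_model {
        long rsv_hpages; /* pages reserved to satisfy the subpool minimum */
    };

    /* Pages already reserved by the subpool were charged to the global pool
       at mount time, so a request of 'delta' pages only needs new global
       accounting for the remainder. */
    static long subpool_get_pages_model(struct subpool_model *spool, long delta)
    {
        long ret;

        if (spool->rsv_hpages >= delta) {
            spool->rsv_hpages -= delta;
            ret = 0;
        } else {
            ret = delta - spool->rsv_hpages;
            spool->rsv_hpages = 0;
        }
        return ret;
    }

    int main(void)
    {
        struct subpool_model spool = { .rsv_hpages = 8 };

        printf("global adjustment: %ld\n", subpool_get_pages_model(&spool, 5)); /* 0 */
        printf("global adjustment: %ld\n", subpool_get_pages_model(&spool, 5)); /* 2 */
        return 0;
    }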

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs allocates huge pages from the global pool as needed. Even if
    the global pool contains a sufficient number of pages for the filesystem size
    at mount time, those global pages could be grabbed for some other use. As
    a result, filesystem huge page allocations may fail due to lack of pages.

    Applications such as a database want to use huge pages for performance
    reasons. hugetlbfs filesystem semantics with ownership and modes work
    well to manage access to a pool of huge pages. However, the application
    would like some reasonable assurance that allocations will not fail due to
    a lack of huge pages. At application startup time, the application would
    like to configure itself to use a specific number of huge pages. Before
    starting, the application can check to make sure that enough huge pages
    exist in the system global pools. However, there are no guarantees that
    those pages will be available when needed by the application. What the
    application wants is exclusive use of a subset of huge pages.

    Add a new hugetlbfs mount option 'min_size=' to indicate that the
    specified number of pages will be available for use by the filesystem. At
    mount time, this number of huge pages will be reserved for exclusive use
    of the filesystem. If there is not a sufficient number of free pages, the
    mount will fail. As pages are allocated to and freed from the
    filesystem, the number of reserved pages is adjusted so that the specified
    minimum is maintained.

    This patch (of 4):

    Add a field to the subpool structure to indicate the minimum number of
    huge pages to always be used by this subpool. This minimum count includes
    allocated pages as well as reserved pages. If the minimum number of pages
    for the subpool have not been allocated, pages are reserved up to this
    minimum. An additional field (rsv_hpages) is used to track the number of
    pages reserved to meet this minimum size. The hstate pointer in the
    subpool is convenient to have when reserving and unreserving the pages.
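    A hedged sketch of the bookkeeping just described (field names follow the
    text above; the kernel's struct hugepage_subpool also carries its lock,
    the hstate pointer and the existing maximum-size fields, omitted here):

    struct subpool_min_size_sketch {
        long min_hpages; /* minimum pages this subpool must always have,
                            counting both allocated and reserved pages */
        long rsv_hpages; /* pages currently reserved against the global
                            pool to meet min_hpages */
    };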

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

15 Apr, 2015

2 commits

  • If __get_user_pages() is faulting a significant number of hugetlb pages,
    usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
    very large amount of memory.

    If the process has been oom killed, these allocations can needlessly
    deplete memory reserves.

    In the same way that commit 4779280d1ea4 ("mm: make get_user_pages()
    interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
    memory, based on the premise of commit 462e00cc7151 ("oom: stop
    allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now
    terminate when the process has been oom killed.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Acked-by: Greg Thelen
    Cc: Naoya Horiguchi
    Acked-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around
    follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte
    layout differ and using pte_page() on a huge pmd will return wrong
    results. Using pmd_page() instead fixes this.

    All architectures that were touched by that commit have pmd_page()
    defined, so this should not break anything on other architectures.

    Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*"
    Signed-off-by: Gerald Schaefer
    Acked-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: Michal Hocko , Andrea Arcangeli
    Cc: Martin Schwidefsky
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

13 Mar, 2015

1 commit

  • Now that gigantic pages are dynamically allocatable, care must be taken to
    ensure that p->first_page is valid before setting PageTail.

    If this isn't done, then it is possible to race and have compound_head()
    return NULL.
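    The ordering requirement can be illustrated with a small C11 model (toy
    code, not the kernel implementation): the head pointer must be published
    before the tail flag, and the reader must check the flag before trusting
    the pointer.

    #include <stdatomic.h>
    #include <stdio.h>

    /* 'first_page' plays p->first_page and 'tail' plays PageTail.  Because
       the flag is set after the pointer (release) and read before it
       (acquire), a reader that sees the flag never sees a NULL head. */
    struct page_model {
        _Atomic(const struct page_model *) first_page;
        atomic_bool tail;
    };

    static void set_tail(struct page_model *p, const struct page_model *head)
    {
        atomic_store_explicit(&p->first_page, head, memory_order_relaxed);
        atomic_store_explicit(&p->tail, true, memory_order_release);
    }

    static const struct page_model *compound_head_model(const struct page_model *p)
    {
        if (atomic_load_explicit(&p->tail, memory_order_acquire))
            return atomic_load_explicit(&p->first_page, memory_order_relaxed);
        return p; /* not a tail page: the page is its own head */
    }

    int main(void)
    {
        static struct page_model head, tail_page;

        set_tail(&tail_page, &head);
        printf("head found: %d\n", compound_head_model(&tail_page) == &head);
        return 0;
    }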

    Signed-off-by: David Rientjes
    Acked-by: Davidlohr Bueso
    Cc: Luiz Capitulino
    Cc: Joonsoo Kim
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Feb, 2015

7 commits

  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test program below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. The oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
        char *addr = NULL;
        unsigned long i;

        prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
        for (i = 0; i < NR_PUD; i++) {
            /* map a fresh 1GB area and touch one byte: this allocates a PMD
               page table (and a PTE table) for the area */
            addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                        MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED) {
                perror("mmap");
                break;
            }
            *addr = 'x';
            /* unmap and re-map the first 2MB: the touched page goes away, so
               VmRSS and VmPTE stay low, but the PMD table remains allocated */
            munmap(addr, PMD_SIZE);
            if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                     MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED) {
                perror("re-mmap");
                exit(1);
            }
        }
        printf("PID %d consumed %lu KiB in PMD page tables\n",
               getpid(), i * 4096 >> 10);
        return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process the
    same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by accounting
    the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
    check on exit(2).

    Accounting only happens on configurations where the PMD page table level
    is present (PMD is not folded). As with nr_ptes we use a per-mm counter.
    The counter value is used to calculate the baseline for the badness score
    used by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If __unmap_hugepage_range() tries to unmap the address range over which
    hugepage migration is on the way, we get the wrong page because pte_page()
    doesn't work for migration entries. This patch simply clears the pte for
    migration entries as we do for hwpoison entries.

    Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • There is a race condition between hugepage migration and
    change_protection(), where hugetlb_change_protection() doesn't care about
    migration entries and wrongly overwrites them. That causes unexpected
    results such as a kernel crash. HWPoison entries can also cause the same
    problem.

    This patch adds an is_hugetlb_entry_(migration|hwpoisoned) check to this
    function so that it can take the proper action.

    Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When running the test which causes the race as shown in the previous patch,
    we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().

    This race happens when the pte turns into a migration entry just after the
    first is_hugetlb_entry_migration() check in hugetlb_fault() has returned
    false. To fix this, we need to check pte_present() again after
    huge_ptep_get().

    This patch also reorders taking the ptl and calling pte_page(), because
    pte_page() should be done under the ptl. Due to this reordering, we need
    to use trylock_page() in the page != pagecache_page case to respect the
    locking order.
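    The locking-order point can be modelled in userspace (pthread mutexes as
    stand-ins; not the kernel code): with an established "page lock, then
    page table lock" order, code that already holds the page table lock may
    only try-lock the page lock, since blocking on it could deadlock against
    a thread holding the page lock and waiting for the page table lock.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;

    static bool fault_path_model(void)
    {
        bool got_page_lock;

        pthread_mutex_lock(&ptl); /* taken first, as after the reordering */
        got_page_lock = pthread_mutex_trylock(&page_lock) == 0;
        if (got_page_lock)
            pthread_mutex_unlock(&page_lock);
        pthread_mutex_unlock(&ptl);
        return got_page_lock; /* caller backs off and retries if false */
    }

    int main(void)
    {
        printf("page lock acquired: %d\n", fault_path_model());
        return 0;
    }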

    Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • We have a race condition between move_pages() and freeing hugepages, where
    move_pages() calls follow_page(FOLL_GET) for hugepages internally and
    tries to get its refcount without preventing concurrent freeing. This
    race crashes the kernel, so this patch fixes it by moving FOLL_GET code
    for hugepages into follow_huge_pmd(), taking the page table lock there.

    This patch intentionally removes the page==NULL check after pte_page().
    This is justified because pte_page() never returns NULL on any
    architecture or configuration.

    This patch changes the behavior of follow_huge_pmd() for tail pages, so
    that tail pages can now be pinned and returned. The caller must therefore
    be changed to handle returned tail pages properly.

    We could add similar locking to follow_huge_(addr|pud) for consistency,
    but it's not necessary because these functions currently don't support
    the FOLL_GET flag, so let's leave that for future development.

    Here is the reproducer:

    $ cat movepages.c
    #include <stdlib.h>
    #include <err.h>
    #include <sys/types.h>
    #include <numa.h>
    #include <numaif.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000
    #define PS 0x1000

    int main(int argc, char *argv[]) {
        int i;
        int nr_hp = strtol(argv[1], NULL, 0);
        int nr_p = nr_hp * HPS / PS;
        int ret;
        void **addrs;
        int *status;
        int *nodes;
        pid_t pid;

        pid = strtol(argv[2], NULL, 0);
        addrs = malloc(sizeof(char *) * nr_p + 1);
        status = malloc(sizeof(char *) * nr_p + 1);
        nodes = malloc(sizeof(char *) * nr_p + 1);

        while (1) {
            /* move every page of the target's hugepage range to node 1 ... */
            for (i = 0; i < nr_p; i++) {
                addrs[i] = (void *)ADDR_INPUT + i * PS;
                nodes[i] = 1;
                status[i] = 0;
            }
            ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                  MPOL_MF_MOVE_ALL);
            if (ret == -1)
                err(1, "move_pages");

            /* ... and then back to node 0, in an endless loop */
            for (i = 0; i < nr_p; i++) {
                addrs[i] = (void *)ADDR_INPUT + i * PS;
                nodes[i] = 0;
                status[i] = 0;
            }
            ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                  MPOL_MF_MOVE_ALL);
            if (ret == -1)
                err(1, "move_pages");
        }
        return 0;
    }

    $ cat hugepage.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000

    int main(int argc, char *argv[]) {
        int nr_hp = strtol(argv[1], NULL, 0);
        char *p;

        while (1) {
            /* repeatedly map, touch and unmap the hugepages so they keep
               being freed while movepages tries to pin them */
            p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p != (void *)ADDR_INPUT) {
                perror("mmap");
                break;
            }
            memset(p, 0, nr_hp * HPS);
            munmap(p, nr_hp * HPS);
        }
        return 0;
    }

    $ sysctl vm.nr_hugepages=40
    $ ./hugepage 10 &
    $ ./movepages 10 $(pgrep -f hugepage)

    Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Migrating hugepages and hwpoisoned hugepages are considered as non-present
    hugepages, and they are referenced via migration entries and hwpoison
    entries in their page table slots.

    This behavior causes a race condition because pmd_huge() doesn't
    distinguish non-huge pages from migrating/hwpoisoned hugepages.
    follow_page_mask() is one example where the kernel would call
    follow_page_pte() for such a hugepage even though that function is
    supposed to handle only normal pages.

    To avoid this, this patch makes pmd_huge() return true when pmd_none() is
    false *and* pmd_present() is false. We don't have to worry about mixing up
    a non-present pmd entry with a normal pmd (pointing to a leaf-level pte
    table) because pmd_present() is true for a normal pmd.

    The same race condition could happen in (x86-specific) gup_pmd_range(),
    where this patch simply adds a pmd_present() check instead of using
    pmd_huge(). This is because gup_pmd_range() is a fast path: if we have a
    non-present hugepage in this function, we go into gup_huge_pmd(), return
    0 at the flag mask check, and finally fall back to the slow path.
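    The pmd_huge() distinction described above can be captured in a tiny
    boolean model (not any architecture's actual implementation): a huge pmd
    entry is one that is not empty and is either a present huge leaf or a
    non-present (migration/hwpoison) entry.

    #include <stdbool.h>
    #include <stdio.h>

    /* 'none' = empty slot, 'present' = points at real memory,
       'leaf' = a present huge mapping. */
    struct pmd_model {
        bool none;
        bool present;
        bool leaf;
    };

    static bool pmd_huge_model(struct pmd_model pmd)
    {
        /* non-empty, and either a huge leaf or a non-present entry, which at
           the pmd level is left only by hugepage migration/hwpoison */
        return !pmd.none && (pmd.leaf || !pmd.present);
    }

    int main(void)
    {
        struct pmd_model normal = { .present = true }; /* page table pointer */
        struct pmd_model migration = { 0 };            /* non-none, non-present */
        struct pmd_model empty = { .none = true };

        printf("normal pmd:    %d\n", pmd_huge_model(normal));
        printf("migration pmd: %d\n", pmd_huge_model(migration));
        printf("empty pmd:     %d\n", pmd_huge_model(empty));
        return 0;
    }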

    Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently we have many duplicates in the definitions of
    follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
    patch tries to remove them. The basic idea is to put the default
    implementation for these functions in mm/hugetlb.c as weak symbols
    (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement
    arch-specific code only when the arch needs it.
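    As an aside, the weak-symbol mechanism itself can be shown in a few lines
    of standalone C (the function name is invented for the example; in the
    kernel the weak defaults live in mm/hugetlb.c and an architecture
    overrides one simply by providing an ordinary, strong definition):

    #include <stdio.h>

    /* Generic default: marked weak, so any other object file that provides a
       strong definition of the same symbol replaces it at link time. */
    __attribute__((weak)) const char *hypothetical_follow_huge(void)
    {
        return "generic default implementation";
    }

    int main(void)
    {
        /* with no arch-specific override linked in, the weak default runs */
        puts(hypothetical_follow_huge());
        return 0;
    }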

    For follow_huge_addr(), only powerpc and ia64 have their own
    implementation, and in all other architectures this function just returns
    ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as
    default.

    As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
    always return 0 on your architecture (as in ia64 or sparc), it's never
    called (the callsite is optimized away) no matter how it is implemented.
    So on such architectures, we don't need an arch-specific implementation.

    On some architectures (like mips, s390 and tile), the current
    arch-specific follow_huge_(pmd|pud)() is effectively identical to the
    common code, so this patch lets these architectures use the common code.

    One exception is metag, where pmd_huge() could return non-zero but it
    expects follow_huge_pmd() to always return NULL. This means that we need
    an arch-specific implementation which returns NULL. This behavior looks
    strange to me (because a non-zero pmd_huge() implies that the architecture
    supports PMD-based hugepages, so follow_huge_pmd() can/should return some
    relevant value), but that's beyond this cleanup patch, so let's keep it.

    Justification of non-trivial changes:
    - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
    patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
    is true when follow_huge_pmd() can be called (note that pmd_huge() has
    the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
    - in s390 and mips, HPAGE_MASK is used instead of PMD_MASK as done in the
    common code. This patch forces these archs to use PMD_MASK, but it's OK
    because the two are identical on both archs.
    In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
    In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
    PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
    PTE_ORDER is always 0, so these are identical.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Feb, 2015

1 commit

  • hugepages_treat_as_movable is declared as an unsigned long, but
    proc_dointvec() is used for parsing it:

    static struct ctl_table vm_table[] = {
        ...
        {
            .procname = "hugepages_treat_as_movable",
            .data = &hugepages_treat_as_movable,
            .maxlen = sizeof(int),
            .mode = 0644,
            .proc_handler = proc_dointvec,
        },

    This seems harmless, but it's better to use an int type here.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Manfred Spraul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

4 commits

  • This function is only called during initialization.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • No reason to duplicate the code of an existing macro.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.
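    A userspace model of what such wrappers amount to (a pthread rwlock
    stands in for the kernel rwsem; the struct is a stand-in, not struct
    address_space):

    #include <pthread.h>
    #include <stdio.h>

    struct mapping_model {
        pthread_rwlock_t i_mmap_rwsem; /* stand-in for the i_mmap lock */
    };

    static void i_mmap_lock_write_model(struct mapping_model *mapping)
    {
        pthread_rwlock_wrlock(&mapping->i_mmap_rwsem);
    }

    static void i_mmap_unlock_write_model(struct mapping_model *mapping)
    {
        pthread_rwlock_unlock(&mapping->i_mmap_rwsem);
    }

    int main(void)
    {
        struct mapping_model mapping;

        pthread_rwlock_init(&mapping.i_mmap_rwsem, NULL);
        i_mmap_lock_write_model(&mapping);
        puts("interval tree modification would happen here");
        i_mmap_unlock_write_model(&mapping);
        pthread_rwlock_destroy(&mapping.i_mmap_rwsem);
        return 0;
    }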

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • First, after flushing the TLB, there is no need to scan the ptes from the
    start again. Second, the address is advanced one step before bailing out
    of the loop.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

1 commit

  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

07 Aug, 2014

5 commits

  • It is possible for some platforms, such as powerpc, to set HPAGE_SHIFT to
    0 to indicate that huge pages are not supported.

    When this is the case, hugetlbfs is disabled during boot:

    hugetlbfs: disabling because there are no supported hugepage sizes

    Then in dissolve_free_huge_pages(), order is kept at its maximum (64 on
    64-bit), and the for loop below never terminates:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)

    As suggested by Naoya, the fix below checks hugepages_supported() before
    calling dissolve_free_huge_pages().

    [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
    Signed-off-by: Li Zhong
    Acked-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Signed-off-by: David Rientjes
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     
  • They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
    passing extra2 == NULL is equivalent to infinity.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Three different interfaces alter the maximum number of hugepages for an
    hstate:

    - /proc/sys/vm/nr_hugepages for global number of hugepages of the default
    hstate,

    - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages for global number of
    hugepages for a specific hstate, and

    - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages/mempolicy for number of
    hugepages for a specific hstate over the set of allowed nodes.

    Generalize the code so that a single function handles all of these
    writes instead of duplicating the code in two different functions.

    This decreases the number of lines of code, but also reduces the size of
    .text by about half a percent since set_max_huge_pages() can be inlined.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When returning from hugetlb_cow(), we always (1) put back the refcount
    for each referenced page -- always 'old', and 'new' if allocation was
    successful. And (2) retake the page table lock right before returning,
    as the caller expects. This logic can be simplified and encapsulated,
    as proposed in this patch. In addition to cleaner code, we also shave a
    few bytes off the instruction text:

    text data bss dec hex filename
    28399 462 41328 70189 1122d mm/hugetlb.o-baseline
    28367 462 41328 70157 1120d mm/hugetlb.o-patched

    Passes libhugetlbfs testcases.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This function always returns 1, so there is no need to check its return
    value in hugetlb_cow(). By doing so, we can get rid of the unnecessary
    WARN_ON call. While this logic perhaps existed as a way of identifying
    future unmap_ref_private() mishandling, in reality it serves no apparent
    purpose.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

31 Jul, 2014

1 commit

  • PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe1
    ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs
    another symbol to filter *hugetlbfs* pages.

    If a user wants to filter out user pages, makedumpfile tries to exclude
    them by checking whether the page is anonymous, but hugetlbfs pages aren't
    anonymous even though they are also user pages.

    We know it's possible to detect them in the same way PageHuge() does,
    so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
        if (!PageCompound(page))
            return 0;

        page = compound_head(page);
        return get_compound_page_dtor(page) == free_huge_page;
    }

    For that reason, this patch makes free_huge_page() public so that it can
    be exported to VMCOREINFO.

    Signed-off-by: Atsushi Kumagai
    Acked-by: Baoquan He
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     

24 Jul, 2014

1 commit

  • Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry") changed the order of
    huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
    in some workloads like hugepage-backed heap allocation via libhugetlbfs.
    This patch fixes it.

    The test program for the problem is shown below:

    $ cat heap.c
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define HPS 0x200000

    int main() {
        int i;
        char *p = malloc(HPS);

        memset(p, '1', HPS);
        for (i = 0; i < 5; i++) {
            if (!fork()) {
                memset(p, '2', HPS);
                p = malloc(HPS);
                memset(p, '3', HPS);
                free(p);
                return 0;
            }
        }
        sleep(1);
        free(p);
        return 0;
    }

    $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap

    Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry"), so is applicable to -stable kernels which
    include it.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Guillaume Morin
    Suggested-by: Guillaume Morin
    Acked-by: Hugh Dickins
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jun, 2014

1 commit

  • There's a race between fork() and hugepage migration, as a result we try
    to "dereference" a swap entry as a normal pte, causing kernel panic.
    The cause of the problem is that copy_hugetlb_page_range() can't handle
    "swap entry" family (migration entry and hwpoisoned entry) so let's fix
    it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

5 commits

  • We already have a function named hugepages_supported(), and the similar
    name hugepage_migration_support() is a bit uncomfortable, so let's rename
    it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • alloc_huge_page() currently mixes the normal code path with error
    handling logic. This patch moves the error handling logic out, to make
    the normal code path cleaner and reduce code duplication.

    Signed-off-by: Jianyu Zhan
    Acked-by: Davidlohr Bueso
    Reviewed-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • HugeTLB is limited to allocating hugepages whose size is less than
    MAX_ORDER order. This is because HugeTLB allocates hugepages via the
    buddy allocator. Gigantic pages (that is, pages whose size is greater
    than MAX_ORDER order) have to be allocated at boot time.

    However, boottime allocation has at least two serious problems. First,
    it doesn't support NUMA and second, gigantic pages allocated at boottime
    can't be freed.

    This commit solves both issues by adding support for allocating gigantic
    pages during runtime. It works just like regular sized hugepages,
    meaning that the interface in sysfs is the same, it supports NUMA, and
    gigantic pages can be freed.

    For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
    gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And to free them all:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    The one problem with gigantic page allocation at runtime is that it
    can't be serviced by the buddy allocator. To overcome that problem,
    this commit scans all zones from a node looking for a large enough
    contiguous region. When one is found, it's allocated by using CMA, that
    is, we call alloc_contig_range() to do the actual allocation. For
    example, on x86_64 we scan all zones looking for a 1GB contiguous
    region. When one is found, it's allocated by alloc_contig_range().

    One expected issue with that approach is that such gigantic contiguous
    regions tend to vanish as runtime goes by. The best way to avoid this
    for now is to make gigantic page allocations very early during system
    boot, say from a init script. Other possible optimization include using
    compaction, which is supported by CMA but is not explicitly used by this
    commit.

    It's also important to note the following:

    1. Gigantic pages allocated at boottime by the hugepages= command-line
    option can be freed at runtime just fine

    2. This commit adds support for gigantic pages only to x86_64. The
    reason is that I don't have access to nor experience with other archs.
    The code is arch independent though, so it should be simple to add
    support to different archs

    3. I didn't add support for hugepage overcommit, that is allocating
    a gigantic page on demand when
    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
    think it's reasonable to do the hard and long work required for
    allocating a gigantic page at fault time. But it should be simple
    to add this if wanted

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Next commit will add new code which will want to call
    for_each_node_mask_to_alloc() macro. Move it, its buddy
    for_each_node_mask_to_free() and their dependencies up in the file so the
    new code can use them. This is just code movement, no logic change.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Huge pages never get the PG_reserved bit set, so don't clear it.

    However, note that if the bit gets mistakenly set free_pages_check() will
    catch it.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino