12 Feb, 2015

40 commits

  • This patch makes do_mincore() use walk_page_vma(), which removes many
    lines of code by reusing the common page table walk code. (A userspace
    sketch of the mincore() interface follows this entry.)

    [daeseok.youn@gmail.com: remove unneeded variable 'err']
    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
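
    For illustration, a minimal userspace sketch of the mincore() interface
    that do_mincore() serves: it maps an anonymous region, faults in one
    page, and asks which pages are resident. This is a sketch only, not part
    of the patch, with error handling kept minimal.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t len = 16 * psz;
        unsigned char *vec = malloc(len / psz);   /* one byte per page */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED || !vec)
            return 1;
        buf[0] = 1;                     /* fault in the first page only */
        if (mincore(buf, len, vec))     /* kernel walks the page tables */
            return 1;
        for (size_t i = 0; i < len / psz; i++)
            printf("page %zu: %s\n", i,
                   (vec[i] & 1) ? "resident" : "not resident");
        return 0;
    }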
     
  • Currently the pagewalker splits all THP pages on any clear_refs request.
    That's not necessary: we can handle this at the PMD level.

    One side effect is that soft-dirty will potentially report more dirty
    memory, since we mark the whole THP page dirty at once.

    Sanity-checked with the CRIU test suite. More testing is required. (A
    userspace sketch of the soft-dirty interface follows this entry.)

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
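
    For context, a hedged userspace sketch of the soft-dirty interface that
    clear_refs drives, per Documentation/vm/soft-dirty.txt: writing "4" to
    /proc/PID/clear_refs clears the soft-dirty bits, and bit 55 of each
    /proc/PID/pagemap entry reports them. Requires CONFIG_MEM_SOFT_DIRTY;
    illustrative only, not from the patch.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static int soft_dirty(int pagemap_fd, void *addr)
    {
        uint64_t ent = 0;
        off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * sizeof(ent);

        pread(pagemap_fd, &ent, sizeof(ent), off);
        return (ent >> 55) & 1;                 /* bit 55: soft-dirty */
    }

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int clear = open("/proc/self/clear_refs", O_WRONLY);
        int pm = open("/proc/self/pagemap", O_RDONLY);

        if (p == MAP_FAILED || clear < 0 || pm < 0)
            return 1;
        p[0] = 1;                               /* fault the page in */
        write(clear, "4", 1);                   /* "4": clear soft-dirty bits */
        printf("after clear: soft-dirty=%d\n", soft_dirty(pm, p));
        p[0] = 2;                               /* dirty it again */
        printf("after write: soft-dirty=%d\n", soft_dirty(pm, p));
        return 0;
    }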
     
  • walk_page_range() silently skips vmas with VM_PFNMAP set, which leads to
    undesirable behaviour for the caller. For example, in pagemap_read(),
    when no callbacks are invoked for a VM_PFNMAP vma, the pagemap data for
    the next virtual address range ends up at the wrong index. That can
    confuse and/or break userspace applications. (A short sketch of the
    pagemap index contract follows this entry.)

    This patch avoids this misbehavior for vma(VM_PFNMAP) as follows:
    - for pagemap_read(), which has its own ->pte_hole(), call ->pte_hole()
    over the vma(VM_PFNMAP),
    - for clear_refs and queue_pages, which have their own ->test_walk,
    just return 1 and skip the vma(VM_PFNMAP). This is not a problem because
    they are not interested in hole regions,
    - for other callers, just skip the vma(VM_PFNMAP) as the default
    behavior.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
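
    For reference, the indexing contract userspace relies on: the 64-bit
    pagemap entry describing a virtual address lives at file offset
    (vaddr / PAGE_SIZE) * 8, so if the walker silently dropped a VM_PFNMAP
    vma instead of reporting holes, entry i of a read would no longer
    describe start + i * PAGE_SIZE. A tiny illustrative helper (hypothetical,
    not from the patch):

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Offset of the pagemap entry describing 'vaddr': one u64 per page. */
    static off_t pagemap_offset(uintptr_t vaddr)
    {
        return (off_t)(vaddr / sysconf(_SC_PAGESIZE)) * 8;
    }

    int main(void)
    {
        uintptr_t start = 0x7f0000000000UL;     /* arbitrary example address */
        long psz = sysconf(_SC_PAGESIZE);

        for (int i = 0; i < 4; i++)
            printf("vaddr %#lx -> pagemap offset %lld\n",
                   (unsigned long)(start + i * psz),
                   (long long)pagemap_offset(start + i * psz));
        return 0;
    }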
     
  • queue_pages_range() currently does page table walking in its own way,
    which duplicates code. This patch applies the common page table walker
    to reduce lines of code. (A userspace sketch of the mbind() call that
    exercises this path follows this entry.)

    queue_pages_range() has to do some prechecks to determine whether we
    really walk over the vma or just skip it. We now have the test_walk()
    callback in mm_walk for this purpose, so we can do this replacement
    cleanly. queue_pages_test_walk() depends not only on the current vma but
    also on the previous one, so queue_pages->prev is introduced to remember
    it.

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
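
    For context, the userspace path that ends up in queue_pages_range() is
    mbind(2). A hedged sketch, assuming libnuma's <numaif.h> is installed
    (link with -lnuma) and that node 0 exists:

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL * 4096;
        unsigned long nodemask = 1UL << 0;      /* allow node 0 only */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* MPOL_MF_MOVE makes the kernel walk the range's page tables
         * (queue_pages_range) to collect misplaced pages for migration. */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE))
            perror("mbind");
        return 0;
    }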
     
  • We don't have to use mm_walk->private to pass vma to the callback function
    because of mm_walk->vma. And walk_page_vma() is useful if we walk over a
    single vma.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. Both mem_cgroup_count_precharge() and
    mem_cgroup_move_charge() also do their own for-each-vma loops, but that
    is now done in pagewalk.c, so let's clean them up.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. And show_numa_map() walks pages on a per-vma basis,
    so using walk_page_vma() is preferable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, which makes
    the code grep-friendly.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The page table walker has the current vma in mm_walk, so we no longer
    have to call find_vma() in each pagemap_(pte|hugetlb)_range() call.
    pagemap_pte_range() currently does the vma loop itself, so this patch
    removes many lines of code.

    The NULL-vma check is omitted because we assume that we never run these
    callbacks on any address outside a vma. Even if that assumption were
    broken, the NULL pointer dereference would be detected, so we would get
    enough information for debugging.

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • clear_refs_write() has some prechecks to determine if we really walk over
    a given vma. Now we have a test_walk() callback to filter vmas, so let's
    utilize it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. And show_smap() walks pages on a per-vma basis, so
    using walk_page_vma() is preferable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Introduce walk_page_vma(), which is useful for callers that want to walk
    over a given vma. It is used by later patches.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The current implementation of the page table walker has a fundamental
    problem in vma handling, which started when we tried to handle
    vma(VM_HUGETLB). Because that handling is done inside the pgd loop,
    taking vma boundaries into account makes the code complicated and
    bug-prone.

    From the user's viewpoint, each user checks some vma-related conditions
    to determine whether it really does the page walk over the vma.

    To solve both problems, this patch moves the vma check outside the pgd
    loop and introduces a new callback, ->test_walk(). (A schematic sketch
    of the resulting interface follows this entry.)

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
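
    To make the new shape concrete, a schematic sketch of a walker user
    after this series. The my_* names are hypothetical, and this is
    kernel-style pseudocode not taken from or compile-tested against any
    particular tree:

    #include <linux/mm.h>

    /* ->test_walk() is called once per vma: return 1 to skip the vma
     * (but continue the walk), 0 to walk it, negative to abort. */
    static int my_test_walk(unsigned long start, unsigned long end,
                            struct mm_walk *walk)
    {
        if (walk->vma->vm_flags & VM_PFNMAP)
            return 1;
        return 0;
    }

    static int my_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
    {
        /* walk->vma is filled in by the core walker now; no more
         * find_vma() or walk->private tricks inside the callback. */
        return 0;
    }

    static void walk_one_vma(struct vm_area_struct *vma)
    {
        struct mm_walk walk = {
            .test_walk = my_test_walk,
            .pte_entry = my_pte_entry,
            .mm        = vma->vm_mm,
        };

        walk_page_vma(vma, &walk);
    }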
     
  • Currently no user of the page table walker sets ->pgd_entry() or
    ->pud_entry(), so checking for them in each loop iteration just wastes
    CPU cycles. Let's remove them to reduce overhead.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Lockless access to pte in pagemap_pte_range() might race with page
    migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

    CPU A (pagemap)                 CPU B (migration)
                                    lock_page()
                                    try_to_unmap(page, TTU_MIGRATION...)
                                        make_migration_entry()
                                        set_pte_at()
    pte_to_pagemap_entry()
                                    remove_migration_ptes()
                                    unlock_page()
    if (is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

    Also, a lockless read might be non-atomic if the pte is larger than the
    word size. Other pte walkers (smaps, numa_maps, clear_refs) already lock
    the ptes.

    Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Andrey Ryabinin
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Use the more generic get_user_pages_unlocked, which has the additional
    benefit of passing FAULT_FLAG_ALLOW_RETRY on the very first page fault,
    allowing the first page fault in an unmapped area to block indefinitely
    because it is allowed to release the mmap_sem.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Kirill A. Shutemov
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
    the page fault in order to release the mmap_sem during the I/O.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This allows the get_user_pages_fast slow path to release the mmap_sem
    before blocking.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON to
    get a proper -EHWPOISON retval instead of -EFAULT, so they can take a
    more appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem held
    for reading, to reduce mmap_sem contention (for writers), for example
    while waiting for I/O completion. The problem is that right now
    practically no get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're
    not leveraging that nifty feature.

    Andres fixed it for the KVM page fault. However, get_user_pages_fast
    remains uncovered, and 99% of other get_user_pages callers aren't using
    it either (the only exception being FOLL_NOWAIT in KVM, which is really
    nonblocking and in fact doesn't even release the mmap_sem).

    So this patchset extends the optimization Andres did for the KVM page
    fault to the whole kernel. It makes the most important places (including
    gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce mmap_sem hold times
    during I/O.

    The few places that remain uncovered are drivers like v4l and other
    exceptions that tend to work on their own memory rather than on random
    user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
    fully covered by this patch).

    A follow-up patch should probably also add a printk_once warning to
    get_user_pages, which should go obsolete and be phased out eventually.
    The "vmas" parameter of get_user_pages makes it fundamentally
    incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes
    meaningless the moment the mmap_sem is released).

    While this is just an optimization, it becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

    userfaultfd allows blocking the page fault, and in order to do so I need
    to drop the mmap_sem first. So this patch also ensures that, for all
    memory where userfaultfd could be registered by KVM, the very first
    fault (no matter whether it is a regular page fault or a get_user_pages)
    always has FAULT_FLAG_ALLOW_RETRY set. Then userfaultfd blocks and is
    woken only when the pagetable is already mapped. The second fault
    attempt after the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's OK
    to retry without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

    The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

    Where tsk, mm, the intermediate "..." parameters and "pages" can be any
    value as before. Just the last parameter of get_user_pages (vmas) must be
    NULL for get_user_pages_locked|unlocked to be usable (the latter original
    form wouldn't have been safe anyway if vmas wasn't null, for the former we
    just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The previous commit ("mm/thp: Allocate transparent hugepages on local
    node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
    special policy for THP allocations. The function has the same interface
    as alloc_pages_vma(), shares a lot of boilerplate code and a long
    comment.

    This patch merges the hugepage special case into alloc_pages_vma. The
    extra if condition should be a cheap enough price to pay. We also prevent
    a (however unlikely) race with parallel mems_allowed update, which could
    make hugepage allocation restart only within the fallback call to
    alloc_hugepage_vma() and not reconsider the special rule in
    alloc_hugepage_vma().

    Also by making sure mpol_cond_put(pol) is always called before actual
    allocation attempt, we can use a single exit path within the function.

    Also update the comment for missing node parameter and obsolete reference
    to mm_sem.

    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This makes sure that we try to allocate hugepages from the local node if
    allowed by the mempolicy. If we can't, we fall back to small page
    allocation based on the mempolicy. This is based on the observation that
    allocating pages on the local node is more beneficial than allocating
    hugepages on a remote node.

    With this patch applied we may see transparent huge page allocation
    failures if the current node doesn't have enough free hugepages. Before
    this patch such failures resulted in retrying the allocation on other
    nodes in the NUMA node mask. (A userspace sketch for observing THP usage
    follows this entry.)

    [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
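
    One hedged way to observe the effect from userspace is to fault in an
    MADV_HUGEPAGE region and look at the AnonHugePages lines in
    /proc/self/smaps: non-zero values mean THPs were actually allocated.
    Sketch only, assuming THP is enabled and a 2 MiB huge page size:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (8UL << 20)                     /* 8 MiB */

    int main(void)
    {
        char line[256];
        FILE *smaps;
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        madvise(p, LEN, MADV_HUGEPAGE);
        memset(p, 1, LEN);                      /* fault the region in */

        smaps = fopen("/proc/self/smaps", "r");
        while (smaps && fgets(line, sizeof(line), smaps))
            if (strstr(line, "AnonHugePages:"))
                fputs(line, stdout);            /* non-zero => THPs in use */
        if (smaps)
            fclose(smaps);
        return 0;
    }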
     
  • The compaction deferring logic is a heavy hammer that blocks the way to
    compaction. It doesn't consider overall system state, so it can falsely
    prevent a user from doing compaction. In other words, even if the system
    has a suitable range of memory to compact, compaction can be skipped due
    to the deferring logic. This patch adds a new tracepoint to understand
    how the deferring logic works. It will also help to check compaction
    success and failure. (A sketch for exercising compaction follows this
    entry.)

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
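
    A blunt way to exercise compaction while the new tracepoints (or the
    compact_* counters in /proc/vmstat) are being watched is to poke
    /proc/sys/vm/compact_memory. A hedged sketch, needing root and
    CONFIG_COMPACTION:

    #include <stdio.h>
    #include <string.h>

    static void dump_compact_stats(void)
    {
        char line[128];
        FILE *f = fopen("/proc/vmstat", "r");

        while (f && fgets(line, sizeof(line), f))
            if (!strncmp(line, "compact_", 8))
                fputs(line, stdout);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        FILE *trigger = fopen("/proc/sys/vm/compact_memory", "w");

        dump_compact_stats();
        if (trigger) {
            fputs("1\n", trigger);              /* force compaction of all zones */
            fclose(trigger);
        }
        dump_compact_stats();
        return 0;
    }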
     
  • It is not well understood when and why compaction starts and finishes,
    or fails to. With these new tracepoints, we can learn much more about
    the start/finish reasons of compaction. I found the following bug with
    these tracepoints:

    http://www.spinics.net/lists/linux-mm/msg81582.html

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It'd be useful to know the current range where compaction works, for
    detailed analysis. With it, we can know which pageblock we actually scan
    and isolate from, how many pages we try in that pageblock, and roughly
    guess why it doesn't become a free page of pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We now have a tracepoint for the compaction begin event, and it prints
    the start position of both scanners, but the tracepoint for the
    compaction end event doesn't print the finish position of both scanners.
    It'd also be useful to know the finish position of both scanners, so
    this patch adds it. It will help to find odd behavior or problems in the
    compaction internal logic.

    The mode is also added to both begin/end tracepoint output, since
    compaction behavior differs greatly depending on the mode.

    Lastly, the status format is changed from a status number to a string
    for readability.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To check the range where compaction is working, the tracepoints print
    the start/end pfn of the zone and the start pfn of both scanners in
    decimal format. Since we manage all pages in power-of-2 units, which are
    better represented in hexadecimal, this patch changes the tracepoint
    format from decimal to hexadecimal. This improves readability: for
    example, it makes it easy to notice whether the current scanner is
    trying to compact a previously attempted pageblock.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The helper account_page_redirty() fixes the dirty-page counters for
    redirtied pages. This patch puts it after dirtying and prevents
    temporary underflows of the dirtied-page counters on the zone/bdi and in
    current->nr_dirtied.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
    happens *before* pgd_free(). If an arch allocates pte/pmd tables in
    pgd_alloc() and frees them in pgd_free(), we see an offset in the
    counters by the time of the checks.

    We tried to work around this by offsetting the expected counter value
    according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
    exit_mmap(). But that doesn't work in some cases:

    1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
    the upper addresses are occupied by huge pmd entries, so the trick of
    offsetting the expected counter value gets really ugly: we would have
    to apply it to nr_pmds, but not to nr_ptes.

    2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page
    table allocation in pgd_alloc(); it just sets up a pgd entry which is
    allocated at boot and shared across all processes.

    The proposal is to move the check to check_mm(), which happens *after*
    pgd_free(), and to do proper accounting during pgd_alloc() and
    pgd_free(), which brings the counters to zero if nothing leaked.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tyler Baker
    Tested-by: Tyler Baker
    Tested-by: Nishanth Menon
    Cc: Russell King
    Cc: James Hogan
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. The oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
        char *addr = NULL;
        unsigned long i;

        prctl(PR_SET_THP_DISABLE);
        for (i = 0; i < NR_PUD ; i++) {
            addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                        MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED) {
                perror("mmap");
                break;
            }
            *addr = 'x';
            munmap(addr, PMD_SIZE);
            mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                 MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
            if (addr == MAP_FAILED)
                perror("re-mmap"), exit(1);
        }
        printf("PID %d consumed %lu KiB in PMD page tables\n",
               getpid(), i * 4096 >> 10);
        return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes, we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • ARM uses a custom implementation of PMD folding in the 2-level page
    table case. Generic code expects __PAGETABLE_PMD_FOLDED to be defined if
    the PMD is folded, but ARM doesn't do this. Let's fix it.

    Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc(). It
    also fixes problems with recently-introduced pmd accounting on ARM without
    LPAE.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Nishanth Menon
    Reported-by: Simon Horman
    Tested-by: Simon Horman
    Tested-by: Fabio Estevam
    Tested-by: Felipe Balbi
    Tested-by: Nishanth Menon
    Tested-by: Peter Ujfalusi
    Tested-by: Krzysztof Kozlowski
    Tested-by: Geert Uytterhoeven
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If an architecture uses <asm-generic/4level-fixup.h>, the build fails if
    we try to use PUD_SHIFT in generic code:

    In file included from arch/microblaze/include/asm/bug.h:1:0,
    from include/linux/bug.h:4,
    from include/linux/thread_info.h:11,
    from include/asm-generic/preempt.h:4,
    from arch/microblaze/include/generated/asm/preempt.h:1,
    from include/linux/preempt.h:18,
    from include/linux/spinlock.h:50,
    from include/linux/mmzone.h:7,
    from include/linux/gfp.h:5,
    from include/linux/slab.h:14,
    from mm/mmap.c:12:
    mm/mmap.c: In function 'exit_mmap':
    >> mm/mmap.c:2858:46: error: 'PUD_SHIFT' undeclared (first use in this function)
    round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
    ^
    include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
    int __ret_warn_on = !!(condition); \
    ^
    mm/mmap.c:2858:46: note: each undeclared identifier is reported only once for each function it appears in
    round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
    ^
    include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
    int __ret_warn_on = !!(condition); \
    ^
    As with <asm-generic/pgtable-nopud.h>, let's define PUD_SHIFT to
    PGDIR_SHIFT.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • LKP has triggered a compiler warning after my recent patch "mm: account
    pmd page tables to the process":

    mm/mmap.c: In function 'exit_mmap':
    >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]

    The code:

    > 2857 WARN_ON(mm_nr_pmds(mm) >
    2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);

    Here, on tile, FIRST_USER_ADDRESS is defined as 0, a plain int, so
    round_up() also yields an int, and right-shifting it by PUD_SHIFT
    (which on tile is at least the width of int) triggers the warning. (A
    tiny snippet reproducing the warning class follows this entry.)

    I think the best way to fix it is to define FIRST_USER_ADDRESS as
    unsigned long, on every arch for consistency.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
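
    For illustration, a minimal stand-alone snippet (hypothetical constants,
    not from the patch) that triggers the same class of warning with gcc,
    mirroring an int FIRST_USER_ADDRESS shifted by a PUD_SHIFT of at least
    the width of int:

    #define FIRST_USER_ADDRESS 0    /* plain int, as on tile before the fix */
    #define PUD_SHIFT 39

    unsigned long pmd_limit(void)
    {
        /* gcc: warning: right shift count >= width of type */
        return FIRST_USER_ADDRESS >> PUD_SHIFT;
    }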
     
  • Microblaze uses a custom implementation of PMD folding, but doesn't
    define __PAGETABLE_PMD_FOLDED, which generic code expects to see. Let's
    fix it.

    Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc(). It
    also fixes problems with recently-introduced pmd accounting.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The swap controller code is scattered all over the file. Gather all
    the code that isn't directly needed by the memory controller at the
    end of the file in its own CONFIG_MEMCG_SWAP section.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The initialization code for the per-cpu charge stock and the soft
    limit tree is compact enough to inline it into mem_cgroup_init().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • - No need to test the node for N_MEMORY. node_online() is enough for
    node fallback to work in slab, use NUMA_NO_NODE for everything else.

    - Remove the BUG_ON() for allocation failure. A NULL pointer crash is
    just as descriptive, and the absent return value check is obvious.

    - Move local variables to the inner-most blocks.

    - Point to the tree structure after it's initialized, not before; it's
    just more logical that way.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Guenter Roeck
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The totalcma_pages variable is not updated to account for CMA regions
    defined via device tree reserved-memory sub-nodes. Fix this omission by
    moving the calculation of totalcma_pages into cma_init_reserved_mem()
    instead of cma_declare_contiguous() such that it will include reserved
    memory used by all CMA regions.

    Signed-off-by: George G. Davis
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: Laurent Pinchart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George G. Davis
     
  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") has left a race window when OOM killer manages to
    note_oom_kill after freeze_processes checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy with this partial solution, though, and insisted on a
    full one. That requires full exclusion between the OOM killer and the
    freezer's task freezing. This is done by this patch, which introduces an
    oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier.

    The oom_killer_disabled check is moved from the allocation path to the
    OOM level, and we take oom_sem for reading for both the check and the
    whole OOM invocation.

    oom_killer_disable() takes oom_sem for writing, so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim.
    The last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.

    The page fault path is now covered as well, although it was assumed to
    be safe before. As per Tejun, "We used to have freezing points deep in
    file system code which may be reacheable from page fault.", so it would
    be better and more robust not to rely on freezing points here. The same
    applies to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation the same as before. The page fault path
    will retry the fault (more on that later) and the SysRq OOM trigger will
    simply complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • While touching this area, let's convert printk to pr_*. This also makes
    continuation lines print properly.

    Signed-off-by: Michal Hocko
    Acked-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko