06 Nov, 2015

40 commits

  • We have had trouble in the past from the way in which page migration's
    newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
    fix direct reclaim writeback regression") which proposed a cleanup.

    We have no actual problem now, but I think the procedure would be clearer
    (and alternative get_new_page pools safer to implement) if we assert that
    newpage is not touched until we are sure that it's going to be used -
    except for taking the trylock on it in __unmap_and_move().

    So shift the early initializations from move_to_new_page() into
    migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
    migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
    deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

    Adjust stages 3 to 8 in the Documentation file accordingly.
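
    The early initializations being deferred are the familiar trio below - a
    minimal sketch of what now runs in migrate_page_move_mapping() only once
    we are sure the newpage will be used (identifiers as in mm/migrate.c of
    that era):

        /* Formerly done up front in move_to_new_page(); now deferred */
        newpage->index = page->index;
        newpage->mapping = page->mapping;
        if (PageSwapBacked(page))
                SetPageSwapBacked(newpage);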

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hitherto page migration has avoided using a migration entry for a
    swapcache page mapped into userspace, apparently for historical reasons.
    So any page blessed with swapcache would entail a minor fault when it's
    next touched, which page migration otherwise tries to avoid. Swapcache in
    an mlocked area is rare, so won't often matter, but still better fixed.

    Just rearrange the block in try_to_unmap_one(), to handle TTU_MIGRATION
    before checking PageAnon, that's all (apart from some reindenting).

    Well, no, that's not quite all: doesn't this by the way fix a soft_dirty
    bug, that page migration of a file page was forgetting to transfer the
    soft_dirty bit? Probably not a serious bug: if I understand correctly,
    soft_dirty aficionados usually have to handle file pages separately
    anyway; but we publish the bit in /proc/<pid>/pagemap on file mappings as
    well as anonymous, so page migration ought not to perturb it.
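
    A minimal sketch of the rearranged block in try_to_unmap_one(), handling
    TTU_MIGRATION before the PageAnon check so that file pages get migration
    entries too, and carrying the soft_dirty bit across on the way (heavily
    simplified; the swap path and error handling are omitted):

        if (flags & TTU_MIGRATION) {
                swp_entry_t entry;
                pte_t swp_pte;

                /* Migration entry for anon and file pages alike */
                entry = make_migration_entry(page, pte_write(pteval));
                swp_pte = swp_entry_to_pte(entry);
                if (pte_soft_dirty(pteval))
                        swp_pte = pte_swp_mksoft_dirty(swp_pte);
                set_pte_at(mm, address, pte, swp_pte);
        } else if (PageAnon(page)) {
                /* ... ordinary swap entry handling for anon pages ... */
        } else {
                /* ... plain file-page unmap ... */
        }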

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __unmap_and_move() contains a long stale comment on page_get_anon_vma()
    and PageSwapCache(), with an odd control flow that's hard to follow.
    Mostly this reflects our confusion about the lifetime of an anon_vma, in
    the early days of page migration, before we could take a reference to one.
    Nowadays this seems quite straightforward: cut it all down to essentials.

    I cannot see the relevance of swapcache here at all, so don't treat it any
    differently: I believe the old comment reflects in part our anon_vma
    confusions, and in part the original v2.6.16 page migration technique,
    which used actual swap to migrate anon instead of swap-like migration
    entries. Why should a swapcache page not be migrated with the aid of
    migration entry ptes like everything else? So lose that comment now, and
    enable migration entries for swapcache in the next patch.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little more by calling remove_migration_ptes()
    from the same level, on success or on failure, from __unmap_and_move() or
    from unmap_and_move_huge_page().

    Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
    leave that to when the page is freed. Except for here in page migration,
    it has been an invariant that a PageAnon (bit set in page->mapping) page
    stays PageAnon until it is freed, and I think we're safer to keep to that.

    And with the above rearrangement, it's necessary because zap_pte_range()
    wants to identify whether a migration entry represents a file or an anon
    page, to update the appropriate rss stats without waiting on it.
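
    For illustration, roughly the check zap_pte_range() relies on (a sketch of
    its migration-entry branch; PageAnon(page) only works here because the old
    page keeps the anon bit in page->mapping until it is freed):

        entry = pte_to_swp_entry(ptent);
        if (!non_swap_entry(entry)) {
                rss[MM_SWAPENTS]--;
        } else if (is_migration_entry(entry)) {
                struct page *page = migration_entry_to_page(entry);

                /* No need to wait for migration to finish */
                if (PageAnon(page))
                        rss[MM_ANONPAGES]--;
                else
                        rss[MM_FILEPAGES]--;
        }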

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little by moving the trylock of newpage from
    move_to_new_page() into __unmap_and_move(), where the old page has been
    locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
    accordingly.

    But make one kind-of-functional change on the way: whereas trylock of
    newpage used to BUG() if it failed, now simply return -EAGAIN if so.
    Cutting out BUG()s is good, right? But, to be honest, this is really to
    extend the usefulness of the custom put_new_page feature, allowing a pool
    of new pages to be shared perhaps with racing uses.

    Use an "else" instead of that "skip_unmap" label.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I don't know of any problem from the way it's used in our current tree,
    but there is one defect in page migration's custom put_new_page feature.

    An unused newpage is expected to be released with the put_new_page(), but
    there was one MIGRATEPAGE_SUCCESS (0) path which released it with
    putback_lru_page(): which can be very wrong for a custom pool.

    Fixed more easily by resetting put_new_page once it won't be needed, than
    by adding a further flag to modify the rc test.
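
    In unmap_and_move() that amounts to something like the following sketch
    (simplified from mm/migrate.c; the elided code is the putback of the old
    page):

        rc = __unmap_and_move(page, newpage, force, mode);
        if (rc == MIGRATEPAGE_SUCCESS)
                put_new_page = NULL;    /* newpage now in use: not the pool's */

        /* ... putback of the old page ... */

        /* Only an unused newpage may be handed back to a custom pool */
        if (put_new_page)
                put_new_page(newpage, private);
        else
                putback_lru_page(newpage);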

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's migrate.c not migration.c, and nowadays putback_movable_pages() not
    putback_lru_pages().

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
    mem_cgroup_migrate() doesn't have much to offer in page migration: convert
    migrate_misplaced_transhuge_page() to set_page_memcg() instead.

    Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
    remaining callers are replace_page_cache_page() and shmem_replace_page():
    both of whom passed lrucare true, so just eliminate that argument.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
    in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
    unmaps the page from userspace before migrating, and that commit clears
    PageMlocked on the final unmap, leaving mlock_migrate_page() with
    nothing to do. Not a serious bug, the next attempt at reclaiming the
    page would fix it up; but a betrayal of page migration's intent - the
    new page ought to emerge as PageMlocked.

    I don't see how to fix it for mlock_migrate_page() itself; but easily
    fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
    is VM_LOCKED - under pte lock as in try_to_unmap_one().

    Delete mlock_migrate_page()? Not quite, it does still serve a purpose for
    migrate_misplaced_transhuge_page(): where we could replace it by a test,
    clear_page_mlock(), mlock_vma_page() sequence; but would that be an
    improvement? mlock_migrate_page() is fairly lean, and let's make it
    leaner by skipping the irq save/restore now clearly not needed.
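
    The fix in remove_migration_pte() is essentially one extra step while the
    pte lock is held, along these lines (a simplified sketch; the huge-page
    case is omitted):

        set_pte_at(mm, addr, ptep, pte);

        if (PageAnon(new))
                page_add_anon_rmap(new, vma, addr);
        else
                page_add_file_rmap(new);

        if (vma->vm_flags & VM_LOCKED)
                mlock_vma_page(new);    /* new page emerges PageMlocked */

        /* No need to invalidate - it was non-present before */
        update_mmu_cache(vma, addr, ptep);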

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
    mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
    found mapped in a VM_LOCKED vma) is ineffective against races with
    exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
    tearing down an mm.

    But that's okay, those races are benign; and although we've believed for
    years in that ugly down_read_trylock(), it's unsuitable for the job, and
    frustrates the good intention of setting PageMlocked when it fails.

    It just doesn't matter if here we read vm_flags an instant before or after
    a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
    syscalls (or exit) work their way up the address space (taking pt locks
    after updating vm_flags) to establish the final state.

    We do still need to be careful never to mark a page Mlocked (hence
    unevictable) by any race that will not be corrected shortly after. The
    page lock protects from many of the races, but not all (a page is not
    necessarily locked when it's unmapped). But the pte lock we just dropped
    is good to cover the rest (and serializes even with
    munlock_vma_pages_all(), so no special barriers required): now hold on to
    the pte lock while calling mlock_vma_page(). Is that lock ordering safe?
    Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
    calls the complementary clear_page_mlock().

    This fixes the following case (though not a case which anyone has
    complained of), which mmap_sem did not: truncation's preliminary
    unmap_mapping_range() is supposed to remove even the anonymous COWs of
    filecache pages, and that might race with try_to_unmap_one() on a
    VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
    zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
    freed. The pte lock protects against this.

    You could say that it also protects against the more ordinary case, racing
    with the preliminary unmapping of a filecache page itself: but in our
    current tree, that's independently protected by i_mmap_rwsem; and that
    race would be why "Bad page state (mlocked)" was seen before commit
    48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").

    Vlastimil Babka points out another race which this patch protects against.
    try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
    moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
    leaving PageMlocked and unevictable when it should be evictable. mmap_sem
    is ineffective because exit_mmap() does not hold it; page lock ineffective
    because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
    is effective because __munlock_pagevec_fill() takes it to get the page,
    after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.

    Kirill Shutemov points out that if the compiler chooses to implement a
    "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
    with an intermediate store of unrelated bits set, since I'm here foregoing
    its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
    a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does
    not appear to be an immediate problem, but we may want to define vm_flags
    accessors in future, to guard against such a possibility.

    While we're here, make a related optimization in try_to_unmap_one(): if
    it's doing TTU_MUNLOCK, then there's no point at all in descending the
    page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes,
    that can change racily, but it can change racily even without the
    optimization: it's not critical. Far better not to waste time here.
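
    Roughly, the two changes to try_to_unmap_one() look like this sketch (the
    surrounding code is elided; labels and locals as in mm/rmap.c):

        /* Munlock has nothing to gain from examining un-mlocked vmas */
        if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
                goto out;

        pte = page_check_address(page, mm, address, &ptl, 0);
        if (!pte)
                goto out;

        /* ... and later, still under that pte lock, no mmap_sem needed ... */
        if (!(flags & TTU_IGNORE_MLOCK)) {
                if (vma->vm_flags & VM_LOCKED) {
                        mlock_vma_page(page);
                        ret = SWAP_MLOCK;
                        goto out_unmap;
                }
                if (flags & TTU_MUNLOCK)
                        goto out_unmap;
        }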

    Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
    on this occasion, but that's probably the sensible next step - with a
    rename, given that try_to_munlock()'s business is to try to set Mlocked.

    Updated the unevictable-lru Documentation, to remove its reference to mmap
    semaphore, but found a few more updates needed in just that area.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While updating some mm Documentation, I came across a few straggling
    references to the non-linear vmas which were happily removed in v4.0.
    Delete them.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
    shouldn't increment NR_PAGETABLE.

    Since small allocations, such as ptlock, normally do not fail (currently
    they can fail if kmemcg is used though), this patch does not really fix
    anything and should be considered as a code cleanup.
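
    The ordering change amounts to something like this in pgtable_page_ctor()
    (a sketch of the include/linux/mm.h helper):

        static inline bool pgtable_page_ctor(struct page *page)
        {
                if (!ptlock_init(page))
                        return false;   /* don't account a table we failed to set up */
                inc_zone_page_state(page, NR_PAGETABLE);
                return true;
        }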

    Signed-off-by: Vladimir Davydov
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Don't build clear_soft_dirty_pmd() if transparent huge pages are not
    enabled.
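
    In fs/proc/task_mmu.c that is just a matter of scoping the helper, roughly
    (sketch):

        #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
        static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
                        unsigned long addr, pmd_t *pmdp)
        {
                /* ... clear soft-dirty and write bits on the pmd ... */
        }
        #else
        static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
                        unsigned long addr, pmd_t *pmdp)
        {
        }
        #endif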

    Signed-off-by: Laurent Dufour
    Reviewed-by: Aneesh Kumar K.V
    Cc: Pavel Emelyanov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • As mentioned in the commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa()
    for updating _PAGE_NUMA bit"), architectures like ppc64 don't do tlb
    flush in set_pte/pmd functions.

    So when dealing with existing pte in clear_soft_dirty, the pte must be
    cleared before being modified.
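
    One way to express that in clear_soft_dirty() is via the generic
    ptep_modify_prot_start/commit pair, which clears the pte while it is being
    modified - a sketch, not necessarily the exact patch:

        pte_t ptent = *pte;

        if (pte_present(ptent)) {
                /* pte is cleared here, before we modify and re-install it */
                ptent = ptep_modify_prot_start(vma->vm_mm, addr, pte);
                ptent = pte_wrprotect(ptent);
                ptent = pte_clear_soft_dirty(ptent);
                ptep_modify_prot_commit(vma->vm_mm, addr, pte, ptent);
        } else if (is_swap_pte(ptent)) {
                ptent = pte_swp_clear_soft_dirty(ptent);
                set_pte_at(vma->vm_mm, addr, pte, ptent);
        }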

    Signed-off-by: Laurent Dufour
    Reviewed-by: Aneesh Kumar K.V
    Cc: Pavel Emelyanov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • get_mergeable_page() can only return NULL (also in case of errors) or the
    pinned mergeable page; it never returns an error pointer distinct from
    NULL. This optimizes away the unnecessary error check.

    Add a return after the "out:" label in the callee to make it more
    readable.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Doing the VM_MERGEABLE check after the page == kpage check won't provide
    any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is
    the only superfluous bit in using find_mergeable_vma because the !PageAnon
    check of try_to_merge_one_page() implicitly checks for that, but it still
    looks cleaner to share the same find_mergeable_vma().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This just uses the helper function to cleanup the assumption on the
    hlist_node internals.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The stable_nodes can become stale at any time if the underlying pages get
    freed. The stable_node gets collected and removed from the stable rbtree
    if that is detected during the rbtree lookups.

    Don't fail the lookup if running into stale stable_nodes, just restart the
    lookup after collecting the stale stable_nodes. Otherwise the CPU spent
    in the preparation stage is wasted and the lookup must be repeated at the
    next loop potentially failing a second time in a second stale stable_node.

    If we don't prune aggressively we delay the merging of the unstable node
    candidates and at the same time we delay the freeing of the stale
    stable_nodes. Keeping stale stable_nodes around wastes memory and it
    can't provide any benefit.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • While at it add it to the file and anon walks too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Before the previous patch ("memcg: unify slab and other kmem pages
    charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
    pages and pages allocated with alloc_kmem_pages - since only the latter
    kept its memcg in the page struct. Now we can unify it, and since the
    function becomes tiny after that, we can fold it into
    mem_cgroup_from_kmem.
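
    After the fold, the helper living in mm/list_lru.c is tiny - something
    like this sketch:

        static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
        {
                struct page *page;

                if (!memcg_kmem_enabled())
                        return NULL;
                page = virt_to_head_page(ptr);
                return page->mem_cgroup;    /* set for slab and kmem pages alike */
        }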

    [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Charging kmem pages proceeds in two steps. First, we try to charge the
    allocation size to the memcg the current task belongs to, then we allocate
    a page and "commit" the charge storing the pointer to the memcg in the
    page struct.

    Such a design looks overcomplicated, because there is not much sense in
    trying to charge the allocation before actually allocating a page: we
    won't be able to consume much memory over the limit even if we charge
    after doing the actual allocation; besides, we already charge user pages
    post factum, so being pedantic with kmem pages just looks pointless.

    So this patch simplifies the design by merging the "charge" and the
    "commit" steps into the same function, which takes the allocated page.

    Also, rename the charge and uncharge methods to memcg_kmem_charge and
    memcg_kmem_uncharge and make the charge method return error code instead
    of bool to conform to mem_cgroup_try_charge.
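
    With charge and commit merged, the call site reads naturally - roughly the
    alloc_kmem_pages() pattern (sketch):

        struct page *page;

        page = alloc_pages(gfp_mask, order);
        if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
                __free_pages(page, order);  /* charge failed: back out the allocation */
                page = NULL;
        }
        return page;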

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If kernelcore was not specified, or the kernelcore size is zero
    (required_movablecore >= totalpages), or the kernelcore size is larger
    than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
    both kernel memory and movable memory.
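
    In find_zone_movable_pfns_for_nodes() that is a simple bail-out, along
    these lines (sketch):

        /* Kernelcore empty, or covering everything: no ZONE_MOVABLE */
        if (!required_kernelcore || required_kernelcore >= totalpages)
                goto out;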

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • This function is called in very hot paths and merely does a few loads for
    a validity check. Let's inline it, so that we can save the function call
    overhead.

    (akpm: this is cosmetic - the compiler already inlines vmacache_valid_mm())
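
    The function in question is tiny, which is why inlining it is cheap - a
    sketch of mm/vmacache.c's validity check:

        static inline bool vmacache_valid_mm(struct mm_struct *mm)
        {
                return current->mm == mm && !(current->flags & PF_KTHREAD);
        }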

    Signed-off-by: Davidlohr Bueso
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Change HIGHMEM_ZONE to be the same as the DMA_ZONE macro.

    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Srinivas Kandagatla reported bad page messages when trying to remove the
    bottom 2MB on an ARM based IFC6410 board

    BUG: Bad page state in process swapper pfn:fffa8
    page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x200041(locked|active|mlocked)
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
    Hardware name: Qualcomm (Flattened Device Tree)
    unwind_backtrace
    show_stack
    dump_stack
    bad_page
    free_pages_prepare
    free_hot_cold_page
    __free_pages
    free_highmem_page
    mem_init
    start_kernel
    Disabling lock debugging due to kernel taint

    Removing the lower 2MB caused the start of the lowmem zone to no longer be
    page block aligned. IFC6410 uses CONFIG_FLATMEM, where alloc_node_mem_map
    allocates memory for the mem_map. alloc_node_mem_map will offset for
    unaligned nodes with the assumption the pfn/page translation functions
    will account for the offset. The functions for CONFIG_FLATMEM do not
    offset however, resulting in overrunning the memmap array. Just use the
    allocated memmap without any offset when running with CONFIG_FLATMEM to
    avoid the overrun.

    Signed-off-by: Laura Abbott
    Reported-by: Srinivas Kandagatla
    Tested-by: Srinivas Kandagatla
    Acked-by: Vlastimil Babka
    Tested-by: Bjorn Andersson
    Cc: Santosh Shilimkar
    Cc: Russell King
    Cc: Kevin Hilman
    Cc: Arnd Bergman
    Cc: Stephen Boyd
    Cc: Andy Gross
    Cc: Mel Gorman
    Cc: Steven Rostedt
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
    (4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
    stack. Uninlining node_page_state() reduces this to 440 bytes.

    The stack consumption issue is fixed by newer gcc (4.8.4) however with
    that compiler this patch reduces the node.o text size from 7314 bytes to
    4578.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Make the argument order of __install_special_mapping() match its caller's,
    so the caller can pass its register arguments straight through to the
    callee untouched.

    For most architectures, the arguments (at least the first five) are passed
    in registers, so this change will have an effect on most architectures.

    With -O2, __install_special_mapping() may be inlined on most
    architectures, but with -Os it should not be. So this change can give a
    little better performance, at least for -Os.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • (1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
    care less and the change is ok.

    (2) ppc and mips, which define HAVE_ARCH_BUG_ON, do not rely on branch
    prediction hints, as that seems to be pointless[1], and thus callers
    should not be trying to push such an optimization in the first place.

    (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an
    unlikely compiler flag already.

    Hence, we can drop the unlikely() used with BUG_ON().

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html
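
    For reference, the generic definition in include/asm-generic/bug.h already
    hedges the condition with unlikely(), so an extra unlikely() at the call
    site buys nothing:

        #ifndef HAVE_ARCH_BUG_ON
        #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
        #endif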

    Signed-off-by: Geliang Tang
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • When fget() fails we can return -EBADF directly.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • It is still a little better to remove it, although it should be optimized
    away by "-O2" anyway.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
    kmap_atomic() generated code seemed needlessly bloated due to the way the
    PageHighMem() macro is implemented. It derives the exact zone for the page
    and then does pointer subtraction with the first zone to infer the
    zone_type. The pointer arithmetic in turn generates the code bloat.

    PageHighMem(page)
      is_highmem(page_zone(page))
        zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones

    Instead, use is_highmem_idx() to work on the zone_type already available
    in the page flags.
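
    Conceptually the change is just the following (a sketch of the
    page-flags.h macro; the Before/After disassembly below shows the effect on
    code generation):

        /* before: derive the zone pointer, then pointer-subtract node_zones */
        #define PageHighMem(__p) is_highmem(page_zone(__p))

        /* after: the zone index is already encoded in page->flags */
        #define PageHighMem(__p) is_highmem_idx(page_zonenum(__p))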

    ----- Before -----
    80756348: mov_s r13,r0
    8075634a: ld_s r2,[r13,0]
    8075634c: lsr_s r2,r2,30
    8075634e: mpy r2,r2,0x2a4
    80756352: add_s r2,r2,0x80aef880
    80756358: ld_s r3,[r2,28]
    8075635a: sub_s r2,r2,r3
    8075635c: breq r2,0x2a4,80756378
    80756364: breq r2,0x548,80756378

    ----- After -----
    80756330: mov_s r13,r0
    80756332: ld_s r2,[r13,0]
    80756334: lsr_s r2,r2,30
    80756336: sub_s r2,r2,1
    80756338: brlo r2,2,80756348

    For x86 defconfig build (32 bit only) it saves around 900 bytes.
    For ARC defconfig with HIGHMEM, it saved around 2K bytes.

    ---->8-------
    ./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
    add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
    function old new delta
    saveable_page 162 154 -8
    saveable_highmem_page 154 146 -8
    skb_gro_reset_offset 147 131 -16
    ...
    ...
    __change_page_attr_set_clr 1715 1678 -37
    setup_data_read 434 394 -40
    mon_bin_event 1967 1927 -40
    swsusp_save 1148 1105 -43
    _set_pages_array 549 493 -56
    ---->8-------

    The Before/After disassembly above is from ARC kmap(), for example.

    Signed-off-by: Vineet Gupta
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Jennifer Herbert
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
    wrong. task->mm can be NULL if the task is the exited group leader. This
    means in particular that "kill sharing same memory" loop can miss a
    process with a zombie leader which uses the same ->mm.

    Note: the process_has_mm(child, p->mm) check is still not 100% correct,
    p->mm can be NULL too. This is minor, but probably deserves a fix or a
    comment anyway.
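
    The helper that replaces the raw ->mm comparisons walks the thread group
    instead - roughly this sketch (see also the akpm note below about
    documenting it):

        static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
        {
                struct task_struct *t;

                /*
                 * Any live thread's ->mm counts, so a zombie group leader
                 * with a NULL ->mm no longer hides the process.
                 */
                for_each_thread(p, t) {
                        struct mm_struct *t_mm = READ_ONCE(t->mm);
                        if (t_mm)
                                return t_mm == mm;
                }
                return false;
        }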

    [akpm@linux-foundation.org: document process_shares_mm() a bit]
    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Purely cosmetic, but the complex "if" condition looks annoying to me,
    especially because it is not consistent with the OOM_SCORE_ADJ_MIN check,
    which adds another if/continue.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The fatal_signal_pending() check was added to suppress the unnecessary
    "sharing same memory" message, but it can't help 100% anyway because it
    can be false-negative; the SIGKILL can already have been dequeued.

    And worse, it can be false-positive due to exec or coredump. exec is
    mostly fine, but coredump is not. It is possible that the group leader
    has the pending SIGKILL because its sub-thread originated the coredump, in
    this case we must not skip this process.

    We could probably add the additional ->group_exit_task check but this
    patch just removes the wrong check along with pr_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, but expand_upwards() and expand_downwards() overuse vma->vm_mm;
    a local variable makes sense, imho.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
    not safe; multiple threads using the same ->mm can do this at the same
    time trying to expans different vma's under down_read(mmap_sem). This
    means that one of the "locked_vm += grow" changes can be lost and we can
    miss munlock_vma_pages_all() later.

    Move this code into the caller(s) under mm->page_table_lock. All other
    updates to ->locked_vm hold mmap_sem for writing.
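
    In expand_upwards() the accounting then sits inside the existing
    page_table_lock section, along these lines (a simplified sketch;
    vm_stat_account() took four arguments in this era, and expand_downwards()
    is symmetric):

        spin_lock(&mm->page_table_lock);
        if (vma->vm_flags & VM_LOCKED)
                mm->locked_vm += grow;  /* serialized by page_table_lock now */
        vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
        anon_vma_interval_tree_pre_update_vma(vma);
        vma->vm_end = address;
        anon_vma_interval_tree_post_update_vma(vma);
        spin_unlock(&mm->page_table_lock);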

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the user set "movablecore=xx" to a large number, corepages will
    overflow. Fix the problem.
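
    The wrap comes from an unsigned subtraction in
    find_zone_movable_pfns_for_nodes(); clamping movablecore first avoids it -
    roughly this sketch:

        if (required_movablecore) {
                unsigned long corepages;

                required_movablecore =
                        roundup(required_movablecore, MAX_ORDER_NR_PAGES);
                required_movablecore = min(totalpages, required_movablecore);
                corepages = totalpages - required_movablecore;  /* no wrap now */

                required_kernelcore = max(required_kernelcore, corepages);
        }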

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Tang Chen
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • In zone_reclaimable_pages(), `nr' is returned by a function which is
    declared as returning "unsigned long", so declare it such. Negative
    values are meaningless here.

    In zone_pagecache_reclaimable() we should also declare `delta' and
    `nr_pagecache_reclaimable' as being unsigned longs because they're used to
    store the values returned by zone_page_state() and
    zone_unmapped_file_pages(), which also return unsigned longs.

    [akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long]
    Signed-off-by: Alexandru Moise
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NULL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes