16 Mar, 2016

40 commits

  • Use the pr_*() print calls in the AUTOFS_*() macros instead of printks
    and include the module name in the log message macros. Also use the
    AUTOFS_*() macros everywhere instead of raw printks.
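
    A minimal sketch of what module-prefixed log macros built on pr_*() can
    look like (the macro names and message format here are illustrative,
    not the patch's exact definitions; pr_fmt() must be defined before
    linux/printk.h is included):

        /* Illustrative only: prefix every message with the module name. */
        #define pr_fmt(fmt) KBUILD_MODNAME ":%s: " fmt, __func__

        #define AUTOFS_WARN(fmt, ...)  pr_warn(fmt "\n", ##__VA_ARGS__)
        #define AUTOFS_ERROR(fmt, ...) pr_err(fmt "\n", ##__VA_ARGS__)
        #define AUTOFS_DEBUG(fmt, ...) pr_debug(fmt "\n", ##__VA_ARGS__)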

    Signed-off-by: Ian Kent
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Fix some whitespace formatting errors.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • The return from an ioctl when an invalid ioctl command is passed in
    should be EINVAL, not ENOSYS.
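
    The changed pattern in an ioctl dispatcher looks roughly like this (a
    sketch with an illustrative function name, not the exact autofs code):

        static int autofs_handle_ioctl(unsigned int cmd)
        {
                switch (cmd) {
                /* ... known AUTOFS_IOC_* commands handled above ... */
                default:
                        return -EINVAL;     /* previously returned -ENOSYS */
                }
        }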

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • The need for this is questionable but checkpatch.pl complains about the
    line length and it's a straightforward change.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Refactor autofs4_get_set_timeout() to eliminate a coding style error.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Try to make the coding style completely consistent throughout the
    autofs module and in line with kernel coding style recommendations.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • This is required for CRIU (Checkpoint/Restore In Userspace) to migrate
    a mount point when the write end of the pipe in user space is closed.

    Below is a brief description of the problem.

    To migrate a non-catatonic autofs mount point, one has to restore the
    control pipe between kernel and autofs master process.

    One of the autofs masters is systemd, which closes pipe write end after
    passing it to the kernel with mount call.

    To be able to restore the systemd control pipe one has to know which
    read pipe end in systemd corresponds to the write pipe end in the
    kernel. The pipe "fd" in mount options is not enough because it was
    closed and probably replaced by some other descriptor.

    Thus, some other attribute is required to be able to find the read pipe
    end. The best attribute to use to find the correct pipe end is inode
    number because it's unique for the whole system and can't be reused
    while the autofs mount exists.

    This attribute can also be used to recognize a situation where an autofs
    mount has no master (no process with specified "pgrp" or no file
    descriptor with "pipe_ino", specified in autofs mount options).

    Signed-off-by: Stanislav Kinsburskiy
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsburskiy
     
  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.
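
    The decode side of this scheme can be pictured roughly as follows (a
    sketch; the real kernel table and symbol names, as well as the exact
    bias applied to negative entries, may differ):

        /* Sketch: turn a 32-bit kallsyms table entry back into an address. */
        static unsigned long kallsyms_entry_to_addr(s32 entry,
                                                    unsigned long relative_base,
                                                    bool absolute_percpu)
        {
                if (!absolute_percpu)
                        return relative_base + (u32)entry; /* relative offset */

                if (entry >= 0)
                        return entry;   /* absolute (zero-based per-cpu) address */

                /* negative entries encode an offset from the base symbol */
                return relative_base - 1 - entry;
        }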

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Commit c6bda7c988a5 ("kallsyms: fix percpu vars on x86-64 with
    relocation") overloaded the 'A' (absolute) symbol type to signify that a
    symbol is not subject to dynamic relocation. However, the original A
    type does not imply that at all, and depending on the version of the
    toolchain, many A type symbols are emitted that are in fact relative to
    the kernel text, i.e., if the kernel is relocated at runtime, these
    symbols should be updated as well.

    For instance, on sparc32, the following symbols are emitted as absolute
    (kindly provided by Guenter Roeck):

    f035a420 A _etext
    f03d9000 A _sdata
    f03de8c4 A jiffies
    f03f8860 A _edata
    f03fc000 A __init_begin
    f041bdc8 A __init_text_end
    f0423000 A __bss_start
    f0423000 A __init_end
    f044457d A __bss_stop
    f044457d A _end

    On x86_64, similar behavior can be observed:

    ffffffff81a00000 A __end_rodata_hpage_align
    ffffffff81b19000 A __vvar_page
    ffffffff81d3d000 A _end

    Even if only a couple of them pass the symbol range check that results
    in them being taken into account for the final kallsyms symbol table, it
    is obvious that 'A' does not mean the symbol does not need to be updated
    at relocation time, and overloading its meaning to signify that is
    perhaps not a good idea.

    So instead, add a new percpu_absolute member to struct sym_entry, and
    when --absolute-percpu is in effect, use it to record symbols whose
    addresses should be emitted as final values rather than values that
    still require relocation at runtime. That way, we can drop the check
    against the 'A' type.
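
    In scripts/kallsyms.c terms this amounts to recording the property
    explicitly rather than inferring it from the symbol type; a simplified
    sketch (field and helper names are approximate):

        struct addr_range {
                unsigned long long start, end;
        };

        struct sym_entry {
                unsigned long long addr;
                unsigned int len;
                unsigned int start_pos;
                unsigned char *sym;
                unsigned int percpu_absolute;  /* new: zero-based per-cpu symbol */
        };

        /* With --absolute-percpu, mark per-cpu symbols instead of forcing 'A'. */
        static void mark_percpus_absolute(struct sym_entry *tab, unsigned int cnt,
                                          const struct addr_range *percpu)
        {
                unsigned int i;

                for (i = 0; i < cnt; i++)
                        if (tab[i].addr >= percpu->start &&
                            tab[i].addr <= percpu->end)
                                tab[i].percpu_absolute = 1;
        }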

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Acked-by: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero-based per-cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • This patch escapes a regex that uses left brace.

    Using checkpatch.pl with Perl 5.22.0 generates the warning: "Unescaped
    left brace in regex is deprecated, passed through in regex;"

    Comment from regcomp.c in Perl source: "Currently we don't warn when the
    lbrace is at the start of a construct. This catches it in the middle of
    a literal string, or when it's the first thing after something like
    "\b"."

    This works as a complement to 4e5d56bd ("checkpatch: fix left brace
    warning").

    Signed-off-by: Geyslan G. Bem
    Signed-off-by: Joe Perches
    Suggested-by: Peter Senna Tschudin
    Cc: Eddie Kovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geyslan G. Bem
     
  • Improve the test to allow casts to (unsigned) or (signed) to be found
    and fixed if desired.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers "unsigned int" over "unsigned" and "signed int"
    over "signed".

    Emit a warning for these simple signed/unsigned declarations. Fix
    it too if desired.
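
    The preference being enforced, as plain C declarations (illustrative
    variable names only):

        /* Flagged by the new checkpatch test: bare type specifiers */
        unsigned old_count;
        signed   old_delta;

        /* Preferred kernel style: spell out the full type */
        unsigned int new_count;
        signed int   new_delta;        /* or simply: int new_delta; */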

    Signed-off-by: Joe Perches
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • asm volatile and all its variants like __asm__ __volatile__ ("")
    are reported as errors with "Macros with complex values should be
    enclosed in parentheses".

    Make an exception for these asm volatile macro definitions by converting
    the "asm volatile" to "asm_volatile" so it appears as a single function
    call and the error isn't reported.
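
    A typical definition of this kind is the classic compiler barrier
    macro, shown here for illustration:

        /* Previously reported as a "complex value" macro needing parentheses,
         * although parenthesizing an asm statement makes no sense here. */
        #define barrier() __asm__ __volatile__("" : : : "memory")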

    Signed-off-by: Joe Perches
    Reported-by: Jeff Merkey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Migration accounting in the memory controller used to have to handle
    both oldpage and newpage being on the LRU already; fuse's page cache
    replacement used to pass a recycled newpage that had been uncharged but
    not freed and removed from the LRU, and the memcg migration code used to
    uncharge oldpage to "pass on" the existing charge to newpage.

    Nowadays, pages are no longer uncharged when truncated from the page
    cache, but rather only at free time, so if an LRU page is recycled in
    page cache replacement it'll also still be charged. And we bail out of
    the charge transfer altogether in that case. Tell commit_charge() that
    we know newpage is not on the LRU, to avoid taking the zone->lru_lock
    unnecessarily from the migration path.

    But also, oldpage is no longer uncharged inside migration. We only use
    oldpage for its page->mem_cgroup and page size, so we don't care about
    its LRU state anymore either. Remove any mention from the kernel doc.

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rather than scattering mem_cgroup_migrate() calls all over the place,
    have a single call from a safe place where every migration operation
    eventually ends up in - migrate_page_copy().

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is a report of a performance drop caused by hugepage allocation,
    in which half of the CPU time is spent in pageblock_pfn_to_page()
    during compaction [1].

    In that workload, compaction is triggered to make hugepages, but most
    pageblocks are unavailable for compaction due to the pageblock type and
    skip bit, so compaction usually fails. The most costly operation in
    this case is finding a valid pageblock while scanning the whole zone
    range. To check whether a pageblock is valid to compact, a valid pfn
    within the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to perform this check every time before scanning a
    pageblock, even when re-visiting it, and this turns out to be very
    expensive in this workload.

    Although we have no way to skip this pageblock check on systems where
    holes exist at arbitrary positions, we can use a cached value for zone
    contiguity and just do pfn_to_page() on systems without holes. This
    optimization considerably speeds up the above workload.

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min:  635 MB/s vs 1015 MB/s
    Avg:  899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23
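
    The shape of the optimization can be sketched like this (assuming the
    cached flag lives in struct zone as zone->contiguous and the existing
    check is split out as a slow path; names approximate):

        /* Sketch: skip per-pageblock validation when the zone has no holes. */
        static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                                         unsigned long end_pfn,
                                                         struct zone *zone)
        {
                if (zone->contiguous)
                        return pfn_to_page(start_pfn);

                return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
        }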

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • pageblock_pfn_to_page() is used to check that there is a valid pfn and
    that all pages in the pageblock are in a single zone. If there is a
    hole in the pageblock, passing an arbitrary position to
    pageblock_pfn_to_page() could cause the whole pageblock scan to be
    skipped, instead of just the hole page. For deterministic behaviour,
    it's better to always pass a pageblock-aligned range to
    pageblock_pfn_to_page(). This will also help further optimization of
    pageblock_pfn_to_page() in the following patch.
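
    Aligning the range before the call can be sketched as follows (the
    wrapper name is illustrative):

        /* Sketch: always hand pageblock_pfn_to_page() an aligned range. */
        static struct page *first_page_in_block(unsigned long pfn,
                                                struct zone *zone)
        {
                unsigned long block_start_pfn = round_down(pfn,
                                                           pageblock_nr_pages);
                unsigned long block_end_pfn = block_start_pfn +
                                              pageblock_nr_pages;

                return pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
                                             zone);
        }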

    Signed-off-by: Joonsoo Kim
    Cc: Aaron Lu
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • free_pfn and compact_cached_free_pfn are the pointers that remember the
    restart position of the freepage scanner. When they are reset or
    invalid, we set them to zone_end_pfn because the freepage scanner works
    in the reverse direction. But, because the zone range is defined as
    [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access.
    Therefore, we should not store it in free_pfn and
    compact_cached_free_pfn. Instead, we need to store zone_end_pfn - 1 in
    them. There is one more thing we should consider. The freepage scanner
    scans in reverse, one pageblock at a time. If free_pfn and
    compact_cached_free_pfn are set to the middle of a pageblock, that
    situation is treated as if the front part of the pageblock had already
    been scanned, so we lose the opportunity to scan there. To fix this up,
    this patch does a round_down() to guarantee that the reset position is
    pageblock aligned.

    Note that, thanks to the current pageblock_pfn_to_page() implementation,
    no actual access to zone_end_pfn happens yet. But the following patch
    will change pageblock_pfn_to_page(), so this patch is needed from now
    on.
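
    The corrected reset then looks roughly like this (the helper name is
    illustrative; the fields are those used by the compaction code):

        /* Sketch: reset the free scanner to the zone's last pageblock,
         * not to zone_end_pfn itself, which is one past the end. */
        static void reset_cached_free_pfn(struct zone *zone,
                                          struct compact_control *cc)
        {
                unsigned long free_pfn = round_down(zone_end_pfn(zone) - 1,
                                                    pageblock_nr_pages);

                zone->compact_cached_free_pfn = free_pfn;
                cc->free_pfn = free_pfn;
        }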

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We define struct memblock_type *type in the memblock_add_region() and
    memblock_reserve_region() functions only for passing it to the
    memblock_add_range() and memblock_reserve_range() functions. Let's
    remove these variables and pass the type directly.
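
    The resulting pattern is simply to pass the global type in the call (a
    sketch; argument details such as the node id and flags are simplified):

        /* Sketch: no local 'struct memblock_type *type' variable needed. */
        static int __init_memblock memblock_add_region(phys_addr_t base,
                                                       phys_addr_t size,
                                                       int nid,
                                                       unsigned long flags)
        {
                return memblock_add_range(&memblock.memory, base, size,
                                          nid, flags);
        }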

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • We want to couple all debugging features with debug_pagealloc_enabled()
    and not with the config option CONFIG_DEBUG_PAGEALLOC.
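
    The pattern is to replace compile-time fencing with the runtime check,
    roughly as below (setup_debug_feature() is an illustrative callee):

        /* Old: #ifdef CONFIG_DEBUG_PAGEALLOC around the feature setup.
         * New: honour the runtime state, which also reflects
         * booting with debug_pagealloc=off. */
        static void init_debug_features(void)
        {
                if (debug_pagealloc_enabled())
                        setup_debug_feature();
        }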

    Signed-off-by: Christian Borntraeger
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Reviewed-by: Thomas Gleixner
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • We can use debug_pagealloc_enabled() to check if we can map the identity
    mapping with 1MB/2GB pages as well as to print the current setting in
    dump_stack.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Cc: Thomas Gleixner
    Acked-by: David Rientjes
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • We can use debug_pagealloc_enabled() to check if we can map the identity
    mapping with 2MB pages. We can also add the state into the dump_stack
    output.

    The patch does not touch the code for the 1GB pages, which ignored
    CONFIG_DEBUG_PAGEALLOC. Do we need to fence this as well?

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Thomas Gleixner
    Acked-by: David Rientjes
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • After one of the bugfixes to freeze_page(), we no longer have frozen
    pages in the rmap, therefore the mapcount of all subpages of a frozen
    THP is zero. And we have an assert for that.

    Let's drop the code which deals with a non-zero mapcount of subpages.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • do_fault() assumes that PAGE_SIZE is the same as PAGE_CACHE_SIZE. Use
    linear_page_index() to calculate pgoff in the correct units.

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts where the page is already locked,
    and fix the debug code in the page stat functions to be okay with the
    page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as the refaults are compared to the
    global number of active cache, cgroups might wrongfully get all their
    refaults activated when their pages are hotter than those of others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For per-cgroup thrash detection, we need to store the memcg ID inside
    the radix tree cookie as well. However, on 32 bit that doesn't leave
    enough bits for the eviction timestamp to cover the necessary range of
    recently evicted pages. The radix tree entry would look like this:

    [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

    12 bits means 4096 pages, or 16M worth of recently evicted pages.
    But refaults are actionable up to distances covering half of memory. To
    not miss refaults, we have to stretch out the range at the cost of how
    precisely we can tell when a page was evicted. This way we can shave
    off lower bits from the eviction timestamp until the necessary range is
    covered. E.g. grouping evictions into 1M buckets (256 pages) will
    stretch the longest representable refault distance to 4G.

    This patch implements eviction buckets that are automatically sized
    according to the available bits and the necessary refault range, in
    preparation for per-cgroup thrash detection.

    The maximum actionable distance is currently half of memory, but to
    support memory hotplug of up to 200% of boot-time memory, we size the
    buckets to cover double the distance. Beyond that, thrashing won't be
    detectable anymore.

    During boot, the kernel will print out the exact parameters, like so:

    [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

    In this example, there are 12 radix entry bits available for the
    eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
    1G machine). Consequently, evictions must be grouped into buckets of
    2^6 pages, or 256K.
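
    Packing such an entry then looks roughly like this, shaving bucket_order
    bits off the eviction counter before folding in the memcg and zone
    identifiers (a sketch; constant and helper names approximate):

        static unsigned int bucket_order __read_mostly;  /* computed at boot */

        /* Sketch: pack (memcg id, zone, eviction time) into one entry. */
        static void *pack_shadow(int memcgid, struct zone *zone,
                                 unsigned long eviction)
        {
                eviction >>= bucket_order;              /* group into buckets */
                eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
                eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
                eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);

                return (void *)((eviction << RADIX_TREE_EXCEPTIONAL_SHIFT) |
                                RADIX_TREE_EXCEPTIONAL_ENTRY);
        }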

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-cgroup thrash detection will need to derive a live memcg from the
    eviction cookie, and doing that inside unpack_shadow() will get nasty
    with the reference handling spread over two functions.

    In preparation, make unpack_shadow() clearly about extracting static
    data, and let workingset_refault() do all the higher-level handling.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a compile-time constant, no need to calculate it on refault.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these now
    easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • zone_reclaimable_pages() is used in should_reclaim_retry() which uses it
    to calculate the target for the watermark check. This means that
    precise numbers are important for the correct decision.
    zone_reclaimable_pages uses zone_page_state which can contain stale data
    with per-cpu diffs not synced yet (the last vmstat_update might have run
    1s in the past).

    Use zone_page_state_snapshot() in zone_reclaimable_pages() instead.
    None of the current callers is in a hot path where getting the precise
    value (which involves per-cpu iteration) would cause an unreasonable
    overhead.
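
    For reference, the difference between the two accessors is roughly the
    following (a sketch of the SMP case; the snapshot variant additionally
    folds in the per-cpu deltas that vmstat_update has not synced yet):

        /* Cheap read: global counter only, may lag behind per-cpu deltas. */
        static unsigned long zone_state_cheap(struct zone *zone,
                                              enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);

                return x < 0 ? 0 : x;
        }

        /* Precise read: also sum the per-cpu diffs. More expensive, but
         * fine outside hot paths. */
        static unsigned long zone_state_snapshot(struct zone *zone,
                                                 enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
                int cpu;

                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset,
                                         cpu)->vm_stat_diff[item];

                return x < 0 ? 0 : x;
        }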

    Signed-off-by: Michal Hocko
    Signed-off-by: Tetsuo Handa
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some new MADV_* advices are not documented in sys_madvise() comment. So
    let's update it.

    [akpm@linux-foundation.org: modifications suggested by Michal]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jason Baron
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently, on shrinker registration we clear SHRINKER_NUMA_AWARE if
    there is only one NUMA node present. The comment states that this will
    allow us to save some small loop time later. It used to be true when
    this code was added (see commit 1d3d4437eae1b ("vmscan: per-node
    deferred work")), but since commit 6b4f7799c6a57 ("mm: vmscan: invoke
    slab shrinkers from shrink_zone()") it doesn't make any difference.
    Anyway, running on a non-NUMA machine shouldn't make a shrinker NUMA
    unaware, so zap this hunk.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add support for the newly added kernel memory auto-onlining policy to
    the Xen balloon driver.

    Signed-off-by: Vitaly Kuznetsov
    Suggested-by: Daniel Kiper
    Reviewed-by: Daniel Kiper
    Acked-by: David Vrabel
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Currently, all newly added memory blocks remain in the 'offline' state
    unless someone onlines them; some Linux distributions carry special
    udev rules like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    to make this happen automatically. This is not a great solution for
    virtual machines where memory hotplug is being used to address high
    memory pressure situations as such onlining is slow and a userspace
    process doing this (udev) has a chance of being killed by the OOM killer
    as it will probably need to allocate some memory.

    Introduce default policy for the newly added memory blocks in
    /sys/devices/system/memory/auto_online_blocks file with two possible
    values: "offline" which preserves the current behavior and "online"
    which causes all newly added memory blocks to go online as soon as
    they're added. The default is "offline".

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Daniel Kiper
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Vrabel
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Arm and arm64 used to trigger this BUG_ON() - this has now been fixed.

    But a WARN_ON() here is sufficient to catch future buggy callers.

    Signed-off-by: Mika Penttilä
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mika Penttilä
     
  • VM_HUGETLB and VM_MIXEDMAP vmas need to be excluded to avoid compound
    pages being marked for migration and unexpected COWs when handling a
    hugetlb fault.

    Thanks to Naoya Horiguchi for reminding me of these checks.
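
    The exclusion boils down to a flags test on the VMA, along these lines
    (a sketch; the helper name is illustrative):

        /* Sketch: skip mappings whose pages must not be queued here. */
        static bool vma_migratable_here(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_HUGETLB | VM_MIXEDMAP))
                        return false;   /* compound/special pages: hands off */

                return true;
        }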

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Suggested-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Cc: SeongJae Park
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen