11 Sep, 2015

13 commits

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is a set of 5 patches with cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - Patches were added to consolidate the blkcg entry point and blkg
    creation. This is in itself an improvement and helps in collecting
    common stats on bio issue.

    - per-blkg stats are now accounted on bio issue rather than request
    completion so that bio-based and request-based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     
    Let's use a helper rather than a direct check of vma->vm_ops to
    distinguish anonymous VMAs.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We rely on vma->vm_ops == NULL to detect anonymous VMAs (see
    vma_is_anonymous()), but some drivers don't set ->vm_ops.

    As a result we can end up with an anonymous page in a private file
    mapping. That should not lead to serious misbehaviour, but it is
    nevertheless wrong.

    Let's fix this by setting up a dummy ->vm_ops for a file mapping if
    f_op->mmap() didn't set its own.

    The patch also adds a sanity check to __vma_link_rb(). It will help
    catch broken VMAs which are inserted directly into mm_struct via
    insert_vm_struct().

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
    rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
    wrapper on top of do_mmap(). Perhaps we should update the callers of
    do_mmap_pgoff() and kill it later.

    This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) without
    playing with vm internals.

    After this change mmap_region() has a single user outside of mmap.c,
    arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
    change arch/tile/ and unexport mmap_region().

    [kirill@shutemov.name: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Dave Hansen
    Tested-by: Dave Hansen
    Signed-off-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Instead of a custom approach, let's use the recently introduced
    seq_hex_dump() helper.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: Tadeusz Struk
    Cc: Helge Deller
    Cc: Ingo Tuchscherer
    Acked-by: Catalin Marinas
    Cc: Vladimir Kondratiev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus, by setting the Idle flag for the
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the number
    of pages that are not used by the workload.
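
    As a minimal illustration (not part of this patch; the helper name and
    the fixed page size are made up for the example), one could mark a
    process's resident pages idle from Python, assuming the documented
    pagemap layout (bit 63 = page present, bits 0-54 = PFN) and that the
    bitmap file accepts 8-byte words at offset (pfn / 64) * 8:

    import struct

    def mark_range_idle(pid, start_vaddr, end_vaddr, page_size=4096):
        # Set the Idle bit for every present page mapped in the given
        # virtual address range of the process (requires root).
        with open("/proc/%d/pagemap" % pid, "rb") as pagemap, \
             open("/sys/kernel/mm/page_idle/bitmap", "r+b") as bitmap:
            for vaddr in range(start_vaddr, end_vaddr, page_size):
                pagemap.seek(vaddr // page_size * 8)
                entry, = struct.unpack("Q", pagemap.read(8))
                if not (entry >> 63) & 1:       # page not present
                    continue
                pfn = entry & ((1 << 55) - 1)   # bits 0-54 hold the PFN
                bitmap.seek(pfn // 64 * 8)      # 64 pages per 8-byte word
                bitmap.write(struct.pack("Q", 1 << (pfn % 64)))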

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in primary, but also in secondary ptes. The latter is required in order
    to estimate the working set size (wss) of KVM VMs. At the same time we
    want to avoid flushing the TLB, because it is quite expensive and it won't
    really affect the final result.

    Currently, there is no function for clearing the pte young bit that would
    meet our requirements, so this patch introduces one. To achieve that we
    have to add a new mmu-notifier callback, clear_young, since there is no
    method for testing-and-clearing a secondary pte without flushing the TLB.
    The new method is not mandatory and is currently only implemented by KVM.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is only used in mem_cgroup_try_charge, so fold it in and zap it.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Hwpoison allows filtering pages by memory cgroup inode number. Currently,
    it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
    then its ino using cgroup_ino, but now we have a helper method for
    that, page_cgroup_ino, so use it instead.

    This patch also loosens the hwpoison memcg filter dependency rules - it
    makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because the
    hwpoison memcg filter does not require anything (nor did it ever) from
    the CONFIG_MEMCG_SWAP side.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    This patchset introduces a new user API for tracking user memory pages
    that have not been used for a given period of time. The purpose of this
    is to provide userspace with the means of tracking a workload's
    working set, i.e. the set of pages that are actively used by the
    workload. Knowing the working set size can be useful for partitioning the
    system more efficiently, e.g. by tuning memory cgroup limits
    appropriately, or for job placement within a compute cluster.

    ==== USE CASES ====

    The unified cgroup hierarchy has memory.low and memory.high knobs, which
    are defined as the low and high boundaries for the workload working set
    size. However, the working set size of a workload may be unknown or
    change in time. With this patch set, one can periodically estimate the
    amount of memory unused by each cgroup and tune their memory.low and
    memory.high parameters accordingly, therefore optimizing the overall
    memory utilization.

    Another use case is balancing workloads within a compute cluster. Knowing
    how much memory is not really used by a workload unit may help take a more
    optimal decision when considering migrating the unit to another node
    within the cluster.

    Also, as noted by Minchan, this would be useful for per-process reclaim
    (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
    memory manager could reclaim only idle pages.

    ==== USER API ====

    The user API consists of two new files:

    * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each
    bit corresponds to a page, indexed by PFN. When the bit is set, the
    corresponding page is idle. A page is considered idle if it has not been
    accessed since it was marked idle. To mark a page idle one should set the
    bit corresponding to the page by writing to the file. A value written to the
    file is OR-ed with the current bitmap value. Only user memory pages can be
    marked idle; for other page types the input is silently ignored. Writing to
    this file beyond the max PFN results in the ENXIO error. Only available when
    CONFIG_IDLE_PAGE_TRACKING is set.

    This file can be used to estimate the number of pages that are not
    used by a particular workload as follows:

    1. mark all pages of interest idle by setting corresponding bits in the
    /sys/kernel/mm/page_idle/bitmap
    2. wait until the workload accesses its working set
    3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set

    * /proc/kpagecgroup. This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.

    This file can be used to find all pages (including unmapped file pages)
    accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
    can then estimate the cgroup working set size.
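
    As a small, hedged illustration (the helper below is not part of this
    series, and the cgroup path is just an example), the cgroup inode for a
    given PFN can be read by seeking into /proc/kpagecgroup, which holds one
    64-bit inode number per page:

    import os
    import struct

    def page_cgroup_ino(pfn):
        # Each PFN has an 8-byte entry in /proc/kpagecgroup.
        with open("/proc/kpagecgroup", "rb") as f:
            f.seek(pfn * 8)
            ino, = struct.unpack("Q", f.read(8))
        return ino

    # Compare against the inode of a cgroup directory, obtained via stat(2),
    # e.g. os.stat("/sys/fs/cgroup/memory/mygroup").st_ino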

    For an example of using these files to estimate the number of unused
    memory pages per memory cgroup, please see the script attached
    below.

    ==== REASONING ====

    The reason to introduce the new user API instead of using
    /proc/PID/{clear_refs,smaps} is that the latter has two serious
    drawbacks:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    The new API attempts to overcome them both. For more details on how it
    is achieved, please see the comment to patch 6.

    ==== PATCHSET STRUCTURE ====

    The patch set is organized as follows:

    - patch 1 adds page_cgroup_ino() helper for the sake of
    /proc/kpagecgroup and patches 2-3 do related cleanup
    - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
    charged to
    - patch 5 introduces a new mmu notifier callback, clear_young, which is
    a lightweight version of clear_flush_young; it is used in patch 6
    - patch 6 implements the idle page tracking feature, including the
    userspace API, /sys/kernel/mm/page_idle/bitmap
    - patch 7 exports idle flag via /proc/kpageflags

    ==== SIMILAR WORKS ====

    Originally, the patch for tracking idle memory was proposed back in 2011
    by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
    difference between Michel's patch and this one is that Michel implemented
    a kernel-space daemon for estimating idle memory size per cgroup, while
    this patch only provides userspace with a minimal API for doing the
    job, leaving the rest up to userspace. However, they both share the
    same idea of Idle/Young page flags to avoid affecting the reclaimer logic.

    ==== PERFORMANCE EVALUATION ====

    SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
    performance impact introduced by this patch set. Three runs were carried
    out:

    - base: kernel without the patch
    - patched: patched kernel, the feature is not used
    - patched-active: patched kernel, a 1-minute-period daemon is used for
    tracking idle memory

    For tracking idle memory, idlememstat utility was used:
    https://github.com/locker/idlememstat

    testcase base patched patched-active

    compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)%
    compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)%
    crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)%
    derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)%
    mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)%
    scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)%
    scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)%
    serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)%
    startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)%
    sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)%
    xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)%

    composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)%

    time idlememstat:

    17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
    448inputs+40outputs (1major+36052minor)pagefaults 0swaps

    ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
    #! /usr/bin/python
    #

    import os
    import stat
    import errno
    import struct

    CGROUP_MOUNT = "/sys/fs/cgroup/memory"
    BUFSIZE = 8 * 1024  # must be multiple of 8

    def get_hugepage_size():
        with open("/proc/meminfo", "r") as f:
            for s in f:
                k, v = s.split(":")
                if k == "Hugepagesize":
                    return int(v.split()[0]) * 1024

    PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
    HUGEPAGE_SIZE = get_hugepage_size()

    def set_idle():
        f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
        while True:
            try:
                f.write(struct.pack("Q", pow(2, 64) - 1))
            except IOError as err:
                if err.errno == errno.ENXIO:
                    break
                raise
        f.close()

    def count_idle():
        f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
        f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

        with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
            while f.read(BUFSIZE): pass  # update idle flag

        idlememsz = {}
        while True:
            s1, s2 = f_flags.read(8), f_cgroup.read(8)
            if not s1 or not s2:
                break

            flags, = struct.unpack('Q', s1)
            cgino, = struct.unpack('Q', s2)

            unevictable = (flags >> 18) & 1
            huge = (flags >> 22) & 1
            idle = (flags >> 25) & 1

            if idle and not unevictable:
                idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                    (HUGEPAGE_SIZE if huge else PAGE_SIZE)

        f_flags.close()
        f_cgroup.close()
        return idlememsz

    if __name__ == "__main__":
        print "Setting the idle flag for each page..."
        set_idle()

        raw_input("Wait until the workload accesses its working set, "
                  "then press Enter")

        print "Counting idle pages..."
        idlememsz = count_idle()

        for dir, subdirs, files in os.walk(CGROUP_MOUNT):
            ino = os.stat(dir)[stat.ST_INO]
            print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
    ==== END SCRIPT ====

    This patch (of 8):

    Add page_cgroup_ino() helper to memcg.

    This function returns the inode number of the closest online ancestor of
    the memory cgroup a page is charged to. It is required for exporting
    information about which page is charged to which cgroup to userspace,
    which will be introduced by a following patch.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Update the zpool and compressor parameters to be changeable at runtime.
    When changed, a new pool is created with the requested zpool/compressor
    and added as the current pool at the front of the pool list. Previous
    pools remain in the list only so that existing compressed pages can be
    removed from them. The old pool(s) are removed once they become empty.
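
    As a hedged sketch of the resulting interface (this assumes the standard
    module-parameter layout under /sys/module, and "lz4"/"zsmalloc" are only
    example values that must be available on the running kernel), the active
    compressor and zpool could then be switched at runtime like this:

    def set_zswap_param(name, value):
        # Writing a new value to the now-writable zswap module parameters
        # ("compressor" or "zpool") makes a new pool the current one.
        with open("/sys/module/zswap/parameters/" + name, "w") as f:
            f.write(value)

    set_zswap_param("compressor", "lz4")
    set_zswap_param("zpool", "zsmalloc")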

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
    Add dynamic creation of pools. Move the static crypto compression per-cpu
    transforms into each pool. Add to zswap_entry a pointer to the pool it
    belongs to.

    This is required by the following patch which enables changing the zswap
    zpool and compressor params at runtime.

    [akpm@linux-foundation.org: fix merge snafus]
    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
    This series makes creation of the zpool and compressor dynamic, so that
    they can be changed at runtime. This makes using/configuring zswap
    easier, as before this change zswap had to be configured at boot time
    using boot params.

    This uses a single list to track both the zpool and compressor together,
    although Seth had mentioned an alternative, which is to track the zpools
    and compressors using separate lists. In the most common case of only a
    single zpool and a single compressor, using one list is slightly simpler
    than using two lists, and for the uncommon case of multiple zpools and/or
    compressors, using one list is slightly less simple (and probably uses
    slightly more memory) than using two lists.

    This patch (of 4):

    Add zpool_has_pool() function, indicating if the specified type of zpool
    is available (i.e. zsmalloc or zbud). This allows checking if a pool is
    available, without actually trying to allocate it, similar to
    crypto_has_alg().

    This is used by a following patch to zswap that enables the dynamic
    runtime creation of zswap zpools.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

09 Sep, 2015

27 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • Remove zpool_init() and zpool_exit(); they do nothing other than print
    "loaded" and "unloaded".

    Signed-off-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The structure zbud_ops is not modified so make the pointer to it a
    pointer to const.

    Signed-off-by: Krzysztof Kozlowski
    Acked-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • The structure zpool_ops is not modified so make the pointer to it a
    pointer to const.

    Signed-off-by: Krzysztof Kozlowski
    Acked-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
    zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
    same code, with the only significant differences being the return value
    and the usage of swap_readpage.

    I added a helper, __read_swap_cache_async(), with the common code.
    Behavior change: zswap_get_swap_cache_page will now use
    radix_tree_maybe_preload instead of radix_tree_preload. It looks like
    this had not been changed before only because of the code duplication.

    Signed-off-by: Dmitry Safonov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: David Herrmann
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • We can pass a NULL cache pointer to kmem_cache_destroy(), because it
    NULL-checks its argument now. Remove redundant test from
    destroy_handle_cache().

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    We can avoid taking the class ->lock around zs_can_compact() in
    zs_shrinker_count(), because the number that we return is outdated
    in the general case, by design. We have different sources that are able
    to change a class's state right after we return from zs_can_compact() --
    ongoing I/O operations, manually triggered compaction, or both
    happening simultaneously.

    We redo these calculations during compaction on a per-class basis
    anyway.

    zs_unregister_shrinker() will not return until we have an active
    shrinker, so classes won't unexpectedly disappear while
    zs_shrinker_count() iterates them.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    There is no need to recalculate pages_per_zspage at runtime. Just use
    class->pages_per_zspage to avoid unnecessary runtime overhead.

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    There is no reason to prevent selecting ZS_ALMOST_FULL as a migration
    source if we cannot find a source from ZS_ALMOST_EMPTY.

    With this patch, zs_can_compact will return a more exact result.

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    We want to see more ZS_FULL pages and fewer ZS_ALMOST_{FULL, EMPTY}
    pages. Put a page with a higher ->inuse count first within its
    ->fullness_list, which will give us better chances to fill up this page
    with new objects (find_get_zspage() returns the ->fullness_list head for
    new object allocation), so some zspages will become ZS_ALMOST_FULL/ZS_FULL
    quicker.

    It performs a trivial and cheap ->inuse compare which does not slow down
    zsmalloc and in the worst case keeps the list pages in no particular
    order.

    A more expensive solution could sort fullness_list by ->inuse count.

    [minchan@kernel.org: code adjustments]
    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Perform automatic pool compaction by a shrinker when the system is
    getting tight on memory.

    User space has very little knowledge regarding zsmalloc fragmentation
    and basically has no mechanism to tell whether compaction will result in
    any memory gain. Another issue is that user space is not always aware
    of the fact that the system is getting tight on memory. This leads to
    very uncomfortable scenarios where user space may start issuing
    compaction 'randomly' or from crontab (for example). Fragmentation is
    not always necessarily bad: allocated but unused objects may, after all,
    be filled with data later, without the need to allocate a new zspage.
    On the other hand, we obviously don't want to waste memory when the
    system needs it.

    Compaction now has a relatively quick pool scan so we are able to
    estimate the number of pages that will be freed easily, which makes it
    possible to call this function from a shrinker->count_objects()
    callback. We also abort compaction as soon as we detect that we can't
    free any pages any more, preventing wasteful objects migrations.

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    Compaction returns back to zram the number of migrated objects, which is
    quite uninformative -- we have objects of different sizes, so user space
    cannot obtain any valuable data from that number. Change compaction to
    operate in terms of pages and return to the compaction issuer the
    number of pages that were freed during compaction. So from now on we
    will export a more meaningful value in zram/mm_stat -- the number of
    freed (compacted) pages.

    This requires:
    (a) a rename of `num_migrated' to 'pages_compacted'
    (b) an internal API change -- return first_page's fullness_group from
    putback_zspage(), so we know when putback_zspage() did
    free_zspage(). This helps us account compaction stats correctly.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    `zs_compact_control' accounts the number of migrated objects, but it has
    a limited lifespan -- we lose it as soon as zs_compaction() returns back
    to zram. It worked fine, because (a) zram had its own counter of
    migrated objects and (b) only zram could trigger compaction. However,
    this does not work for automatic pool compaction (not issued by zram).
    To account objects migrated during auto-compaction (issued by the
    shrinker) we need to store this number in zs_pool.

    Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
    there. It provides only `num_migrated', as of this writing, but it
    surely can be extended.

    A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
    caller.

    Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Change zs_object_copy() argument order to be (DST, SRC) rather than
    (SRC, DST). copy/move functions usually have (to, from) arguments
    order.

    Rename alloc_target_page() to isolate_target_page(). This function
    doesn't allocate anything, it isolates target page, pretty much like
    isolate_source_page().

    Tweak __zs_compact() comment.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This function checks if class compaction will free any pages.
    Rephrasing -- do we have enough unused objects to form at least one
    ZS_EMPTY page and free it. It aborts compaction if class compaction
    will not result in any (further) savings.

    EXAMPLE (this debug output is not part of this patch set):

    - class size
    - number of allocated objects
    - number of used objects
    - max objects per zspage
    - pages per zspage
    - estimated number of pages that will be freed

    [..]
    class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0
    ... class-512 compaction is useless. break
    class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2
    class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1
    class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0
    ... class-496 compaction is useless. break
    class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4
    class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3
    class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2
    class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1
    class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0
    ... class-448 compaction is useless. break
    class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1
    class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0
    ... class-432 compaction is useless. break
    class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2
    class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1
    class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0
    ... class-416 compaction is useless. break
    class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1
    class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0
    ... class-400 compaction is useless. break
    class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0
    ... class-384 compaction is useless. break
    [..]

    Every "compaction is useless" indicates that we saved CPU cycles.

    class-512 has
    544 objects allocated
    540 objects used
    8 objects per-page

    Even if we have an ALMOST_EMPTY zspage, we still don't have enough room to
    migrate all of its objects and free this zspage; so compaction will not
    make a lot of sense, and it's better to just leave it as is.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Always account per-class `zs_size_stat' stats. This data will help us
    make better decisions during compaction. We are especially interested
    in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
    will result in any memory gain.

    For instance, we know the number of allocated objects in the class, the
    number of objects being used (so we also know how many objects are not
    used) and the number of objects per page. So we can check whether we have
    enough unused objects to form at least one ZS_EMPTY zspage during
    compaction.

    We calculate this value on a per-class basis so we can calculate the total
    number of zspages that can be released, which is exactly what a
    shrinker wants to know.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    This patchset tweaks compaction and makes it possible to trigger pool
    compaction automatically when the system is getting low on memory.

    zsmalloc in some cases can suffer from notable fragmentation, and
    compaction can release a considerable amount of memory. The problem
    here is that currently we fully rely on user space to perform compaction
    when needed. However, performing zsmalloc compaction is not always an
    obvious thing to do. For example, suppose we have an `idle', fragmented
    (compaction was never performed) zram device and the system is getting low
    on memory due to some 3rd party user processes (gcc LTO, or firefox,
    etc.). It's quite unlikely that user space will issue zpool compaction
    in this case. Besides, user space cannot tell for sure how badly the pool
    is fragmented; however, this info is known to zsmalloc and, hence, to a
    shrinker.

    This patch (of 7):

    __zs_compact() does not use `nr_to_migrate', drop it.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    For a memoryless node, the outputs of get_pfn_range_for_nid() are all
    zero, so it would display memory from 0 to -1.

    Signed-off-by: Zhen Lei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • When hot adding a node from add_memory(), we will add memblock first, so
    the node is not empty. But when called from cpu_up(), the node should
    be empty.

    Signed-off-by: Xishi Qiu
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Taku Izumi
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
    We use sysctl_lowmem_reserve_ratio rather than
    sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
    is in defending lowmem from the possibility of being captured into
    pinned user memory. To avoid being misleading, correct this in some
    comments.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • The comment says that the per-cpu batchsize and zone watermarks are
    determined by present_pages which is definitely wrong, they are both
    calculated from managed_pages. Fix it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • There's no point in initializing vma->vm_pgoff if the insertion attempt
    will be failing anyway. Run the checks before performing the
    initialization.

    Signed-off-by: Chen Gang
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    Commit 1dfb059b9438 ("thp: reduce khugepaged freezing latency") fixed
    khugepaged so that it does not block system suspend. But the result is
    that it could not get interrupted before the given timeout because the
    condition for the wait event is "false".

    This patch puts back the original approach but it uses
    freezable_schedule_timeout_interruptible() instead of
    schedule_timeout_interruptible(). It does the right thing. I am pretty
    sure that the freezable variant was not used in the original fix only
    because it was not available at that time.

    The regression has been there for ages. It was not critical. It just
    did the allocation throttling a little bit more aggressively.

    I found this problem when converting the kthread to kthread worker API
    and trying to understand the code.

    This bug is thought to have minimal userspace-visible impact. Somebody
    could set a high alloc_sleep value by mistake, and then try to fix it
    back, but khugepaged would keep sleeping until the high value expires.

    Signed-off-by: Petr Mladek
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: David Rientjes
    Cc: Ebru Akagunduz
    Cc: Mel Gorman
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • s/succees/success/

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    We cache isolate_start_pfn before entering isolate_migratepages(). If a
    pageblock is skipped in isolate_migratepages() for whatever reason,
    cc->migrate_pfn can be far from isolate_start_pfn, hence we flush pages
    that were freed. For example, the following scenario can be possible:

    - assume order-9 compaction, pageblock order is 9
    - start_isolate_pfn is 0x200
    - isolate_migratepages()
    - skip a number of pageblocks
    - start to isolate from pfn 0x600
    - cc->migrate_pfn = 0x620
    - return
    - last_migrated_pfn is set to 0x200
    - check flushing condition
    - current_block_start is set to 0x600
    - last_migrated_pfn < current_block_start then do useless flush

    This wrong flush would not help the performance and success rate so this
    patch tries to fix it. One simple way to know the exact position where
    we start to isolate migratable pages is that we cache it in
    isolate_migratepages() before entering actual isolation. This patch
    implements that and fixes the problem.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would previously have been
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka