23 Sep, 2015

3 commits

    The sane_reclaim() helper is supposed to return false for memcg reclaim
    if the legacy hierarchy is used, because the latter lacks a dirty
    throttling mechanism, and so it did until it was accidentally broken by
    commit 33398cf2f360c ("memcg: export struct mem_cgroup"). Fix it.
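
    A minimal sketch of the intended behaviour, for orientation only (the
    exact guards in mm/vmscan.c may differ from this):

        /* Hedged sketch: global reclaim and reclaim on the unified (cgroup2)
         * hierarchy can rely on dirty throttling; legacy-hierarchy memcg
         * reclaim cannot, so the helper should return false there. */
        static bool sane_reclaim(struct scan_control *sc)
        {
            struct mem_cgroup *memcg = sc->target_mem_cgroup;

            if (!memcg)
                return true;        /* global reclaim */
        #ifdef CONFIG_CGROUP_WRITEBACK
            if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
                return true;        /* unified hierarchy */
        #endif
            return false;           /* legacy hierarchy */
        }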

    Fixes: 33398cf2f360c ("memcg: export struct mem_cgroup")
    Signed-off-by: Vladimir Davydov
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    each hugetlb page maintains its active flag to avoid a race condition
    between multiple calls of isolate_huge_page(), but current kernel
    doesn't set the flag on a hugepage allocated by migration because the
    proper putback routine isn't called. This means that users could
    still encounter the race referred to by bcc54222309c in this special
    case, so this patch fixes it.

    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: [4.1.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
    vm_ops->page_mkwrite to get notified about write access. This means we
    want vma->vm_page_prot to be write-protected if the VMA provides this
    vm_ops.

    A theoretical scenario that will cause these missed events is:

    On a writable mapping with vm_ops->pfn_mkwrite, but without
    vm_ops->page_mkwrite: a read fault followed by a write access to the
    pfn. A writable pte is set up on the read fault, so no write fault is
    generated afterwards.

    I found it examining Dave's complaint on generic/080:

    http://lkml.kernel.org/g/20150831233803.GO3902@dastard

    Although I don't think it's the reason.

    It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
    and page_mkwrite.
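
    The akpm note below about a local vm_ops hints at the shape of the fix;
    a hedged sketch of the kind of check this implies in
    vma_wants_writenotify() (mm/mmap.c), with the exact condition possibly
    differing from the real patch:

        /* Hedged sketch: a VMA providing either callback wants write
         * notification, so its vm_page_prot must be set up write-protected
         * and the first write must fault. */
        const struct vm_operations_struct *vm_ops = vma->vm_ops;

        if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
            return 1;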

    [akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess]
    Signed-off-by: Kirill A. Shutemov
    Cc: Yigal Korman
    Acked-by: Boaz Harrosh
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Sep, 2015

2 commits

  • Revert commit 6dc296e7df4c "mm: make sure all file VMAs have ->vm_ops
    set".

    Will Deacon reports that it "causes some mmap regressions in LTP, which
    appears to use a MAP_PRIVATE mmap of /dev/zero as a way to get anonymous
    pages in some of its tests (specifically mmap10 [1])".

    William Shuman reports Oracle crashes.

    So revert the patch while we work out what to do.

    Reported-by: William Shuman
    Reported-by: Will Deacon
    Cc: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    The shadow that corresponds to 16 bytes of memory may span 2 or 3
    shadow bytes. If the memory is 8-byte aligned, the shadow takes only 2
    bytes, so checking "shadow_first_bytes" is enough and there is no need
    to call "memory_is_poisoned_1(addr + 15)". But the test "if
    (likely(!last_byte))" is the wrong way to detect this case.

    E.g. for addr=0, last_byte = 15 & KASAN_SHADOW_MASK = 7, so the code
    still goes on to call "memory_is_poisoned_1(addr + 15)".
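
    A standalone illustration of the shadow arithmetic (not kernel code;
    the constants mirror KASAN's 8-bytes-per-shadow-byte granularity):

        #include <stdio.h>

        #define KASAN_SHADOW_SCALE_SHIFT 3  /* 8 bytes per shadow byte */
        #define KASAN_SHADOW_MASK ((1UL << KASAN_SHADOW_SCALE_SHIFT) - 1)

        /* number of shadow bytes covering a 16-byte access at addr */
        static unsigned long shadow_bytes_for_16(unsigned long addr)
        {
            unsigned long first = addr >> KASAN_SHADOW_SCALE_SHIFT;
            unsigned long last  = (addr + 15) >> KASAN_SHADOW_SCALE_SHIFT;

            return last - first + 1;
        }

        int main(void)
        {
            unsigned long addrs[] = { 0, 8, 4, 15 };
            unsigned int i;

            for (i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++) {
                unsigned long addr = addrs[i];
                /* old test: skip the extra check only when (addr+15)&7 == 0 */
                int old_skips = ((addr + 15) & KASAN_SHADOW_MASK) == 0;
                /* correct test: 2 shadow bytes suffice iff addr is 8-aligned */
                int aligned = (addr & KASAN_SHADOW_MASK) == 0;

                printf("addr=%2lu: %lu shadow bytes, old test skips extra "
                       "check: %d, IS_ALIGNED(addr, 8): %d\n",
                       addr, shadow_bytes_for_16(addr), old_skips, aligned);
            }
            return 0;
        }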

    Signed-off-by: Xishi Qiu
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

12 Sep, 2015

3 commits

  • Merge fourth patch-bomb from Andrew Morton:

    - sys_membarrier syscall

    - seq_file interface changes

    - a few misc fixups

    * emailed patches from Andrew Morton :
    revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
    mm/early_ioremap: add explicit #include of asm/early_ioremap.h
    fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
    selftests: enhance membarrier syscall test
    selftests: add membarrier syscall test
    sys_membarrier(): system-wide memory barrier (generic, x86)
    MODSIGN: fix a compilation warning in extract-cert

    Linus Torvalds
     
  • Pull media updates from Mauro Carvalho Chehab:
    "A series of patches that move part of the code used to allocate memory
    from the media subsystem to the mm subsystem"

    [ The mm parts have been acked by VM people, and the series was
    apparently in -mm for a while - Linus ]

    * tag 'media/v4.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
    [media] media: vb2: Remove unused functions
    [media] media: vb2: Convert vb2_dc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_vmalloc_get_userptr() to use frame vector
    [media] media: vb2: Convert vb2_dma_sg_get_userptr() to use frame vector
    [media] vb2: Provide helpers for mapping virtual addresses
    [media] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
    [media] mm: Provide new get_vaddr_frames() helper
    [media] vb2: Push mmap_sem down to memops

    Linus Torvalds
     
  • Commit 6b0f68e32ea8 ("mm: add utility for early copy from unmapped ram")
    introduces a function copy_from_early_mem() into mm/early_ioremap.c
    which itself calls early_memremap()/early_memunmap(). However, since
    early_memunmap() has not been declared yet at this point in the .c file,
    nor by any explicitly included header files, we are depending on a
    transitive include of asm/early_ioremap.h to declare it, which is
    fragile.

    So instead, include this header explicitly.

    Signed-off-by: Ard Biesheuvel
    Acked-by: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     

11 Sep, 2015

13 commits

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry
    point and blkg creation. This in itself is an improvement and helps
    collect common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset make blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     
  • Let's use helper rather than direct check of vma->vm_ops to distinguish
    anonymous VMA.
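
    The helper in question is essentially a one-liner; a sketch of its
    likely definition (include/linux/mm.h):

        /* An anonymous VMA is one with no vm_ops attached. */
        static inline bool vma_is_anonymous(struct vm_area_struct *vma)
        {
            return !vma->vm_ops;
        }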

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We rely on vma->vm_ops == NULL to detect an anonymous VMA: see
    vma_is_anonymous(), but some drivers don't set ->vm_ops.

    As a result we can end up with an anonymous page in a private file
    mapping. That should not lead to serious misbehaviour, but it is
    nevertheless wrong.

    Let's fix this by setting up a dummy ->vm_ops for a file mapping if
    f_op->mmap() didn't set its own.

    The patch also adds a sanity check to __vma_link_rb(). It will help
    catch broken VMAs which are inserted directly into mm_struct via
    insert_vm_struct().
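
    A hedged sketch of the approach (mm/mmap.c); the exact placement and
    naming in the real patch may differ:

        /* Hedged sketch: give file mappings an empty vm_ops so that
         * vma_is_anonymous() (i.e. vm_ops == NULL) stays reliable. */
        static const struct vm_operations_struct dummy_vm_ops = {};

        /* ... in mmap_region(), after file->f_op->mmap(file, vma) ... */
        if (!vma->vm_ops)
            vma->vm_ops = &dummy_vm_ops;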

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
    rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
    wrapper on top of do_mmap(). Perhaps we should update the callers of
    do_mmap_pgoff() and kill it later.

    This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and
    need not play with vm internals.

    After this change mmap_region() has a single user outside of mmap.c,
    arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
    change arch/tile/ and unexport mmap_region().
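
    A hedged sketch of the resulting wrapper (include/linux/mm.h); the
    parameter list follows the existing do_mmap_pgoff() and may differ in
    detail from the real patch:

        static inline unsigned long
        do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len,
                      unsigned long prot, unsigned long flags,
                      unsigned long pgoff, unsigned long *populate)
        {
            /* no extra VM_* flags; mpx_mmap() passes VM_MPX to do_mmap() */
            return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
        }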

    [kirill@shutemov.name: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Dave Hansen
    Tested-by: Dave Hansen
    Signed-off-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Instead of a custom approach, let's use the recently introduced
    seq_hex_dump() helper.
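
    A hedged usage sketch: seq_hex_dump() mirrors print_hex_dump(), but
    writes into a seq_file; the buffer here is a placeholder:

        static int example_show(struct seq_file *m, void *v)
        {
            u8 buf[64] = { 0 };     /* placeholder data to dump */

            /* 16 bytes per row, 1-byte groups, offset prefix, ASCII column */
            seq_hex_dump(m, "raw: ", DUMP_PREFIX_OFFSET, 16, 1,
                         buf, sizeof(buf), true);
            return 0;
        }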

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: Tadeusz Struk
    Cc: Helge Deller
    Cc: Ingo Tuchscherer
    Acked-by: Catalin Marinas
    Cc: Vladimir Kondratiev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.
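
    As a rough userspace illustration (assuming root and
    CONFIG_IDLE_PAGE_TRACKING; offsets follow the 64-PFNs-per-word layout
    described above), one could mark a single PFN idle and re-check it
    later:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        #define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

        static int set_pfn_idle(unsigned long pfn)
        {
            uint64_t word = 1ULL << (pfn % 64);
            int fd = open(BITMAP_PATH, O_WRONLY);

            if (fd < 0)
                return -1;
            /* the written value is OR-ed into the bitmap by the kernel */
            if (pwrite(fd, &word, sizeof(word), (pfn / 64) * 8) != sizeof(word)) {
                close(fd);
                return -1;
            }
            return close(fd);
        }

        static int pfn_is_idle(unsigned long pfn)
        {
            uint64_t word = 0;
            int fd = open(BITMAP_PATH, O_RDONLY);
            ssize_t n;

            if (fd < 0)
                return -1;
            n = pread(fd, &word, sizeof(word), (pfn / 64) * 8);
            close(fd);
            if (n != (ssize_t)sizeof(word))
                return -1;
            return (word >> (pfn % 64)) & 1;
        }

        int main(int argc, char **argv)
        {
            unsigned long pfn = argc > 1 ? strtoul(argv[1], NULL, 0) : 0;

            if (set_pfn_idle(pfn))
                perror("set idle");
            printf("pfn %lu idle: %d\n", pfn, pfn_is_idle(pfn));
            return 0;
        }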

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in primary, but also in secondary ptes. The latter is required in order
    to estimate the working set size of KVM VMs. At the same time we want
    to avoid flushing the TLB, because it is quite expensive and it won't
    really affect the final result.

    Currently, there is no function for clearing pte young bit that would meet
    our requirements, so this patch introduces one. To achieve that we have
    to add a new mmu-notifier callback, clear_young, since there is no method
    for testing-and-clearing a secondary pte w/o flushing tlb. The new method
    is not mandatory and currently only implemented by KVM.
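
    A hedged sketch of the new callback's shape (include/linux/mmu_notifier.h);
    the exact prototype in the merged patch may differ:

        struct mmu_notifier_ops {
            /* ... existing callbacks (clear_flush_young, etc.) ... */

            /* Test-and-clear the accessed bit for the range without
             * flushing the TLB; optional, currently only KVM implements it. */
            int (*clear_young)(struct mmu_notifier *mn,
                               struct mm_struct *mm,
                               unsigned long start,
                               unsigned long end);
        };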

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is only used in mem_cgroup_try_charge, so fold it in and zap it.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    Hwpoison allows filtering pages by memory cgroup ino. Currently, it
    calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
    then its ino using cgroup_ino, but now we have a helper method for
    that, page_cgroup_ino, so use it instead.

    This patch also loosens the hwpoison memcg filter dependency rules - it
    makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because
    hwpoison memcg filter does not require anything (nor did it ever) from
    CONFIG_MEMCG_SWAP side.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces a new user API for tracking user memory pages
    that have not been used for a given period of time. The purpose of this
    is to provide the userspace with the means of tracking a workload's
    working set, i.e. the set of pages that are actively used by the
    workload. Knowing the working set size can be useful for partitioning the
    system more efficiently, e.g. by tuning memory cgroup limits
    appropriately, or for job placement within a compute cluster.

    ==== USE CASES ====

    The unified cgroup hierarchy has memory.low and memory.high knobs, which
    are defined as the low and high boundaries for the workload working set
    size. However, the working set size of a workload may be unknown or
    change in time. With this patch set, one can periodically estimate the
    amount of memory unused by each cgroup and tune their memory.low and
    memory.high parameters accordingly, therefore optimizing the overall
    memory utilization.

    Another use case is balancing workloads within a compute cluster. Knowing
    how much memory is not really used by a workload unit may help take a more
    optimal decision when considering migrating the unit to another node
    within the cluster.

    Also, as noted by Minchan, this would be useful for per-process reclaim
    (https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
    memory manager could reclaim only the idle pages.

    ==== USER API ====

    The user API consists of two new files:

    * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each
    bit corresponds to a page, indexed by PFN. When the bit is set, the
    corresponding page is idle. A page is considered idle if it has not been
    accessed since it was marked idle. To mark a page idle one should set the
    bit corresponding to the page by writing to the file. A value written to the
    file is OR-ed with the current bitmap value. Only user memory pages can be
    marked idle, for other page types input is silently ignored. Writing to this
    file beyond max PFN results in the ENXIO error. Only available when
    CONFIG_IDLE_PAGE_TRACKING is set.

    This file can be used to estimate the amount of pages that are not
    used by a particular workload as follows:

    1. mark all pages of interest idle by setting corresponding bits in the
    /sys/kernel/mm/page_idle/bitmap
    2. wait until the workload accesses its working set
    3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set

    * /proc/kpagecgroup. This file contains a 64-bit inode number of the
    memory cgroup each page is charged to, indexed by PFN. Only available when
    CONFIG_MEMCG is set.

    This file can be used to find all pages (including unmapped file pages)
    accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
    can then estimate the cgroup working set size.

    For an example of using these files for estimating the amount of unused
    memory pages per each memory cgroup, please see the script attached
    below.

    ==== REASONING ====

    The reason to introduce the new user API instead of using
    /proc/PID/{clear_refs,smaps} is that the latter has two serious
    drawbacks:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    The new API attempts to overcome them both. For more details on how it
    is achieved, please see the comment to patch 6.

    ==== PATCHSET STRUCTURE ====

    The patch set is organized as follows:

    - patch 1 adds page_cgroup_ino() helper for the sake of
    /proc/kpagecgroup and patches 2-3 do related cleanup
    - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
    charged to
    - patch 5 introduces a new mmu notifier callback, clear_young, which is
    a lightweight version of clear_flush_young; it is used in patch 6
    - patch 6 implements the idle page tracking feature, including the
    userspace API, /sys/kernel/mm/page_idle/bitmap
    - patch 7 exports idle flag via /proc/kpageflags

    ==== SIMILAR WORKS ====

    Originally, the patch for tracking idle memory was proposed back in 2011
    by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
    difference between Michel's patch and this one is that Michel implemented
    a kernel space daemon for estimating idle memory size per cgroup while
    this patch only provides the userspace with the minimal API for doing the
    job, leaving the rest up to the userspace. However, they both share the
    same idea of Idle/Young page flags to avoid affecting the reclaimer logic.

    ==== PERFORMANCE EVALUATION ====

    SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
    performance impact introduced by this patch set. Three runs were carried
    out:

    - base: kernel without the patch
    - patched: patched kernel, the feature is not used
    - patched-active: patched kernel, 1 minute-period daemon is used for
    tracking idle memory

    For tracking idle memory, idlememstat utility was used:
    https://github.com/locker/idlememstat

    testcase        base              patched           patched-active

    compiler        537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
    compress        305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
    crypto          284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
    derby           411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
    mpegaudio       189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
    scimark.large    46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
    scimark.small   412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
    serial          204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
    startup          36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
    sunflow         115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
    xml             620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%

    composite       211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%

    time idlememstat:

    17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
    448inputs+40outputs (1major+36052minor)pagefaults 0swaps

    ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
    #! /usr/bin/python
    #

    import os
    import stat
    import errno
    import struct

    CGROUP_MOUNT = "/sys/fs/cgroup/memory"
    BUFSIZE = 8 * 1024  # must be multiple of 8

    def get_hugepage_size():
        with open("/proc/meminfo", "r") as f:
            for s in f:
                k, v = s.split(":")
                if k == "Hugepagesize":
                    return int(v.split()[0]) * 1024

    PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
    HUGEPAGE_SIZE = get_hugepage_size()

    def set_idle():
        f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
        while True:
            try:
                f.write(struct.pack("Q", pow(2, 64) - 1))
            except IOError as err:
                if err.errno == errno.ENXIO:
                    break
                raise
        f.close()

    def count_idle():
        f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
        f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

        with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
            while f.read(BUFSIZE): pass  # update idle flag

        idlememsz = {}
        while True:
            s1, s2 = f_flags.read(8), f_cgroup.read(8)
            if not s1 or not s2:
                break

            flags, = struct.unpack('Q', s1)
            cgino, = struct.unpack('Q', s2)

            unevictable = (flags >> 18) & 1
            huge = (flags >> 22) & 1
            idle = (flags >> 25) & 1

            if idle and not unevictable:
                idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                    (HUGEPAGE_SIZE if huge else PAGE_SIZE)

        f_flags.close()
        f_cgroup.close()
        return idlememsz

    if __name__ == "__main__":
        print "Setting the idle flag for each page..."
        set_idle()

        raw_input("Wait until the workload accesses its working set, "
                  "then press Enter")

        print "Counting idle pages..."
        idlememsz = count_idle()

        for dir, subdirs, files in os.walk(CGROUP_MOUNT):
            ino = os.stat(dir)[stat.ST_INO]
            print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
    ==== END SCRIPT ====

    This patch (of 8):

    Add page_cgroup_ino() helper to memcg.

    This function returns the inode number of the closest online ancestor of
    the memory cgroup a page is charged to. It is required for exporting
    information about which page is charged to which cgroup to userspace,
    which will be introduced by a following patch.
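
    A hedged sketch of what such a helper can look like (mm/memcontrol.c);
    the locking and the exact ancestor walk in the merged patch may differ:

        ino_t page_cgroup_ino(struct page *page)
        {
            struct mem_cgroup *memcg;
            unsigned long ino = 0;

            rcu_read_lock();
            memcg = READ_ONCE(page->mem_cgroup);
            /* walk up to the closest online ancestor */
            while (memcg && !(memcg->css.flags & CSS_ONLINE))
                memcg = parent_mem_cgroup(memcg);
            if (memcg)
                ino = cgroup_ino(memcg->css.cgroup);
            rcu_read_unlock();
            return ino;
        }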

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Update the zpool and compressor parameters to be changeable at runtime.
    When changed, a new pool is created with the requested zpool/compressor,
    and added as the current pool at the front of the pool list. Previous
    pools remain in the list only to remove existing compressed pages from.
    The old pool(s) are removed once they become empty.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add dynamic creation of pools. Move the static crypto compression per-cpu
    transforms into each pool. Add a pointer to zswap_entry to the pool it's
    in.

    This is required by the following patch which enables changing the zswap
    zpool and compressor params at runtime.

    [akpm@linux-foundation.org: fix merge snafus]
    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • This series makes creation of the zpool and compressor dynamic, so that
    they can be changed at runtime. This makes using/configuring zswap
    easier, as before this zswap had to be configured at boot time, using boot
    params.

    This uses a single list to track both the zpool and compressor together,
    although Seth had mentioned an alternative which is to track the zpools
    and compressors using separate lists. In the most common case, only a
    single zpool and single compressor, using one list is slightly simpler
    than using two lists, and for the uncommon case of multiple zpools and/or
    compressors, using one list is slightly less simple (and uses slightly
    more memory, probably) than using two lists.

    This patch (of 4):

    Add zpool_has_pool() function, indicating if the specified type of zpool
    is available (i.e. zsmalloc or zbud). This allows checking if a pool is
    available, without actually trying to allocate it, similar to
    crypto_has_alg().

    This is used by a following patch to zswap that enables the dynamic
    runtime creation of zswap zpools.
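
    A hedged usage sketch, e.g. from zswap when a new backend is requested
    at runtime (the error handling shown here is illustrative):

        /* Probe for the backend before committing to it, analogous to
         * crypto_has_alg() for compressors. */
        if (!zpool_has_pool(type)) {
            pr_err("zpool backend %s not available\n", type);
            return -ENOENT;
        }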

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

09 Sep, 2015

19 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • Remove zpool_init() and zpool_exit(); they do nothing other than print
    "loaded" and "unloaded".

    Signed-off-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The structure zbud_ops is not modified so make the pointer to it a
    pointer to const.

    Signed-off-by: Krzysztof Kozlowski
    Acked-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • The structure zpool_ops is not modified so make the pointer to it a
    pointer to const.

    Signed-off-by: Krzysztof Kozlowski
    Acked-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
    zswap_get_swap_cache_page and read_swap_cache_async have pretty much
    the same code, with the only significant differences being the return
    value and the usage of swap_readpage.

    I added a helper, __read_swap_cache_async(), with the common code.
    Behavior change: zswap_get_swap_cache_page will now use
    radix_tree_maybe_preload instead of radix_tree_preload. It looks like
    this difference existed only because of the code duplication.
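
    A hedged sketch of the resulting split (mm/swap_state.c); the extra
    out-parameter name is indicative:

        struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                                           struct vm_area_struct *vma,
                                           unsigned long addr)
        {
            bool page_was_allocated;
            struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
                                            vma, addr, &page_was_allocated);

            /* only the caller that actually allocated the page reads it in */
            if (page_was_allocated)
                swap_readpage(retpage);

            return retpage;
        }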

    Signed-off-by: Dmitry Safonov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: David Herrmann
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • We can pass a NULL cache pointer to kmem_cache_destroy(), because it
    NULL-checks its argument now. Remove redundant test from
    destroy_handle_cache().

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • We can avoid taking class ->lock around zs_can_compact() in
    zs_shrinker_count(), because the number that we return back is outdated
    in general case, by design. We have different sources that are able to
    change class's state right after we return from zs_can_compact() --
    ongoing I/O operations, manually triggered compaction, or two of them
    happening simultaneously.

    We redo these calculations during compaction on a per-class basis
    anyway.

    zs_unregister_shrinker() will not return until we have an active
    shrinker, so classes won't unexpectedly disappear while
    zs_shrinker_count() iterates them.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
    There is no need to recalculate pages_per_zspage at runtime. Just use
    class->pages_per_zspage to avoid unnecessary runtime overhead.

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    There is no reason to prevent selecting ZS_ALMOST_FULL as a migration
    source if we cannot find a source in ZS_ALMOST_EMPTY.

    With this patch, zs_can_compact will return a more exact result.

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We want to see more ZS_FULL pages and less ZS_ALMOST_{FULL, EMPTY}
    pages. Put a page with higher ->inuse count first within its
    ->fullness_list, which will give us better chances to fill up this page
    with new objects (find_get_zspage() return ->fullness_list head for new
    object allocation), so some zspages will become ZS_ALMOST_FULL/ZS_FULL
    quicker.

    It performs a trivial and cheap ->inuse compare which does not slow down
    zsmalloc and in the worst case keeps the list pages in no particular
    order.

    A more expensive solution could sort fullness_list by ->inuse count.

    [minchan@kernel.org: code adjustments]
    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Perform automatic pool compaction by a shrinker when system is getting
    tight on memory.

    User space has very little knowledge of zsmalloc fragmentation and
    basically no mechanism to tell whether compaction will result in any
    memory gain. Another issue is that user space is not always aware that
    the system is getting tight on memory, which leads to very
    uncomfortable scenarios where user space may start issuing compaction
    'randomly' or from crontab (for example). Fragmentation is not always
    necessarily bad: allocated but unused objects may, after all, be
    filled with data later, without the need to allocate a new zspage. On
    the other hand, we obviously don't want to waste memory when the
    system needs it.

    Compaction now has a relatively quick pool scan so we are able to
    estimate the number of pages that will be freed easily, which makes it
    possible to call this function from a shrinker->count_objects()
    callback. We also abort compaction as soon as we detect that we can't
    free any pages any more, preventing wasteful objects migrations.
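
    A hedged sketch of the wiring (mm/zsmalloc.c), assuming a struct
    shrinker embedded in zs_pool; zs_register_shrinker is an illustrative
    name and the callback bodies are elided:

        static unsigned long zs_shrinker_count(struct shrinker *shrinker,
                                               struct shrink_control *sc)
        {
            unsigned long pages_to_free = 0;

            /* sum zs_can_compact() over all size classes (elided) */
            return pages_to_free;
        }

        static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
                                              struct shrink_control *sc)
        {
            /* run compaction and return the number of pages freed (elided) */
            return 0;
        }

        static int zs_register_shrinker(struct zs_pool *pool)
        {
            pool->shrinker.count_objects = zs_shrinker_count;
            pool->shrinker.scan_objects = zs_shrinker_scan;
            pool->shrinker.seeks = DEFAULT_SEEKS;
            pool->shrinker.batch = 0;

            return register_shrinker(&pool->shrinker);
        }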

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Compaction returns back to zram the number of migrated objects, which is
    quite uninformative -- we have objects of different sizes so user space
    cannot obtain any valuable data from that number. Change compaction to
    operate in terms of pages and return back to compaction issuer the
    number of pages that were freed during compaction. So from now on we
    will export more meaningful value in zram/mm_stat -- the number of
    freed (compacted) pages.

    This requires:
    (a) a rename of `num_migrated' to 'pages_compacted'
    (b) an internal API change -- return first_page's fullness_group from
    putback_zspage(), so we know when putback_zspage() did
    free_zspage(). It helps us to account compaction stats correctly.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • `zs_compact_control' accounts the number of migrated objects but it has
    a limited lifespan -- we lose it as soon as zs_compaction() returns back
    to zram. It worked fine, because (a) zram had its own counter of
    migrated objects and (b) only zram could trigger compaction. However,
    this does not work for automatic pool compaction (not issued by zram).
    To account objects migrated during auto-compaction (issued by the
    shrinker) we need to store this number in zs_pool.

    Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
    there. It provides only `num_migrated', as of this writing, but it
    surely can be extended.

    A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
    caller.

    Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
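
    A hedged sketch of the new API surface (include/linux/zsmalloc.h);
    field and function shapes follow the description above and may differ
    in detail:

        struct zs_pool_stats {
            /* number of pages migrated (freed) by compaction so far */
            unsigned long num_migrated;
        };

        /* copy the pool's stats into the caller-provided structure */
        void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);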

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Change zs_object_copy() argument order to be (DST, SRC) rather than
    (SRC, DST); copy/move functions usually use a (to, from) argument
    order.

    Rename alloc_target_page() to isolate_target_page(). This function
    doesn't allocate anything, it isolates target page, pretty much like
    isolate_source_page().

    Tweak __zs_compact() comment.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This function checks if class compaction will free any pages.
    Rephrasing -- do we have enough unused objects to form at least one
    ZS_EMPTY page and free it. It aborts compaction if class compaction
    will not result in any (further) savings.

    EXAMPLE (this debug output is not part of this patch set):

    - class size
    - number of allocated objects
    - number of used objects
    - max objects per zspage
    - pages per zspage
    - estimated number of pages that will be freed

    [..]
    class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0
    ... class-512 compaction is useless. break
    class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2
    class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1
    class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0
    ... class-496 compaction is useless. break
    class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4
    class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3
    class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2
    class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1
    class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0
    ... class-448 compaction is useless. break
    class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1
    class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0
    ... class-432 compaction is useless. break
    class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2
    class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1
    class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0
    ... class-416 compaction is useless. break
    class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1
    class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0
    ... class-400 compaction is useless. break
    class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0
    ... class-384 compaction is useless. break
    [..]

    Every "compaction is useless" indicates that we saved CPU cycles.

    class-512 has
    544 objects allocated
    540 objects used
    8 objects per-page

    Even if we have an ALMOST_EMPTY zspage, we still don't have enough room to
    migrate all of its objects and free this zspage; so compaction will not
    make a lot of sense, it's better to just leave it as is.
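
    The estimate itself is simple arithmetic; a standalone illustration
    (not kernel code) reproducing a few of the numbers from the debug
    output above:

        #include <stdio.h>

        /* zspages a class could free: unused objects divided by objects per
         * zspage, rounded down (times pages-per-zspage gives pages freed). */
        static unsigned long zspages_to_free(unsigned long allocated,
                                             unsigned long used,
                                             unsigned long maxobj_per_zspage)
        {
            return (allocated - used) / maxobj_per_zspage;
        }

        int main(void)
        {
            /* class-512 above: 544 allocated, 540 used, 8 objs/zspage -> 0 */
            printf("class-512: %lu\n", zspages_to_free(544, 540, 8));
            /* class-496 above: 660 allocated, 570 used, 33 objs/zspage -> 2 */
            printf("class-496: %lu\n", zspages_to_free(660, 570, 33));
            /* class-448 above: 657 allocated, 617 used, 9 objs/zspage -> 4 */
            printf("class-448: %lu\n", zspages_to_free(657, 617, 9));
            return 0;
        }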

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Always account per-class `zs_size_stat' stats. This data will help us
    make better decisions during compaction. We are especially interested
    in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
    will result in any memory gain.

    For instance, we know the number of allocated objects in the class, the
    number of objects being used (so we also know how many objects are not
    used) and the number of objects per page. So we can tell whether we have
    enough unused objects to form at least one ZS_EMPTY zspage during
    compaction.

    We calculate this value on per-class basis so we can calculate a total
    number of zspages that can be released. Which is exactly what a
    shrinker wants to know.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patchset tweaks compaction and makes it possible to trigger pool
    compaction automatically when system is getting low on memory.

    zsmalloc in some cases can suffer from a notable fragmentation and
    compaction can release some considerable amount of memory. The problem
    here is that currently we fully rely on user space to perform compaction
    when needed. However, performing zsmalloc compaction is not always an
    obvious thing to do. For example, suppose we have an `idle' fragmented
    (compaction was never performed) zram device and system is getting low
    on memory due to some 3rd party user processes (gcc LTO, or firefox,
    etc.). It's quite unlikely that user space will issue zpool compaction
    in this case. Besides, user space cannot tell for sure how badly the pool
    is fragmented; however, this info is known to zsmalloc and, hence, to a
    shrinker.

    This patch (of 7):

    __zs_compact() does not use `nr_to_migrate', drop it.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
    For a memoryless node, the outputs of get_pfn_range_for_nid() are all
    zero, so it will display memory ranging from 0 to -1.

    Signed-off-by: Zhen Lei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei