01 Dec, 2018

1 commit

  • [ Upstream commit ca0246bb97c23da9d267c2107c07fb77e38205c9 ]

    Reclaim and free can race on an object, which is basically fine, but
    in order for reclaim to be able to map a "freed" object we need to
    encode the object length in the handle. handle_to_chunks() is then
    introduced to extract the object length from a handle and use it
    during mapping.
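
    A minimal sketch of the encoding idea (the bit layout and helper
    names here are illustrative, not necessarily the exact upstream
    code): the low bits of a handle are free due to alignment, so the
    object length in chunks can be stored next to the buddy index.

        #define BUDDY_MASK   0x3UL   /* two bits for the buddy index */
        #define BUDDY_SHIFT  2

        static unsigned long encode_handle(unsigned long base,
                                           unsigned long buddy,
                                           unsigned short chunks)
        {
                /* handle = page-aligned base | chunks | buddy index */
                return base | (chunks << BUDDY_SHIFT) | (buddy & BUDDY_MASK);
        }

        static unsigned short handle_to_chunks(unsigned long handle)
        {
                /* recover the object length stored at encode time */
                return (handle & ~PAGE_MASK) >> BUDDY_SHIFT;
        }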

    Moreover, to avoid racing on a z3fold "headless" page release, we should
    not try to free that page in z3fold_free() if the reclaim bit is set.
    Also, in the unlikely case of trying to reclaim a page being freed, we
    should not proceed with that page.

    While at it, fix the page accounting in the reclaim function.

    This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

    Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: Jongseok Kim
    Reported-by: Jongseok Kim
    Reviewed-by: Snild Dolkow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vitaly Wool
     

30 May, 2018

1 commit

  • [ Upstream commit 1ec6995d1290bfb87cc3a51f0836c889e857cef9 ]

    In z3fold_create_pool(), the memory allocated by __alloc_percpu() is
    not released on the error path where pool->compact_wq, which holds
    the return value of create_singlethread_workqueue(), is NULL. This
    results in a memory leak.
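
    A simplified sketch of the fixed error handling (field and constant
    names follow mm/z3fold.c, but the function is abbreviated and not
    the verbatim fix):

        static struct z3fold_pool *z3fold_create_pool(gfp_t gfp)
        {
                struct z3fold_pool *pool;

                pool = kzalloc(sizeof(struct z3fold_pool), gfp);
                if (!pool)
                        return NULL;    /* kzalloc() failure is checked */
                pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS,
                                                 __alignof__(struct list_head));
                if (!pool->unbuddied)   /* __alloc_percpu() retval checked */
                        goto out_pool;
                pool->compact_wq = create_singlethread_workqueue("z3fold");
                if (!pool->compact_wq)
                        goto out_unbuddied;
                return pool;

        out_unbuddied:
                free_percpu(pool->unbuddied);   /* the previously missing release */
        out_pool:
                kfree(pool);
                return NULL;
        }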

    [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
    Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
    Signed-off-by: Xidong Wang
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xidong Wang
     

16 May, 2018

1 commit

  • commit 6098d7e136692f9c6e23ae362c62ec822343e4d5 upstream.

    Do not try to optimize in-page object layout while the page is under
    reclaim. This fixes lock-ups on reclaim and improves reclaim
    performance at the same time.
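
    A sketch of the guard, assuming a per-page UNDER_RECLAIM flag bit
    kept in page->private (the flag name and wrapper function are
    illustrative):

        static void maybe_compact(struct z3fold_header *zhdr,
                                  struct page *page)
        {
                /* reclaim owns the page: leave its layout alone */
                if (test_bit(UNDER_RECLAIM, &page->private))
                        return;
                z3fold_compact_page(zhdr);
        }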

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20180430125800.444cae9706489f412ad12621@gmail.com
    Signed-off-by: Vitaly Wool
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     

30 Nov, 2017

1 commit

  • commit 5d03a6613957785e94af7a4a6212ad4af66aa5c2 upstream.

    There is a race in the current z3fold implementation between
    do_compact() called in a work queue context and the page release
    procedure when the page's kref goes to 0.

    do_compact() may be waiting for the page lock, which is released by
    release_z3fold_page_locked right before putting the page onto the
    "stale" list, and then the page may be freed while do_compact() is
    still modifying its contents.

    The mechanism currently implemented to handle that (checking the
    PAGE_STALE flag) is not reliable enough. Instead, we'll use the
    page's kref counter to guarantee that the page is not released while
    its compaction is scheduled. It then becomes the compaction
    function's responsibility to decrease the counter and quit
    immediately if the page was actually freed.
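
    A simplified sketch of the scheme (function names follow
    mm/z3fold.c, but the bodies are illustrative):

        /* pin the page with an extra reference while work is pending */
        static void queue_compaction(struct z3fold_pool *pool,
                                     struct z3fold_header *zhdr)
        {
                kref_get(&zhdr->refcount);
                queue_work(pool->compact_wq, &zhdr->work);
        }

        static void do_compact(struct work_struct *w)
        {
                struct z3fold_header *zhdr =
                        container_of(w, struct z3fold_header, work);

                z3fold_page_lock(zhdr);
                /* drop the pinning reference; if it was the last one,
                 * the page was freed meanwhile: quit immediately
                 * (release_z3fold_page_locked() is assumed to unlock
                 * and free the page) */
                if (kref_put(&zhdr->refcount, release_z3fold_page_locked))
                        return;
                z3fold_compact_page(zhdr);
                z3fold_page_unlock(zhdr);
        }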

    Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
    Signed-off-by: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     

04 Oct, 2017

2 commits

  • Fix the situation where clear_bit() is called for page->private
    before the page pointer is actually assigned. While at it, remove
    the work_busy() check because it is costly and does not give a 100%
    guarantee anyway.

    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • It is possible that on a (partially) unsuccessful page reclaim, the
    kref_put() called in z3fold_reclaim_page() does not yield a page
    release, but the page is released shortly afterwards by another
    thread. z3fold_reclaim_page() would then try to list_add() that
    (released) page again, which is obviously a bug.

    To avoid that, the spin_lock() has to be taken earlier, before the
    kref_put() call mentioned above.
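
    A sketch of the corrected ordering (the wrapper function is
    hypothetical; the list handling is abbreviated):

        static int z3fold_reclaim_one(struct z3fold_pool *pool,
                                      struct z3fold_header *zhdr,
                                      struct page *page)
        {
                /* take pool->lock BEFORE the kref_put() ... */
                spin_lock(&pool->lock);
                if (kref_put(&zhdr->refcount, release_z3fold_page)) {
                        /* page released: never touch it again */
                        spin_unlock(&pool->lock);
                        return -EAGAIN;
                }
                /* ... so a still-alive page can safely be re-added */
                list_add(&page->lru, &pool->lru);
                spin_unlock(&pool->lock);
                return 0;
        }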

    Link: http://lkml.kernel.org/r/20170913162937.bfff21c7d12b12a5f47639fd@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

07 Sep, 2017

1 commit

  • It's been noted that z3fold doesn't scale well when run in a large
    number of threads on many cores, which can easily be reproduced with
    the fio 'randrw' test with --numjobs=32. E.g. the result for 1
    cluster (4 cores) is:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
    WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...

    While for 8 cores (2 clusters) the result is:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
    WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...

    The bottleneck here is the pool lock, which many threads end up
    waiting on. To reduce that spinlock contention, z3fold can operate
    only on the lists local to the current CPU whenever possible. Due to
    the nature of z3fold unbuddied list handling (it only takes the
    first entry off the list on a hot path), if the z3fold pool is big
    enough and balanced well enough, limiting the search to only the
    local unbuddied list doesn't lead to a significant compression ratio
    degradation (2.57x vs 2.65x in our measurements).

    This patch also introduces two worker threads: one for async in-page
    object layout optimization and one for releasing freed pages. This
    is done to speed up z3fold_free(), which is often on a hot path.

    The fio results for the 8-core case are now the following:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
    WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...

    So we're in for an almost 6x performance increase.
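
    A sketch of the per-CPU fast path described above (simplified: the
    per-page locking and the fallback to other CPUs' lists are elided):

        static struct z3fold_header *find_unbuddied(struct z3fold_pool *pool,
                                                    int chunks)
        {
                /* NCHUNKS free-space-bucketed lists per CPU */
                struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied);
                struct z3fold_header *zhdr = NULL;
                int i;

                for (i = chunks; i < NCHUNKS; i++) {
                        if (!list_empty(&unbuddied[i])) {
                                zhdr = list_first_entry(&unbuddied[i],
                                                struct z3fold_header, buddy);
                                list_del_init(&zhdr->buddy);
                                break;
                        }
                }
                put_cpu_ptr(pool->unbuddied);
                return zhdr;
        }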

    Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

14 Apr, 2017

1 commit

  • Stress testing of the current z3fold implementation on an 8-core
    system revealed it was possible that a z3fold page deleted from its
    unbuddied list in z3fold_alloc() would be put on another unbuddied
    list by z3fold_free() while z3fold_alloc() was still processing it.
    This was introduced by commit 5a27aa822 ("z3fold: add kref
    refcounting") due to the removal of special handling of a z3fold
    page not on any list in z3fold_free().

    To fix this, the z3fold page lock should be taken in z3fold_alloc()
    before the pool lock is released. To avoid deadlocking, we just try
    to lock the page as soon as we get hold of it, and if the trylock
    fails, we drop this page and take the next one.
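
    A simplified sketch of the trylock-or-skip logic (the helper is
    hypothetical; in the real code this sits inline in z3fold_alloc()):

        /* called with pool->lock held */
        static struct z3fold_header *grab_unbuddied(struct list_head *list)
        {
                struct z3fold_header *zhdr;

                list_for_each_entry(zhdr, list, buddy) {
                        /* take the page lock before pool->lock is dropped */
                        if (spin_trylock(&zhdr->page_lock)) {
                                list_del_init(&zhdr->buddy);
                                return zhdr;
                        }
                        /* trylock failed: skip this page, try the next */
                }
                return NULL;
        }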

    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

17 Mar, 2017

1 commit

  • Commit 5a27aa822029 ("z3fold: add kref refcounting") introduced a
    bug in z3fold_reclaim_page(): a function exit path may leave the
    pool->lock spinlock held. Here comes the trivial fix.
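
    A sketch of the kind of exit path involved (illustrative, not the
    exact upstream diff):

        if (!zhdr) {
                spin_unlock(&pool->lock);   /* was missing on this path */
                return -EINVAL;
        }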

    Fixes: 5a27aa822029 ("z3fold: add kref refcounting")
    Link: http://lkml.kernel.org/r/20170311222239.7b83d8e7ef1914e05497649f@gmail.com
    Reported-by: Alexey Khoroshilov
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

25 Feb, 2017

5 commits

  • With both the upcoming and the already present locking
    optimizations, introducing a kref to reference-count z3fold objects
    is the right thing to do. Moreover, it makes the buddied list
    unnecessary and allows for simpler handling of headless pages.
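
    A sketch of the refcounting pattern (the release helper is
    abbreviated; free_z3fold_page() is assumed to free the backing
    page):

        static void release_z3fold_page(struct kref *ref)
        {
                struct z3fold_header *zhdr =
                        container_of(ref, struct z3fold_header, refcount);

                free_z3fold_page(zhdr);
        }

        /* usage: kref_init(&zhdr->refcount) at page creation,
         * kref_get() for each new object placed in the page,
         * kref_put(&zhdr->refcount, release_z3fold_page) when an
         * object goes away; the last put frees the page.
         */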

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170131214650.8ea78033d91ded233f552bc0@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Most z3fold operations are in-page, such as modifying the z3fold
    page header or moving z3fold objects within a page. Taking the
    per-pool spinlock to protect per-page objects is therefore
    suboptimal, and the idea of having a per-page spinlock (or rwlock)
    has been around for some time.

    This patch implements a spinlock-based per-page locking mechanism
    which is lightweight enough to normally fit into the z3fold header.
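
    A sketch of the header with the embedded lock and its helpers
    (field set abbreviated):

        struct z3fold_header {
                struct list_head buddy;
                spinlock_t page_lock;    /* protects this page's layout */
                unsigned short first_chunks;
                unsigned short middle_chunks;
                unsigned short last_chunks;
                unsigned short start_middle;
                unsigned short first_num:2;
        };

        static inline void z3fold_page_lock(struct z3fold_header *zhdr)
        {
                spin_lock(&zhdr->page_lock);
        }

        static inline void z3fold_page_unlock(struct z3fold_header *zhdr)
        {
                spin_unlock(&zhdr->page_lock);
        }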

    Link: http://lkml.kernel.org/r/20170131214438.433e0a5fda908337b63206d3@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • z3fold_compact_page() currently only handles the situation where
    there is a single middle chunk within the z3fold page. However, it
    may be worth moving the middle chunk closer to either the first or
    the last chunk, whichever is present, if the gap between them is big
    enough.

    This patch adds the relevant code, using the BIG_CHUNK_GAP define as
    a threshold for when the middle chunk is worth moving.
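
    A sketch of the new case (the move helper is hypothetical and the
    threshold value illustrative):

        #define BIG_CHUNK_GAP   3

        /* first chunk present, middle chunk far away: pull it down */
        if (zhdr->first_chunks != 0 && zhdr->last_chunks == 0 &&
            zhdr->start_middle > zhdr->first_chunks + BIG_CHUNK_GAP) {
                unsigned short dest = zhdr->first_chunks + ZHDR_CHUNKS;

                mchunk_memmove(zhdr, dest);     /* move the middle chunk */
                zhdr->start_middle = dest;
        }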

    Link: http://lkml.kernel.org/r/20170131214334.c4f3eac9a477af0fa9a22c46@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Currently the whole kernel build is stopped if the size of struct
    z3fold_header is greater than the size of one chunk, which is 64
    bytes by default. This patch instead defines the offset for z3fold
    objects as the size of the z3fold header in chunks.

    Also fixed are the calculation in num_free_chunks() and the address
    to move the middle chunk to in case of in-page compaction in
    z3fold_compact_page().
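
    A sketch of the chunk-based offset definitions (mirroring the
    mm/z3fold.c constants; NCHUNKS_ORDER is 6 by default, giving
    64-byte chunks on 4K pages):

        #define NCHUNKS_ORDER       6
        #define CHUNK_SHIFT         (PAGE_SHIFT - NCHUNKS_ORDER)
        #define CHUNK_SIZE          (1 << CHUNK_SHIFT)
        /* header size rounded up to a whole number of chunks */
        #define ZHDR_SIZE_ALIGNED   round_up(sizeof(struct z3fold_header), \
                                             CHUNK_SIZE)
        /* objects start this many chunks into the page */
        #define ZHDR_CHUNKS         (ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)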

    Link: http://lkml.kernel.org/r/20170131214057.d98677032bc7b1c6c59a80c9@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • Convert pages_nr per-pool counter to atomic64_t.
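
    A sketch of the result: reads of the counter no longer need
    pool->lock.

        static u64 z3fold_get_pool_size(struct z3fold_pool *pool)
        {
                return atomic64_read(&pool->pages_nr);
        }

        /* atomic64_inc(&pool->pages_nr) on page allocation,
         * atomic64_dec(&pool->pages_nr) on page release */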

    Link: http://lkml.kernel.org/r/20170131213946.b828676ab17bbea42022c213@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

23 Feb, 2017

1 commit

  • At present, tying the first_num size to NCHUNKS_ORDER is confusing;
    the number of chunks is completely unrelated to the number of
    buddies.

    This patch limits first_num to the actual range of possible buddy
    indexes, which is more reasonable and obvious, with no functional
    change.
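
    A sketch of the change (three buddy positions need only two bits):

        #define BUDDY_MASK      (0x3)   /* was ((1 << NCHUNKS_ORDER) - 1) */

        struct z3fold_header {
                /* ... */
                unsigned short first_num:2;   /* bounded by buddy indexes */
        };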

    Link: http://lkml.kernel.org/r/1476776569-29504-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Dan Streetman
    Acked-by: Dan Streetman
    Acked-by: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

04 Jun, 2016

1 commit

  • Fix erroneous z3fold header access for a HEADLESS page in the
    reclaim function, and change one remaining direct handle-to-buddy
    conversion to use the appropriate helper.

    Link: http://lkml.kernel.org/r/5748706F.9020208@gmail.com
    Signed-off-by: Vitaly Wool
    Reviewed-by: Dan Streetman
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     

21 May, 2016

1 commit

  • This patch introduces z3fold, a special purpose allocator for
    storing compressed pages. It is designed to store up to three
    compressed pages per physical page. It is a zbud derivative which
    allows for a higher compression ratio, keeping the simplicity and
    determinism of its predecessor.

    This patch comes as a follow-up to the discussions at the Embedded
    Linux Conference in San Diego related to the talk [1]. The outcome
    of these discussions was that it would be good to have a compressed
    page allocator as stable and deterministic as zbud but with a higher
    compression ratio.

    To keep the determinism and simplicity, z3fold, just like zbud,
    always stores an integral number of compressed pages per page, but
    it can store up to 3 pages, unlike zbud, which can store at most 2.
    Therefore the compression ratio goes up to around 2.6x while zbud's
    is around 1.7x.
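
    A sketch of the in-page layout this implies (illustrative):

        /*
         *  +------------------+ <- page start
         *  | z3fold header    |   chunk counts, list linkage, etc.
         *  +------------------+
         *  | first object     |   grows from the front
         *  +------------------+
         *  | middle object    |   floats; may be moved by compaction
         *  +------------------+
         *  | last object      |   grows from the back
         *  +------------------+ <- page end
         *
         * The page is divided into equal chunks (64 bytes by default);
         * each object occupies an integral number of chunks, and at
         * most three objects fit in one page.
         */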

    The patch is based on the latest linux.git tree.

    This version has been updated after testing on various simulators
    (e.g. ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based
    on comments from Dan Streetman [3].

    [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
    [2] https://lkml.org/lkml/2016/4/21/799
    [3] https://lkml.org/lkml/2016/5/4/852

    Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool