02 May, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
    was initially a fork of CFQ, but subsequently changed to implement
    fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
    to be used on desktop type single drives, providing good fairness.
    From Paolo.

    - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
    using a scalable token based algorithm that throttles IO based on
    live completion IO stats, similarly to blk-wbt. From Omar.

    - A series from Jan, moving users to separately allocated backing
    devices. This continues the work of separating backing device
    lifetimes, solving various problems with hot removal.

    - A series of updates for lightnvm, mostly from Javier. Includes a
    'pblk' target that exposes an open channel SSD as a physical block
    device.

    - A series of fixes and improvements for nbd from Josef.

    - A series from Omar, removing queue sharing between devices on mostly
    legacy drivers. This helps us clean up other bits, if we know that a
    queue only has a single device backing it. This has been overdue for
    more than a decade.

    - Fixes for the blk-stats, and improvements to unify the stats and user
    windows. This both improves blk-wbt, and enables other users to
    register a need to receive IO stats for a device. From Omar.

    - blk-throttle improvements from Shaohua. This provides a scalable
    framework for implementing prioritization - particularly for blk-mq,
    but applicable to any type of block device. The interface is
    marked experimental for now.

    - Bucketized IO stats for IO polling from Stephen Bates. This improves
    efficiency of polled workloads in the presence of mixed block size
    IO.

    - A few fixes for opal, from Scott.

    - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
    From a variety of folks, mostly Sagi and James Smart.

    - A series from Bart, improving our exposed info and capabilities from
    the blk-mq debugfs support.

    - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

    - A series from Christoph, cleaning up the block layer handling of how
    we track errors in a request. On top of being a nice cleanup, it also
    shrinks the size of struct request a bit.

    - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
    never used by platforms, and the latter has outlived its usefulness.

    - Various little bug fixes and cleanups from a wide variety of folks.

    * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
    block: hide badblocks attribute by default
    blk-mq: unify hctx delay_work and run_work
    block: add kblock_mod_delayed_work_on()
    blk-mq: unify hctx delayed_run_work and run_work
    nbd: fix use after free on module unload
    MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
    blk-mq-sched: allocate reserved tags out of normal pool
    mtip32xx: use runtime tag to initialize command header
    scsi: Implement blk_mq_ops.show_rq()
    blk-mq: Add blk_mq_ops.show_rq()
    blk-mq: Show operation, cmd_flags and rq_flags names
    blk-mq: Make blk_flags_show() callers append a newline character
    blk-mq: Move the "state" debugfs attribute one level down
    blk-mq: Unregister debugfs attributes earlier
    blk-mq: Only unregister hctxs for which registration succeeded
    blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
    blk-mq: Let blk_mq_debugfs_register() look up the queue name
    blk-mq: Register /queue/mq after having registered /queue
    ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
    ide-pm: always pass 0 error to __blk_end_request_all
    ...

    Linus Torvalds
     

14 Apr, 2017

2 commits

  • copy_page() is an optimized memcpy for page-aligned addresses. If it
    is used with a non-page-aligned address, it can corrupt memory, which
    means system corruption. With zram, this can happen with

    1. a 64K page size architecture
    2. partial IO
    3. slub debug

    Partial IO needs to allocate a page, and zram allocates it via
    kmalloc. With slub debug, kmalloc(PAGE_SIZE) doesn't return a
    page-size aligned address, and finally copy_page(mem, cmem) corrupts
    memory.

    So, this patch changes it to memcpy.

    Actually, we don't need to change the zram_bvec_write part, because
    zsmalloc returns a page-aligned address in the case of the PAGE_SIZE
    class, but it's not good to rely on the internals of zsmalloc.
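
    As a minimal sketch (a hypothetical rendering, not the literal
    patch), the read-side change amounts to:

        /*
         * copy_page() requires page-aligned src/dst, but a
         * kmalloc(PAGE_SIZE) buffer under slub debug may not be page
         * aligned, so copy with an explicit length instead.
         */
        - copy_page(mem, cmem);
        + memcpy(mem, cmem, PAGE_SIZE);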

    Note:
    When this patch is merged to stable, clear_page should be fixed, too.
    Unfortunately, recent zram removes it via the "same page merge"
    feature, so it's hard to backport this patch to the -stable tree.

    I will handle it when I receive the mail from the stable tree
    maintainer about merging this patch for backporting.

    Fixes: 42e99bd ("zram: optimize memory operations with clear_page()/copy_page()")
    Link: http://lkml.kernel.org/r/1492042622-12074-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In zram_rw_page, the logic to get the offset is wrong due to operator
    precedence (i.e., "<<" binds tighter than "&"). With the wrong offset,
    zram can corrupt the user's memory with partial IO.
    Cc: Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

09 Mar, 2017

1 commit

  • zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When
    using the NVMe over Fabrics loopback target, which potentially sends a
    huge number of pages attached to the bio's bvec, this results in a
    kernel panic because of out-of-bounds array accesses in
    zram_decompress_page().
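
    A minimal sketch of enforcing this at the queue level (assuming the
    standard block-layer helper of that era; the actual patch may differ
    in detail):

        /* Cap per-IO size so a single bvec never spans more than one
         * page's worth of sectors on the zram queue. */
        blk_queue_max_hw_sectors(zram->disk->queue, SECTORS_PER_PAGE);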

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

25 Feb, 2017

2 commits

  • The idea is that, without doing more calculations, we extend zero
    pages to same-element pages for zram; a zero page is the special case
    of a same-element page whose element is zero.

    1. the test was done under Android 7.0
    2. too many applications were started up in a loop
    3. the zero pages, same-element (non-zero) pages and total pages were
    sampled in the function page_zero_filled

    the result is listed as below:

     ZERO    SAME    TOTAL
    36214   17842   598196

             ZERO/TOTAL   SAME/TOTAL    (ZERO+SAME)/TOTAL  ZERO/SAME
    AVERAGE  0.060631909  0.024990816   0.085622726        2.663825038
    STDEV    0.00674612   0.005887625   0.009707034        2.115881328
    MAX      0.069698422  0.030046087   0.094975336        7.56043956
    MIN      0.03959586   0.007332205   0.056055193        1.928985507

    From the above data, the benefit is about 2.5% and up to 3% of total
    swapout pages.

    The drawback of the patch is that when we recover a page from a
    non-zero element, the operation is inefficient for partial reads.

    This patch extends zero_page to same_page, so any user who has been
    monitoring zero_pages may be surprised that the number has increased,
    but it's not harmful, I believe.
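
    A short sketch of how such a check can generalize page_zero_filled()
    (a hypothetical rendering, not necessarily the literal patch):

        /* A page is "same-filled" if every word equals the first one;
         * a zero page is the special case element == 0. */
        static bool page_same_filled(void *ptr, unsigned long *element)
        {
                unsigned long *page = ptr;
                int i;

                for (i = 1; i < PAGE_SIZE / sizeof(*page); i++)
                        if (page[i] != page[0])
                                return false;

                *element = page[0];
                return true;
        }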

    [minchan@kernel.org: do not free same element pages in zram_meta_free]
    Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox
    Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com
    Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org
    Signed-off-by: zhouxianrong
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhouxianrong
     
  • zram_reset_device() waits for ongoing writepage pages to complete via
    the zram->refcount logic. However, this is pointless: before the
    reset, we prevent further opening of zram via zram->claim and flush
    all pending IO via fsync_bdev, so there should be no pending IO by the
    time zram_reset_device() runs.

    So let's remove that code, which is even broken due to the lack of a
    wake_up elsewhere.

    Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

23 Feb, 2017

1 commit

  • We have had a deprecated_attr_warn() warning for 2 years, and now the
    time has come and we can finally do the cleanup.

    The plan was as follows:

    : per-stat sysfs attributes are considered to be deprecated.
    : The basic strategy is:
    : -- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
    : -- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
    :
    : The list of deprecated attributes can be found here:
    : Documentation/ABI/obsolete/sysfs-block-zram
    :
    : Basically, every attribute that has its own read accessible sysfs
    : node (e.g. num_reads) *AND* is accessible via one of the stat files
    : (zram/stat or zram/io_stat or zram/mm_stat) is considered
    : to be deprecated.

    The patch also removes `obsolete/sysfs-block-zram', cleans up
    `testing/sysfs-block-zram' and tweaks the zram.txt files.

    Link: http://lkml.kernel.org/r/20170118035838.11090-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

11 Jan, 2017

2 commits

  • zram has used the per-cpu streams feature since v4.7. It aims to
    increase the cache hit ratio of the scratch buffer used for
    compression. The downside of that approach is that zram must allocate
    memory space for the compressed page in per-cpu context, which
    requires a strict gfp flag that can fail. If the allocation fails,
    zram retries the allocation outside of per-cpu context, where it can
    get memory this time, compresses the data again, and copies it to the
    memory space.

    In this scenario, zram assumes the data never changes, but that is not
    true without stable page support. So, if the data is changed under us,
    zram can overrun the buffer so that the zsmalloc free object chain is
    broken and the system crashes, as below:

    https://bugzilla.suse.com/show_bug.cgi?id=997574

    This patch adds BDI_CAP_STABLE_WRITES to zram, declaring "I am a block
    device needing *stable writes*".
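
    In code terms, this is a one-liner on the queue's backing_dev_info (a
    sketch, assuming the layout of that era where the bdi is embedded in
    the queue):

        /* Tell writeback that pages must not be modified while they
         * are under IO to this device. */
        zram->disk->queue->backing_dev_info.capabilities |=
                        BDI_CAP_STABLE_WRITES;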

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: Hugh Dickins
    Cc: Darrick J. Wong
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
    moved the revalidate_disk call out of init_lock to avoid a lockdep
    false-positive splat. However, commit 08eee69fcf6b ("zram: remove
    init_lock in zram_make_request") removed init_lock from the IO path,
    so there is no worry about a lockdep splat anymore. So, let's restore
    it.

    This patch is needed to set BDI_CAP_STABLE_WRITES atomically in the
    next patch.

    Fixes: da9556a2367c ("zram: user per-cpu compression streams")
    Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Takashi Iwai
    Cc: Hyeoncheol Lee
    Cc:
    Cc: Sangseok Lee
    Cc: Hugh Dickins
    Cc: Darrick J. Wong
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

13 Dec, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the final round of converting the notifier mess to the state
    machine. The removal of the notifiers and the related infrastructure
    will happen around rc1, as there are conversions outstanding in other
    trees.

    The whole exercise removed about 2000 lines of code in total and in
    course of the conversion several dozen bugs got fixed. The new
    mechanism allows to test almost every hotplug step standalone, so
    usage sites can exercise all transitions extensively.

    There is more room for improvement, like integrating all the
    pointlessly different architecture mechanisms of synchronizing,
    setting cpus online etc into the core code"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    tracing/rb: Init the CPU mask on allocation
    soc/fsl/qbman: Convert to hotplug state machine
    soc/fsl/qbman: Convert to hotplug state machine
    zram: Convert to hotplug state machine
    KVM/PPC/Book3S HV: Convert to hotplug state machine
    arm64/cpuinfo: Convert to hotplug state machine
    arm64/cpuinfo: Make hotplug notifier symmetric
    mm/compaction: Convert to hotplug state machine
    iommu/vt-d: Convert to hotplug state machine
    mm/zswap: Convert pool to hotplug state machine
    mm/zswap: Convert dst-mem to hotplug state machine
    mm/zsmalloc: Convert to hotplug state machine
    mm/vmstat: Convert to hotplug state machine
    mm/vmstat: Avoid on each online CPU loops
    mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
    tracing/rb: Convert to hotplug state machine
    oprofile/nmi timer: Convert to hotplug state machine
    net/iucv: Use explicit clean up labels in iucv_init()
    x86/pci/amd-bus: Convert to hotplug state machine
    x86/oprofile/nmi: Convert to hotplug state machine
    ...

    Linus Torvalds
     

08 Dec, 2016

1 commit

  • zram's hot_add sysfs attribute is a very 'special' attribute - reading
    from it creates a new uninitialized zram device. By mistake, this file
    can currently be read by a 'normal' user, while only root must be able
    to create a new zram device; therefore the hot_add attribute must have
    mode S_IRUSR, not S_IRUGO.
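
    A sketch of the resulting attribute definition (assuming the driver
    switches to an explicit __ATTR with a root-only mode):

        /* 0400 (S_IRUSR): readable by root only, since a read
         * allocates a new device. */
        static struct class_attribute class_attr_hot_add =
                __ATTR(hot_add, 0400, hot_add_show, NULL);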

    [akpm@linux-foundation.org: s/sence/sense/, reflow comment to use 80 cols]
    Fixes: 6566d1a32bf72 ("zram: add dynamic device add/remove functionality")
    Link: http://lkml.kernel.org/r/20161205155845.20129-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Steven Allen
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

02 Dec, 2016

1 commit

  • Install the callbacks via the state machine with multi-instance
    support and let the core invoke the callbacks on the already online
    CPUs.

    [bigeasy: wire up the multi instance stuff]
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: rt@linutronix.de
    Cc: Nitin Gupta
    Link: http://lkml.kernel.org/r/20161126231350.10321-19-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Anna-Maria Gleixner
     

01 Dec, 2016

1 commit

  • The zram hot removal code calls idr_remove() even when zram_remove()
    returns an error (typically -EBUSY). This leaves a stale entry behind
    at device release, eventually leading to a crash when the module is
    reloaded.

    As described in the bug report below, the following procedure would
    cause an Oops with zram:

    - provision three zram devices via modprobe zram num_devices=3
    - configure a size for each device
    + echo "1G" > /sys/block/$zram_name/disksize
    - mkfs and mount zram0 only
    - attempt to hot remove all three devices
    + echo 2 > /sys/class/zram-control/hot_remove
    + echo 1 > /sys/class/zram-control/hot_remove
    + echo 0 > /sys/class/zram-control/hot_remove
    - zram0 removal fails with EBUSY, as expected
    - unmount zram0
    - try zram0 hot remove again
    + echo 0 > /sys/class/zram-control/hot_remove
    - fails with ENODEV (unexpected)
    - unload zram kernel module
    + completes successfully
    - zram0 device node still exists
    - attempt to mount /dev/zram0
    + mount command is killed
    + following BUG is encountered

    BUG: unable to handle kernel paging request at ffffffffa0002ba0
    IP: get_disk+0x16/0x50
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 252 Comm: mount Not tainted 4.9.0-rc6 #176
    Call Trace:
    exact_lock+0xc/0x20
    kobj_lookup+0xdc/0x160
    get_gendisk+0x2f/0x110
    __blkdev_get+0x10c/0x3c0
    blkdev_get+0x19d/0x2e0
    blkdev_open+0x56/0x70
    do_dentry_open.isra.19+0x1ff/0x310
    vfs_open+0x43/0x60
    path_openat+0x2c9/0xf30
    do_filp_open+0x79/0xd0
    do_sys_open+0x114/0x1e0
    SyS_open+0x19/0x20
    entry_SYSCALL_64_fastpath+0x13/0x94

    This patch adds the proper error check in hot_remove_store() so that
    idr_remove() is not called unconditionally.
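
    A sketch of the guarded removal in hot_remove_store(), per the fix
    (surrounding locking omitted):

        zram = idr_find(&zram_index_idr, dev_id);
        if (zram) {
                ret = zram_remove(zram);
                /* drop the idr entry only if the device is really gone */
                if (!ret)
                        idr_remove(&zram_index_idr, dev_id);
        } else {
                ret = -ENODEV;
        }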

    Fixes: 17ec4cd98578 ("zram: don't call idr_remove() from zram_remove()")
    Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1010970
    Link: http://lkml.kernel.org/r/20161121132140.12683-1-tiwai@suse.de
    Signed-off-by: Takashi Iwai
    Reviewed-by: David Disseldorp
    Reported-by: David Disseldorp
    Tested-by: David Disseldorp
    Acked-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Iwai
     

08 Aug, 2016

1 commit

  • Commit abf545484d31 changed it from an 'rw' flags type to the newer
    ops-based interface, but now we're effectively leaking some bdev
    internals to the rest of the kernel. Since we only care about whether
    it's a read or a write at that level, just pass in a bool 'is_write'
    parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.
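
    The resulting hook looks roughly like this (a sketch of the
    block_device_operations method after the change):

        /* callers pass a plain direction flag instead of op internals */
        int (*rw_page)(struct block_device *bdev, sector_t sector,
                       struct page *page, bool is_write);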

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result,
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs are sent
    down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

27 Jul, 2016

8 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton : (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • Zsmalloc is ready for page migration, so zram can use __GFP_MOVABLE
    from now on.

    I did a test to see how it helps to create higher-order pages. The
    test scenario is as follows.

    KVM guest, 1G memory, ext4-formatted zram block device,

    for i in `seq 1 8`;
    do
            dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
    done

    wait `pidof dd`

    for i in `seq 1 2 8`;
    do
            rm -rf mnt/test$i.txt
    done
    fstrim -v mnt

    echo "init"
    cat /proc/buddyinfo

    echo "compaction"
    echo 1 > /proc/sys/vm/compact_memory
    cat /proc/buddyinfo

    old:

    init
    Node 0, zone DMA 208 120 51 41 11 0 0 0 0 0 0
    Node 0, zone DMA32 16380 13777 9184 3805 789 54 3 0 0 0 0
    compaction
    Node 0, zone DMA 132 82 40 39 16 2 1 0 0 0 0
    Node 0, zone DMA32 5219 5526 4969 3455 1831 677 139 15 0 0 0

    new:

    init
    Node 0, zone DMA 379 115 97 19 2 0 0 0 0 0 0
    Node 0, zone DMA32 18891 16774 10862 3947 637 21 0 0 0 0 0
    compaction
    Node 0, zone DMA 214 66 87 29 10 3 0 0 0 0 0
    Node 0, zone DMA32 1612 3139 3154 2469 1745 990 384 94 7 0 0

    As you can see, compaction made so many high-order pages. Yay!
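
    The zram side of the change is essentially one extra flag on the
    allocation call (a sketch; the companion gfp bits shown are
    assumptions):

        /* __GFP_MOVABLE lets compaction migrate zsmalloc pages,
         * enabling the higher-order page gains shown above. */
        handle = zs_malloc(meta->mem_pool, clen,
                        __GFP_KSWAPD_RECLAIM | __GFP_NOWARN |
                        __GFP_HIGHMEM | __GFP_MOVABLE);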

    Link: http://lkml.kernel.org/r/1464736881-24886-13-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We now allocate streams from the CPU_UP hot-plug path; there are no
    context-dependent stream allocations anymore and we can schedule from
    zcomp_strm_alloc(). Use GFP_KERNEL directly and drop the gfp_t
    parameter.

    Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Add "deflate", "lz4hc", "842" algorithms to the list of known
    compression backends. The real availability of those algorithms,
    however, depends on the corresponding CONFIG_CRYPTO_FOO config options.

    [sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
    Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
    Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Remove lzo/lz4 backends, we use crypto API now.

    [sergey.senozhatsky@gmail.com: zram-delete-custom-lzo-lz4-v3]
    Link: http://lkml.kernel.org/r/20160604024902.11778-6-sergey.senozhatsky@gmail.com
    Link: http://lkml.kernel.org/r/20160531122017.2878-7-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • There is no way to get a string with all the crypto comp algorithms
    supported by the crypto comp engine, so we need to maintain our own
    backends list. At the same time, we additionally need to use
    crypto_has_comp() to make sure that the user has requested a
    compression algorithm that is recognized by the crypto comp engine.
    Relying on /proc/crypto is not an option here, because it does not
    show not-yet-loaded compression modules.

    Example:

    modprobe zram
    cat /proc/crypto | grep -i lz4
    modprobe lz4
    cat /proc/crypto | grep -i lz4
    name : lz4
    driver : lz4-generic
    module : lz4

    So the user can't tell exactly whether lz4 is really supported from
    the /proc/crypto output, unless someone or something has loaded it.

    This patch also adds crypto_has_comp() to zcomp_available_show(). We
    store all the compression algorithm names in zcomp's `backends' array,
    regardless of the CONFIG_CRYPTO_FOO configuration, but show only those
    that are also supported by the crypto engine. This helps the user know
    the exact list of compression algorithms that can be used.

    Example:
    module lz4 is not loaded yet, but is supported by the crypto
    engine. /proc/crypto has no information on this module, while
    zram's `comp_algorithm' lists it:

    cat /proc/crypto | grep -i lz4

    cat /sys/block/zram0/comp_algorithm
    [lzo] lz4 deflate lz4hc 842

    We still use the `backends' array to determine if the requested
    compression backend is known to the crypto api. This array, however,
    may not contain some entries, so as the last step we call the
    crypto_has_comp() function, which attempts to insmod the requested
    compression algorithm to determine if the crypto api supports it. The
    advantage of this method is that we now permit the usage of
    out-of-tree crypto compression modules (implementing S/W or H/W
    compression).
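
    Condensed, the availability check described above looks roughly like
    this (helper name and structure are illustrative):

        static bool zcomp_available_algorithm(const char *comp)
        {
                /*
                 * crypto_has_comp() will try to load the module if
                 * needed, so out-of-tree compression modules work too.
                 */
                return crypto_has_comp(comp, 0, 0) == 1;
        }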

    [sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
    Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
    Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • We don't have an idle zstreams list anymore and our write path now
    works absolutely differently, preventing preemption during
    compression. This removes the possibility of read paths preempting
    writes at the wrong places (which could badly affect the performance
    of both paths) and at the same time opens the door for a move from the
    custom LZO/LZ4 compression backends implementation to a more generic
    one, using the crypto compress API.

    Joonsoo Kim [1] attempted to do this a while ago, but was faced with
    the need to introduce a new crypto API interface. The root cause was
    the fact that crypto API compression algorithms require a compression
    stream structure (in zram terminology) for both compression and
    decompression ops, while in reality only some compression algorithms
    really need it. This resulted in a concept of context-less crypto API
    compression backends [2]. Both write and read paths, though, would
    have been executed with preemption enabled, which in the worst case
    could have degraded performance, e.g. consider the following case:

    CPU0

    zram_write()
      spin_lock()
        take the last idle stream
      spin_unlock()

    << preempted >>

    zram_read()
      spin_lock()
        no idle streams
      spin_unlock()
      schedule()

    resuming zram_write compression()

    but it took me some time to realize that, and it took even longer to
    evolve zram and make it ready for the crypto API. The key turned out
    to be -- drop the idle streams list entirely. Without the idle streams
    list we are free to use compression algorithms that require a
    compression stream for decompression (read), because streams are now
    placed in per-cpu data and each write path has to disable preemption
    for the compression op, almost completely eliminating the
    aforementioned case (technically, we still have a small chance,
    because the write path has a fast and a slow path, and the slow path
    is executed with preemption enabled; but the frequency of fast-path
    failures is very low).
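
    For reference, the generic crypto compress calls that zcomp moves to
    look roughly like this (a sketch with error handling omitted; "lzo"
    is just an example algorithm name):

        struct crypto_comp *tfm = crypto_alloc_comp("lzo", 0, 0);
        unsigned int dlen = dst_size;

        /* one compression op on the stream's scratch buffers */
        err = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);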

    TEST
    ====

    - 4 CPUs, x86_64 system
    - 3G zram, lzo
    - fio tests: read, randread, write, randwrite, rw, randrw

    test script [3] command:
    ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh

    BASE PATCHED
    jobs1
    READ: 2527.2MB/s 2482.7MB/s
    READ: 2102.7MB/s 2045.0MB/s
    WRITE: 1284.3MB/s 1324.3MB/s
    WRITE: 1080.7MB/s 1101.9MB/s
    READ: 430125KB/s 437498KB/s
    WRITE: 430538KB/s 437919KB/s
    READ: 399593KB/s 403987KB/s
    WRITE: 399910KB/s 404308KB/s
    jobs2
    READ: 8133.5MB/s 7854.8MB/s
    READ: 7086.6MB/s 6912.8MB/s
    WRITE: 3177.2MB/s 3298.3MB/s
    WRITE: 2810.2MB/s 2871.4MB/s
    READ: 1017.6MB/s 1023.4MB/s
    WRITE: 1018.2MB/s 1023.1MB/s
    READ: 977836KB/s 984205KB/s
    WRITE: 979435KB/s 985814KB/s
    jobs3
    READ: 13557MB/s 13391MB/s
    READ: 11876MB/s 11752MB/s
    WRITE: 4641.5MB/s 4682.1MB/s
    WRITE: 4164.9MB/s 4179.3MB/s
    READ: 1453.8MB/s 1455.1MB/s
    WRITE: 1455.1MB/s 1458.2MB/s
    READ: 1387.7MB/s 1395.7MB/s
    WRITE: 1386.1MB/s 1394.9MB/s
    jobs4
    READ: 20271MB/s 20078MB/s
    READ: 18033MB/s 17928MB/s
    WRITE: 6176.8MB/s 6180.5MB/s
    WRITE: 5686.3MB/s 5705.3MB/s
    READ: 2009.4MB/s 2006.7MB/s
    WRITE: 2007.5MB/s 2004.9MB/s
    READ: 1929.7MB/s 1935.6MB/s
    WRITE: 1926.8MB/s 1932.6MB/s
    jobs5
    READ: 18823MB/s 19024MB/s
    READ: 18968MB/s 19071MB/s
    WRITE: 6191.6MB/s 6372.1MB/s
    WRITE: 5818.7MB/s 5787.1MB/s
    READ: 2011.7MB/s 1981.3MB/s
    WRITE: 2011.4MB/s 1980.1MB/s
    READ: 1949.3MB/s 1935.7MB/s
    WRITE: 1940.4MB/s 1926.1MB/s
    jobs6
    READ: 21870MB/s 21715MB/s
    READ: 19957MB/s 19879MB/s
    WRITE: 6528.4MB/s 6537.6MB/s
    WRITE: 6098.9MB/s 6073.6MB/s
    READ: 2048.6MB/s 2049.9MB/s
    WRITE: 2041.7MB/s 2042.9MB/s
    READ: 2013.4MB/s 1990.4MB/s
    WRITE: 2009.4MB/s 1986.5MB/s
    jobs7
    READ: 21359MB/s 21124MB/s
    READ: 19746MB/s 19293MB/s
    WRITE: 6660.4MB/s 6518.8MB/s
    WRITE: 6211.6MB/s 6193.1MB/s
    READ: 2089.7MB/s 2080.6MB/s
    WRITE: 2085.8MB/s 2076.5MB/s
    READ: 2041.2MB/s 2052.5MB/s
    WRITE: 2037.5MB/s 2048.8MB/s
    jobs8
    READ: 20477MB/s 19974MB/s
    READ: 18922MB/s 18576MB/s
    WRITE: 6851.9MB/s 6788.3MB/s
    WRITE: 6407.7MB/s 6347.5MB/s
    READ: 2134.8MB/s 2136.1MB/s
    WRITE: 2132.8MB/s 2134.4MB/s
    READ: 2074.2MB/s 2069.6MB/s
    WRITE: 2087.3MB/s 2082.4MB/s
    jobs9
    READ: 19797MB/s 19994MB/s
    READ: 18806MB/s 18581MB/s
    WRITE: 6878.7MB/s 6822.7MB/s
    WRITE: 6456.8MB/s 6447.2MB/s
    READ: 2141.1MB/s 2154.7MB/s
    WRITE: 2144.4MB/s 2157.3MB/s
    READ: 2084.1MB/s 2085.1MB/s
    WRITE: 2091.5MB/s 2092.5MB/s
    jobs10
    READ: 19794MB/s 19784MB/s
    READ: 18794MB/s 18745MB/s
    WRITE: 6984.4MB/s 6676.3MB/s
    WRITE: 6532.3MB/s 6342.7MB/s
    READ: 2150.6MB/s 2155.4MB/s
    WRITE: 2156.8MB/s 2161.5MB/s
    READ: 2106.4MB/s 2095.6MB/s
    WRITE: 2109.7MB/s 2098.4MB/s

    BASE PATCHED
    jobs1 perfstat
    stalled-cycles-frontend 102,480,595,419 ( 41.53%) 114,508,864,804 ( 46.92%)
    stalled-cycles-backend 51,941,417,832 ( 21.05%) 46,836,112,388 ( 19.19%)
    instructions 283,612,054,215 ( 1.15) 283,918,134,959 ( 1.16)
    branches 56,372,560,385 ( 724.923) 56,449,814,753 ( 733.766)
    branch-misses 374,826,000 ( 0.66%) 326,935,859 ( 0.58%)
    jobs2 perfstat
    stalled-cycles-frontend 155,142,745,777 ( 40.99%) 164,170,979,198 ( 43.82%)
    stalled-cycles-backend 70,813,866,387 ( 18.71%) 66,456,858,165 ( 17.74%)
    instructions 463,436,648,173 ( 1.22) 464,221,890,191 ( 1.24)
    branches 91,088,733,902 ( 760.088) 91,278,144,546 ( 769.133)
    branch-misses 504,460,363 ( 0.55%) 394,033,842 ( 0.43%)
    jobs3 perfstat
    stalled-cycles-frontend 201,300,397,212 ( 39.84%) 223,969,902,257 ( 44.44%)
    stalled-cycles-backend 87,712,593,974 ( 17.36%) 81,618,888,712 ( 16.19%)
    instructions 642,869,545,023 ( 1.27) 644,677,354,132 ( 1.28)
    branches 125,724,560,594 ( 690.682) 126,133,159,521 ( 694.542)
    branch-misses 527,941,798 ( 0.42%) 444,782,220 ( 0.35%)
    jobs4 perfstat
    stalled-cycles-frontend 246,701,197,429 ( 38.12%) 280,076,030,886 ( 43.29%)
    stalled-cycles-backend 119,050,341,112 ( 18.40%) 110,955,641,671 ( 17.15%)
    instructions 822,716,962,127 ( 1.27) 825,536,969,320 ( 1.28)
    branches 160,590,028,545 ( 688.614) 161,152,996,915 ( 691.068)
    branch-misses 650,295,287 ( 0.40%) 550,229,113 ( 0.34%)
    jobs5 perfstat
    stalled-cycles-frontend 298,958,462,516 ( 38.30%) 344,852,200,358 ( 44.16%)
    stalled-cycles-backend 137,558,742,122 ( 17.62%) 129,465,067,102 ( 16.58%)
    instructions 1,005,714,688,752 ( 1.29) 1,007,657,999,432 ( 1.29)
    branches 195,988,773,962 ( 697.730) 196,446,873,984 ( 700.319)
    branch-misses 695,818,940 ( 0.36%) 624,823,263 ( 0.32%)
    jobs6 perfstat
    stalled-cycles-frontend 334,497,602,856 ( 36.71%) 387,590,419,779 ( 42.38%)
    stalled-cycles-backend 163,539,365,335 ( 17.95%) 152,640,193,639 ( 16.69%)
    instructions 1,184,738,177,851 ( 1.30) 1,187,396,281,677 ( 1.30)
    branches 230,592,915,640 ( 702.902) 231,253,802,882 ( 702.356)
    branch-misses 747,934,786 ( 0.32%) 643,902,424 ( 0.28%)
    jobs7 perfstat
    stalled-cycles-frontend 396,724,684,187 ( 37.71%) 460,705,858,952 ( 43.84%)
    stalled-cycles-backend 188,096,616,496 ( 17.88%) 175,785,787,036 ( 16.73%)
    instructions 1,364,041,136,608 ( 1.30) 1,366,689,075,112 ( 1.30)
    branches 265,253,096,936 ( 700.078) 265,890,524,883 ( 702.839)
    branch-misses 784,991,589 ( 0.30%) 729,196,689 ( 0.27%)
    jobs8 perfstat
    stalled-cycles-frontend 440,248,299,870 ( 36.92%) 509,554,793,816 ( 42.46%)
    stalled-cycles-backend 222,575,930,616 ( 18.67%) 213,401,248,432 ( 17.78%)
    instructions 1,542,262,045,114 ( 1.29) 1,545,233,932,257 ( 1.29)
    branches 299,775,178,439 ( 697.666) 300,528,458,505 ( 694.769)
    branch-misses 847,496,084 ( 0.28%) 748,794,308 ( 0.25%)
    jobs9 perfstat
    stalled-cycles-frontend 506,269,882,480 ( 37.86%) 592,798,032,820 ( 44.43%)
    stalled-cycles-backend 253,192,498,861 ( 18.93%) 233,727,666,185 ( 17.52%)
    instructions 1,721,985,080,913 ( 1.29) 1,724,666,236,005 ( 1.29)
    branches 334,517,360,255 ( 694.134) 335,199,758,164 ( 697.131)
    branch-misses 873,496,730 ( 0.26%) 815,379,236 ( 0.24%)
    jobs10 perfstat
    stalled-cycles-frontend 549,063,363,749 ( 37.18%) 651,302,376,662 ( 43.61%)
    stalled-cycles-backend 281,680,986,810 ( 19.07%) 277,005,235,582 ( 18.55%)
    instructions 1,901,859,271,180 ( 1.29) 1,906,311,064,230 ( 1.28)
    branches 369,398,536,153 ( 694.004) 370,527,696,358 ( 688.409)
    branch-misses 967,929,335 ( 0.26%) 890,125,056 ( 0.24%)

    BASE PATCHED
    seconds elapsed 79.421641008 78.735285546
    seconds elapsed 61.471246133 60.869085949
    seconds elapsed 62.317058173 62.224188495
    seconds elapsed 60.030739363 60.081102518
    seconds elapsed 74.070398362 74.317582865
    seconds elapsed 84.985953007 85.414364176
    seconds elapsed 97.724553255 98.173311344
    seconds elapsed 109.488066758 110.268399318
    seconds elapsed 122.768189405 122.967164498
    seconds elapsed 135.130035105 136.934770801

    On my other system (8 x86_64 CPUs, short version of test results):

    BASE PATCHED
    seconds elapsed 19.518065994 19.806320662
    seconds elapsed 15.172772749 15.594718291
    seconds elapsed 13.820925970 13.821708564
    seconds elapsed 13.293097816 14.585206405
    seconds elapsed 16.207284118 16.064431606
    seconds elapsed 17.958376158 17.771825767
    seconds elapsed 19.478009164 19.602961508
    seconds elapsed 21.347152811 21.352318709
    seconds elapsed 24.478121126 24.171088735
    seconds elapsed 26.865057442 26.767327618

    So performance-wise the numbers are quite similar.

    Also update zcomp interface to be more aligned with the crypto API.

    [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
    [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
    [3] https://github.com/sergey-senozhatsky/zram-perf-test

    Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Suggested-by: Joonsoo Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This started as 'add zlib support' work, but after some thinking I saw
    no blockers for a bigger change -- a switch to the crypto API.

    We don't have an idle zstreams list anymore and our write path now
    works absolutely differently, preventing preemption during
    compression. This removes the possibility of read paths preempting
    writes at the wrong places and opens the door for a move from the
    custom LZO/LZ4 compression backends implementation to a more generic
    one, using the crypto compress API.

    This patch set also eliminates the need for a new context-less crypto
    API interface, which was quite hard to sell, so we can move along
    faster.

    benchmarks:

    (x86_64, 4GB, zram-perf script)

    perf-reported run time, fio (max jobs=3). I performed the fio test
    with an increasing number of parallel jobs (up to 3) on a 3G zram
    device, using `static' data and the following crypto comp algorithms:

    842, deflate, lz4, lz4hc, lzo

    The output was:

    - the test running time (which tells us which algorithms perform
    faster)

    and

    - zram mm_stat (which tells the compressed memory size, max used
    memory, etc).

    It's just for information. For example, LZ4HC has twice the running
    time of LZO, but the compressed memory size is 23592960 vs 34603008
    bytes.

    test-fio-zram-842
    197.907655282 seconds time elapsed
    201.623142884 seconds time elapsed
    226.854291345 seconds time elapsed
    test-fio-zram-DEFLATE
    253.259516155 seconds time elapsed
    258.148563401 seconds time elapsed
    290.251909365 seconds time elapsed
    test-fio-zram-LZ4
    27.022598717 seconds time elapsed
    29.580522717 seconds time elapsed
    33.293463430 seconds time elapsed
    test-fio-zram-LZ4HC
    56.393954615 seconds time elapsed
    74.904659747 seconds time elapsed
    101.940998564 seconds time elapsed
    test-fio-zram-LZO
    28.155948075 seconds time elapsed
    30.390036330 seconds time elapsed
    34.455773159 seconds time elapsed

    zram mm_stat-s (max fio jobs=3)

    test-fio-zram-842
    mm_stat (jobs1): 3221225472 673185792 690266112 0 690266112 0 0
    mm_stat (jobs2): 3221225472 673185792 690266112 0 690266112 0 0
    mm_stat (jobs3): 3221225472 673185792 690266112 0 690266112 0 0
    test-fio-zram-DEFLATE
    mm_stat (jobs1): 3221225472 24379392 37761024 0 37761024 0 0
    mm_stat (jobs2): 3221225472 24379392 37761024 0 37761024 0 0
    mm_stat (jobs3): 3221225472 24379392 37761024 0 37761024 0 0
    test-fio-zram-LZ4
    mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
    mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
    mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
    test-fio-zram-LZ4HC
    mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
    mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
    mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
    test-fio-zram-LZO
    mm_stat (jobs1): 3221225472 34603008 50335744 0 50335744 0 0
    mm_stat (jobs2): 3221225472 34603008 50335744 0 50335744 0 0
    mm_stat (jobs3): 3221225472 34603008 50335744 0 50339840 0 0

    This patch (of 8):

    We don't perform any zstream idle list lookups anymore, so the
    zcomp_strm_find()/zcomp_strm_release() names are no longer
    representative.

    Rename them to zcomp_stream_get()/zcomp_stream_put().

    Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

08 Jun, 2016

1 commit

  • This patch converts the simple bi_rw use cases in the block, drivers,
    mm and fs code to set/get the bio operation using
    bio_set_op_attrs/bio_op.

    These should be simple one- or two-liner cases, so I just did them in
    one patch. The next patches handle the more complicated cases, one
    module per patch.
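
    The typical conversion pattern (a sketch, not a specific call site):

        bool is_write;

        /* setters: replace direct bi_rw assignment */
        bio_set_op_attrs(bio, REQ_OP_WRITE, 0);

        /* getters: replace bi_rw tests */
        is_write = (bio_op(bio) == REQ_OP_WRITE);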

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

21 May, 2016

4 commits

  • debug_stat sysfs is read-only and represents various debugging data
    that zram developers may need. This file is not meant to be used by
    anyone else: its content is not documented and will change at any time
    w/o any notice. Therefore, the output of the debug_stat file contains
    a version string. To avoid any confusion, we will increase the version
    number every time we modify the output.

    At the moment this file exports only one value -- the number of
    re-compressions, IOW, the number of times the compression fast path
    has failed. This stat is temporary and will be useful in case any
    per-cpu compression stream regressions are reported.

    Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
    Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Remove the internal part of the max_comp_streams interface, since we
    switched to per-cpu streams. We will keep the RW max_comp_streams attr
    around, because:

    a) we may (silently) switch back to an idle compression streams list
    and don't want to disturb user space

    b) the max_comp_streams attr must wait for the next 'lay off cycle';
    we give user space 2 years to adjust before we remove/downgrade the
    attr, and there are already several attrs scheduled for removal in
    4.11, so it's too late for max_comp_streams.

    This slightly changes user-visible behaviour:

    - First, reading from the max_comp_streams file will now always return
    the number of online CPUs.

    - Second, writing to max_comp_streams will not take any effect.

    Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Remove the idle streams list and keep compression streams in per-cpu
    data. This removes two contended spin_lock()/spin_unlock() calls from
    the write path and also prevents the write OP from being preempted
    while holding the compression stream, which can cause slowdowns.

    For instance, let's assume that we have N CPUs and N-2
    max_comp_streams. TASK1 owns the last idle stream, and TASK2-TASK3
    come in with write requests:

      TASK1            TASK2                  TASK3
    zram_bvec_write()
     spin_lock
     find stream
     spin_unlock

     compress

                     zram_bvec_write()
                      spin_lock
                      find stream
                      spin_unlock
                        no_stream
                          schedule
                                            zram_bvec_write()
                                             spin_lock
                                             find_stream
                                             spin_unlock
                                               no_stream
                                                 schedule
     spin_lock
     release stream
     spin_unlock
       wake up TASK2

    Not only will TASK2 and TASK3 not get the stream, TASK1 will also be
    preempted in the middle of its operation, while we would prefer it to
    finish compression and release the stream.
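
    The mechanics behind "not preempted while holding the stream" are
    per-cpu accessors (a sketch, using the get/put naming the code later
    settles on):

        /* Taking the per-cpu stream disables preemption until it is
         * put back, so the compression op cannot be scheduled away. */
        struct zcomp_strm *zcomp_stream_get(struct zcomp *comp)
        {
                return *get_cpu_ptr(comp->stream);
        }

        void zcomp_stream_put(struct zcomp *comp)
        {
                put_cpu_ptr(comp->stream);
        }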

    Test environment: x86_64, 4 CPU box, 3G zram, lzo

    The following fio tests were executed:
    read, randread, write, randwrite, rw, randrw,
    with the number of jobs increasing from 1 to 10.

    4 streams 8 streams per-cpu
    ===========================================================
    jobs1
    READ: 2520.1MB/s 2566.5MB/s 2491.5MB/s
    READ: 2102.7MB/s 2104.2MB/s 2091.3MB/s
    WRITE: 1355.1MB/s 1320.2MB/s 1378.9MB/s
    WRITE: 1103.5MB/s 1097.2MB/s 1122.5MB/s
    READ: 434013KB/s 435153KB/s 439961KB/s
    WRITE: 433969KB/s 435109KB/s 439917KB/s
    READ: 403166KB/s 405139KB/s 403373KB/s
    WRITE: 403223KB/s 405197KB/s 403430KB/s
    jobs2
    READ: 7958.6MB/s 8105.6MB/s 8073.7MB/s
    READ: 6864.9MB/s 6989.8MB/s 7021.8MB/s
    WRITE: 2438.1MB/s 2346.9MB/s 3400.2MB/s
    WRITE: 1994.2MB/s 1990.3MB/s 2941.2MB/s
    READ: 981504KB/s 973906KB/s 1018.8MB/s
    WRITE: 981659KB/s 974060KB/s 1018.1MB/s
    READ: 937021KB/s 938976KB/s 987250KB/s
    WRITE: 934878KB/s 936830KB/s 984993KB/s
    jobs3
    READ: 13280MB/s 13553MB/s 13553MB/s
    READ: 11534MB/s 11785MB/s 11755MB/s
    WRITE: 3456.9MB/s 3469.9MB/s 4810.3MB/s
    WRITE: 3029.6MB/s 3031.6MB/s 4264.8MB/s
    READ: 1363.8MB/s 1362.6MB/s 1448.9MB/s
    WRITE: 1361.9MB/s 1360.7MB/s 1446.9MB/s
    READ: 1309.4MB/s 1310.6MB/s 1397.5MB/s
    WRITE: 1307.4MB/s 1308.5MB/s 1395.3MB/s
    jobs4
    READ: 20244MB/s 20177MB/s 20344MB/s
    READ: 17886MB/s 17913MB/s 17835MB/s
    WRITE: 4071.6MB/s 4046.1MB/s 6370.2MB/s
    WRITE: 3608.9MB/s 3576.3MB/s 5785.4MB/s
    READ: 1824.3MB/s 1821.6MB/s 1997.5MB/s
    WRITE: 1819.8MB/s 1817.4MB/s 1992.5MB/s
    READ: 1765.7MB/s 1768.3MB/s 1937.3MB/s
    WRITE: 1767.5MB/s 1769.1MB/s 1939.2MB/s
    jobs5
    READ: 18663MB/s 18986MB/s 18823MB/s
    READ: 16659MB/s 16605MB/s 16954MB/s
    WRITE: 3912.4MB/s 3888.7MB/s 6126.9MB/s
    WRITE: 3506.4MB/s 3442.5MB/s 5519.3MB/s
    READ: 1798.2MB/s 1746.5MB/s 1935.8MB/s
    WRITE: 1792.7MB/s 1740.7MB/s 1929.1MB/s
    READ: 1727.6MB/s 1658.2MB/s 1917.3MB/s
    WRITE: 1726.5MB/s 1657.2MB/s 1916.6MB/s
    jobs6
    READ: 21017MB/s 20922MB/s 21162MB/s
    READ: 19022MB/s 19140MB/s 18770MB/s
    WRITE: 3968.2MB/s 4037.7MB/s 6620.8MB/s
    WRITE: 3643.5MB/s 3590.2MB/s 6027.5MB/s
    READ: 1871.8MB/s 1880.5MB/s 2049.9MB/s
    WRITE: 1867.8MB/s 1877.2MB/s 2046.2MB/s
    READ: 1755.8MB/s 1710.3MB/s 1964.7MB/s
    WRITE: 1750.5MB/s 1705.9MB/s 1958.8MB/s
    jobs7
    READ: 21103MB/s 20677MB/s 21482MB/s
    READ: 18522MB/s 18379MB/s 19443MB/s
    WRITE: 4022.5MB/s 4067.4MB/s 6755.9MB/s
    WRITE: 3691.7MB/s 3695.5MB/s 5925.6MB/s
    READ: 1841.5MB/s 1933.9MB/s 2090.5MB/s
    WRITE: 1842.7MB/s 1935.3MB/s 2091.9MB/s
    READ: 1832.4MB/s 1856.4MB/s 1971.5MB/s
    WRITE: 1822.3MB/s 1846.2MB/s 1960.6MB/s
    jobs8
    READ: 20463MB/s 20194MB/s 20862MB/s
    READ: 18178MB/s 17978MB/s 18299MB/s
    WRITE: 4085.9MB/s 4060.2MB/s 7023.8MB/s
    WRITE: 3776.3MB/s 3737.9MB/s 6278.2MB/s
    READ: 1957.6MB/s 1944.4MB/s 2109.5MB/s
    WRITE: 1959.2MB/s 1946.2MB/s 2111.4MB/s
    READ: 1900.6MB/s 1885.7MB/s 2082.1MB/s
    WRITE: 1896.2MB/s 1881.4MB/s 2078.3MB/s
    jobs9
    READ: 19692MB/s 19734MB/s 19334MB/s
    READ: 17678MB/s 18249MB/s 17666MB/s
    WRITE: 4004.7MB/s 4064.8MB/s 6990.7MB/s
    WRITE: 3724.7MB/s 3772.1MB/s 6193.6MB/s
    READ: 1953.7MB/s 1967.3MB/s 2105.6MB/s
    WRITE: 1953.4MB/s 1966.7MB/s 2104.1MB/s
    READ: 1860.4MB/s 1897.4MB/s 2068.5MB/s
    WRITE: 1858.9MB/s 1895.9MB/s 2066.8MB/s
    jobs10
    READ: 19730MB/s 19579MB/s 19492MB/s
    READ: 18028MB/s 18018MB/s 18221MB/s
    WRITE: 4027.3MB/s 4090.6MB/s 7020.1MB/s
    WRITE: 3810.5MB/s 3846.8MB/s 6426.8MB/s
    READ: 1956.1MB/s 1994.6MB/s 2145.2MB/s
    WRITE: 1955.9MB/s 1993.5MB/s 2144.8MB/s
    READ: 1852.8MB/s 1911.6MB/s 2075.8MB/s
    WRITE: 1855.7MB/s 1914.6MB/s 2078.1MB/s

    perf stat

    4 streams 8 streams per-cpu
    ====================================================================================================================
    jobs1
    stalled-cycles-frontend 23,174,811,209 ( 38.21%) 23,220,254,188 ( 38.25%) 23,061,406,918 ( 38.34%)
    stalled-cycles-backend 11,514,174,638 ( 18.98%) 11,696,722,657 ( 19.27%) 11,370,852,810 ( 18.90%)
    instructions 73,925,005,782 ( 1.22) 73,903,177,632 ( 1.22) 73,507,201,037 ( 1.22)
    branches 14,455,124,835 ( 756.063) 14,455,184,779 ( 755.281) 14,378,599,509 ( 758.546)
    branch-misses 69,801,336 ( 0.48%) 80,225,529 ( 0.55%) 72,044,726 ( 0.50%)
    jobs2
    stalled-cycles-frontend 49,912,741,782 ( 46.11%) 50,101,189,290 ( 45.95%) 32,874,195,633 ( 35.11%)
    stalled-cycles-backend 27,080,366,230 ( 25.02%) 27,949,970,232 ( 25.63%) 16,461,222,706 ( 17.58%)
    instructions 122,831,629,690 ( 1.13) 122,919,846,419 ( 1.13) 121,924,786,775 ( 1.30)
    branches 23,725,889,239 ( 692.663) 23,733,547,140 ( 688.062) 23,553,950,311 ( 794.794)
    branch-misses 90,733,041 ( 0.38%) 96,320,895 ( 0.41%) 84,561,092 ( 0.36%)
    jobs3
    stalled-cycles-frontend 66,437,834,608 ( 45.58%) 63,534,923,344 ( 43.69%) 42,101,478,505 ( 33.19%)
    stalled-cycles-backend 34,940,799,661 ( 23.97%) 34,774,043,148 ( 23.91%) 21,163,324,388 ( 16.68%)
    instructions 171,692,121,862 ( 1.18) 171,775,373,044 ( 1.18) 170,353,542,261 ( 1.34)
    branches 32,968,962,622 ( 628.723) 32,987,739,894 ( 630.512) 32,729,463,918 ( 717.027)
    branch-misses 111,522,732 ( 0.34%) 110,472,894 ( 0.33%) 99,791,291 ( 0.30%)
    jobs4
    stalled-cycles-frontend 98,741,701,675 ( 49.72%) 94,797,349,965 ( 47.59%) 54,535,655,381 ( 33.53%)
    stalled-cycles-backend 54,642,609,615 ( 27.51%) 55,233,554,408 ( 27.73%) 27,882,323,541 ( 17.14%)
    instructions 220,884,807,851 ( 1.11) 220,930,887,273 ( 1.11) 218,926,845,851 ( 1.35)
    branches 42,354,518,180 ( 592.105) 42,362,770,587 ( 590.452) 41,955,552,870 ( 716.154)
    branch-misses 138,093,449 ( 0.33%) 131,295,286 ( 0.31%) 121,794,771 ( 0.29%)
    jobs5
    stalled-cycles-frontend 116,219,747,212 ( 48.14%) 110,310,397,012 ( 46.29%) 66,373,082,723 ( 33.70%)
    stalled-cycles-backend 66,325,434,776 ( 27.48%) 64,157,087,914 ( 26.92%) 32,999,097,299 ( 16.76%)
    instructions 270,615,008,466 ( 1.12) 270,546,409,525 ( 1.14) 268,439,910,948 ( 1.36)
    branches 51,834,046,557 ( 599.108) 51,811,867,722 ( 608.883) 51,412,576,077 ( 729.213)
    branch-misses 158,197,086 ( 0.31%) 142,639,805 ( 0.28%) 133,425,455 ( 0.26%)
    jobs6
    stalled-cycles-frontend 138,009,414,492 ( 48.23%) 139,063,571,254 ( 48.80%) 75,278,568,278 ( 32.80%)
    stalled-cycles-backend 79,211,949,650 ( 27.68%) 79,077,241,028 ( 27.75%) 37,735,797,899 ( 16.44%)
    instructions 319,763,993,731 ( 1.12) 319,937,782,834 ( 1.12) 316,663,600,784 ( 1.38)
    branches 61,219,433,294 ( 595.056) 61,250,355,540 ( 598.215) 60,523,446,617 ( 733.706)
    branch-misses 169,257,123 ( 0.28%) 154,898,028 ( 0.25%) 141,180,587 ( 0.23%)
    jobs7
    stalled-cycles-frontend 162,974,812,119 ( 49.20%) 159,290,061,987 ( 48.43%) 88,046,641,169 ( 33.21%)
    stalled-cycles-backend 92,223,151,661 ( 27.84%) 91,667,904,406 ( 27.87%) 44,068,454,971 ( 16.62%)
    instructions 369,516,432,430 ( 1.12) 369,361,799,063 ( 1.12) 365,290,380,661 ( 1.38)
    branches 70,795,673,950 ( 594.220) 70,743,136,124 ( 597.876) 69,803,996,038 ( 732.822)
    branch-misses 181,708,327 ( 0.26%) 165,767,821 ( 0.23%) 150,109,797 ( 0.22%)
    jobs8
    stalled-cycles-frontend 185,000,017,027 ( 49.30%) 182,334,345,473 ( 48.37%) 99,980,147,041 ( 33.26%)
    stalled-cycles-backend 105,753,516,186 ( 28.18%) 107,937,830,322 ( 28.63%) 51,404,177,181 ( 17.10%)
    instructions 418,153,161,055 ( 1.11) 418,308,565,828 ( 1.11) 413,653,475,581 ( 1.38)
    branches 80,035,882,398 ( 592.296) 80,063,204,510 ( 589.843) 79,024,105,589 ( 730.530)
    branch-misses 199,764,528 ( 0.25%) 177,936,926 ( 0.22%) 160,525,449 ( 0.20%)
    jobs9
    stalled-cycles-frontend 210,941,799,094 ( 49.63%) 204,714,679,254 ( 48.55%) 114,251,113,756 ( 33.96%)
    stalled-cycles-backend 122,640,849,067 ( 28.85%) 122,188,553,256 ( 28.98%) 58,360,041,127 ( 17.35%)
    instructions 468,151,025,415 ( 1.10) 467,354,869,323 ( 1.11) 462,665,165,216 ( 1.38)
    branches 89,657,067,510 ( 585.628) 89,411,550,407 ( 588.990) 88,360,523,943 ( 730.151)
    branch-misses 218,292,301 ( 0.24%) 191,701,247 ( 0.21%) 178,535,678 ( 0.20%)
    jobs10
    stalled-cycles-frontend 233,595,958,008 ( 49.81%) 227,540,615,689 ( 49.11%) 160,341,979,938 ( 43.07%)
    stalled-cycles-backend 136,153,676,021 ( 29.03%) 133,635,240,742 ( 28.84%) 65,909,135,465 ( 17.70%)
    instructions 517,001,168,497 ( 1.10) 516,210,976,158 ( 1.11) 511,374,038,613 ( 1.37)
    branches 98,911,641,329 ( 585.796) 98,700,069,712 ( 591.583) 97,646,761,028 ( 728.712)
    branch-misses 232,341,823 ( 0.23%) 199,256,308 ( 0.20%) 183,135,268 ( 0.19%)

    per-cpu streams tend to cause significantly fewer stalled cycles,
    execute fewer branches and hit fewer branch misses.

    perf stat reported execution time

    4 streams 8 streams per-cpu
    ====================================================================
    jobs1
    seconds elapsed 20.909073870 20.875670495 20.817838540
    jobs2
    seconds elapsed 18.529488399 18.720566469 16.356103108
    jobs3
    seconds elapsed 18.991159531 18.991340812 16.766216066
    jobs4
    seconds elapsed 19.560643828 19.551323547 16.246621715
    jobs5
    seconds elapsed 24.746498464 25.221646740 20.696112444
    jobs6
    seconds elapsed 28.258181828 28.289765505 22.885688857
    jobs7
    seconds elapsed 32.632490241 31.909125381 26.272753738
    jobs8
    seconds elapsed 35.651403851 36.027596308 29.108024711
    jobs9
    seconds elapsed 40.569362365 40.024227989 32.898204012
    jobs10
    seconds elapsed 44.673112304 43.874898137 35.632952191

    Please see
    Link: http://marc.info/?l=linux-kernel&m=146166970727530
    Link: http://marc.info/?l=linux-kernel&m=146174716719650
    for more test results (under low memory conditions).

    Signed-off-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Pass GFP flags to zs_malloc() instead of using a fixed mask supplied
    to zs_create_pool(), so we can be more flexible; more importantly, we
    need this to switch zram to per-cpu compression streams -- zram will
    try to allocate a handle with preemption disabled in a fast path and
    switch to a slow path (using a different gfp mask) if the fast one
    fails.

    Apart from that, this also aligns the zs_malloc() interface with
    zspool/zbud.
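
    The interface change, sketched (the fast-path gfp bits shown are an
    assumption):

        /* gfp now travels with each allocation instead of being fixed
         * at zs_create_pool() time */
        handle = zs_malloc(meta->mem_pool, clen,
                        __GFP_KSWAPD_RECLAIM | __GFP_NOWARN | __GFP_HIGHMEM);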

    [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
    Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

16 Jan, 2016

5 commits

  • The use of idr_remove() is forbidden in the callback functions of
    idr_for_each(). It is therefore unsafe to call idr_remove() in
    zram_remove().

    This patch moves the call to idr_remove() from zram_remove() to
    hot_remove_store(). In the destroy_devices() path, idrs are removed by
    idr_destroy(). This solves a use-after-free detected by KASan.

    [akpm@linux-foundation.org: fix coding style, per Sergey]
    Signed-off-by: Jerome Marchand
    Acked-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Do not __GFP_ZERO allocated zcomp ->private pages. We keep allocated
    streams around and use them for read/write requests, so we supply a
    zeroed-out ->private to the compression algorithm as a scratch buffer
    only once -- the first time we use that stream. For the rest of the IO
    requests served by this stream, ->private usually contains some
    temporary data from the previous requests.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Each zcomp backend uses its own gfp flag, but that is pointless
    because the context in which they are called is driven by the upper
    layer (ie, the zcomp frontend). Moreover, the zcomp frontend can call
    them in different contexts: in one context (ie, the zram init part)
    allocation should be guaranteed to succeed, while in another (ie,
    further stream allocation to accelerate I/O speed) it is merely
    optional. So let's pass gfp down from the driver (ie, the zcomp
    frontend), following normal MM convention.

    [sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
    Signed-off-by: Minchan Kim
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When we were using LZ4 multi compression streams for zram swap, we
    found page allocation failure messages on a system running tests. That
    happened not just once, but a few times (2 - 5 times per test). Also,
    some failure cases kept retrying allocations of order 3.

    To create the parallel compression private data, we have to call
    kzalloc() with order 2/3 at runtime (lzo/lz4). But if there is no
    order 2/3 sized memory available to allocate at that time, the page
    allocation fails. This patch makes zram use vmalloc() as a fallback
    for kmalloc(), which prevents the page allocation failure warning.

    After applying this, we never saw the warning message in our running
    tests, and it could also reduce process startup latency by about
    60-120ms in each case.
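
    The fallback pattern, sketched for the lz4 backend (gfp bits are
    assumptions; the three-argument __vmalloc() form matches kernels of
    that era):

        static void *zcomp_lz4_create(void)
        {
                void *ret;

                ret = kzalloc(LZ4_MEM_COMPRESS, GFP_KERNEL | __GFP_NORETRY);
                if (!ret)
                        /* order-2/3 pages unavailable: fall back */
                        ret = __vmalloc(LZ4_MEM_COMPRESS,
                                        GFP_KERNEL | __GFP_NORETRY |
                                        __GFP_ZERO | __GFP_HIGHMEM,
                                        PAGE_KERNEL);
                return ret;
        }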

    For reference a call trace :

    Binder_1: page allocation failure: order:3, mode:0x10c0d0
    CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
    Call trace:
    dump_backtrace+0x0/0x270
    show_stack+0x10/0x1c
    dump_stack+0x1c/0x28
    warn_alloc_failed+0xfc/0x11c
    __alloc_pages_nodemask+0x724/0x7f0
    __get_free_pages+0x14/0x5c
    kmalloc_order_trace+0x38/0xd8
    zcomp_lz4_create+0x2c/0x38
    zcomp_strm_alloc+0x34/0x78
    zcomp_strm_multi_find+0x124/0x1ec
    zcomp_strm_find+0xc/0x18
    zram_bvec_rw+0x2fc/0x780
    zram_make_request+0x25c/0x2d4
    generic_make_request+0x80/0xbc
    submit_bio+0xa4/0x15c
    __swap_writepage+0x218/0x230
    swap_writepage+0x3c/0x4c
    shrink_page_list+0x51c/0x8d0
    shrink_inactive_list+0x3f8/0x60c
    shrink_lruvec+0x33c/0x4cc
    shrink_zone+0x3c/0x100
    try_to_free_pages+0x2b8/0x54c
    __alloc_pages_nodemask+0x514/0x7f0
    __get_free_pages+0x14/0x5c
    proc_info_read+0x50/0xe4
    vfs_read+0xa0/0x12c
    SyS_read+0x44/0x74
    DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
    0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB

    [minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
    [sergey.senozhatsky@gmail.com: tweak comments and styles]
    Signed-off-by: Kyeongdon Kim
    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyeongdon Kim
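
    A sketch of the fallback described above; the exact GFP flags are
    illustrative rather than copied from the patch, and note that
    __vmalloc() still took a pgprot argument in kernels of this vintage.

    static void *zcomp_lz4_create(void)
    {
        void *ret;

        /* Try the physically contiguous allocation first, but do not
         * retry hard and do not warn on failure... */
        ret = kzalloc(LZ4_MEM_COMPRESS,
                      GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
        if (!ret)
            /* ...then fall back to virtually contiguous memory,
             * which only needs order-0 pages. */
            ret = __vmalloc(LZ4_MEM_COMPRESS,
                            GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM,
                            PAGE_KERNEL);
        return ret;
    }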
     
  • We can end up allocating a new compression stream with GFP_KERNEL from
    within the IO path, which may result in nested (recursive) IO
    operations. That is a problem if the IO path in question is a
    reclaimer holding locks that nested IOs will deadlock on.

    Allocate streams and working memory using the GFP_NOIO flag, which
    forbids recursive IO and FS operations (see the sketch below).

    An example:

    inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
    git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555
    {IN-RECLAIM_FS-W} state was registered at:
    __lock_acquire+0x8da/0x117b
    lock_acquire+0x10c/0x1a7
    start_this_handle+0x52d/0x555
    jbd2__journal_start+0xb4/0x237
    __ext4_journal_start_sb+0x108/0x17e
    ext4_dirty_inode+0x32/0x61
    __mark_inode_dirty+0x16b/0x60c
    iput+0x11e/0x274
    __dentry_kill+0x148/0x1b8
    shrink_dentry_list+0x274/0x44a
    prune_dcache_sb+0x4a/0x55
    super_cache_scan+0xfc/0x176
    shrink_slab.part.14.constprop.25+0x2a2/0x4d3
    shrink_zone+0x74/0x140
    kswapd+0x6b7/0x930
    kthread+0x107/0x10f
    ret_from_fork+0x3f/0x70
    irq event stamp: 138297
    hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f
    hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f
    softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9
    softirqs last disabled at (137813): irq_exit+0x41/0x95

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(jbd2_handle);

    lock(jbd2_handle);

    *** DEADLOCK ***
    5 locks held by git/20158:
    #0: (sb_writers#7){.+.+.+}, at: [] mnt_want_write+0x24/0x4b
    #1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [] lock_rename+0xd9/0xe3
    #2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [] lock_two_nondirectories+0x3f/0x6b
    #3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [] lock_two_nondirectories+0x66/0x6b
    #4: (jbd2_handle){+.+.?.}, at: [] start_this_handle+0x4ca/0x555

    stack backtrace:
    CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
    Call Trace:
    dump_stack+0x4c/0x6e
    mark_lock+0x384/0x56d
    mark_held_locks+0x5f/0x76
    lockdep_trace_alloc+0xb2/0xb5
    kmem_cache_alloc_trace+0x32/0x1e2
    zcomp_strm_alloc+0x25/0x73 [zram]
    zcomp_strm_multi_find+0xe7/0x173 [zram]
    zcomp_strm_find+0xc/0xe [zram]
    zram_bvec_rw+0x2ca/0x7e0 [zram]
    zram_make_request+0x1fa/0x301 [zram]
    generic_make_request+0x9c/0xdb
    submit_bio+0xf7/0x120
    ext4_io_submit+0x2e/0x43
    ext4_bio_write_page+0x1b7/0x300
    mpage_submit_page+0x60/0x77
    mpage_map_and_submit_buffers+0x10f/0x21d
    ext4_writepages+0xc8c/0xe1b
    do_writepages+0x23/0x2c
    __filemap_fdatawrite_range+0x84/0x8b
    filemap_flush+0x1c/0x1e
    ext4_alloc_da_blocks+0xb8/0x117
    ext4_rename+0x132/0x6dc
    ? mark_held_locks+0x5f/0x76
    ext4_rename2+0x29/0x2b
    vfs_rename+0x540/0x636
    SyS_renameat2+0x359/0x44d
    SyS_rename+0x1e/0x20
    entry_SYSCALL_64_fastpath+0x12/0x6f

    [minchan@kernel.org: add stable mark]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Kyeongdon Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
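
    An abridged sketch of the fix in the stream allocator, as it looked
    before GFP flags were later plumbed through from the frontend (see the
    newer entry above):

    static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
    {
        /* was GFP_KERNEL, which can enter direct reclaim and issue
         * nested I/O while zram locks are held */
        struct zcomp_strm *zstrm = kmalloc(sizeof(*zstrm), GFP_NOIO);

        if (!zstrm)
            return NULL;

        /* backend scratch memory must obey the same rule */
        zstrm->private = comp->backend->create();
        zstrm->buffer = (void *)__get_free_pages(GFP_NOIO | __GFP_ZERO, 1);
        if (!zstrm->private || !zstrm->buffer) {
            zcomp_strm_free(comp, zstrm);
            zstrm = NULL;
        }
        return zstrm;
    }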
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
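
    For experimentation, a plain synchronous O_DIRECT read is enough to
    exercise the polled completion path once polling is enabled for the
    queue (via the io_poll sysfs attribute, on kernels that have it). A
    hypothetical userspace snippet; the device path is only an example:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        /* O_DIRECT bypasses the page cache; buffer, offset and length
         * must be suitably aligned (4096 used here) */
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

        if (fd < 0)
            return 1;
        if (posix_memalign(&buf, 4096, 4096))
            return 1;
        /* completion of this read may be polled for rather than
         * waited on via an interrupt */
        if (pread(fd, buf, 4096, 0) != 4096)
            return 1;
        free(buf);
        close(fd);
        return 0;
    }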
     

08 Nov, 2015

1 commit


07 Nov, 2015

3 commits

  • Make is_partial_io()/valid_io_request()/page_zero_filled() return bool,
    since each of these functions only ever returns one or zero (see the
    sketch below).

    Signed-off-by: Geliang Tang
    Reviewed-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
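
    One of the three, as it reads after the change (simplified from
    drivers/block/zram/zram_drv.c):

    /* a partial request touches less than a full page */
    static inline bool is_partial_io(struct bio_vec *bvec)
    {
        return bvec->bv_len != PAGE_SIZE;
    }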
     
  • `mem_used_max' is designed to store the maximum amount of memory zram
    has consumed to store data. However, it does not hold the actual
    'overcommitted' (max) value. The existing code takes the -ENOMEM
    overcommit path before it updates `->stats.max_used_pages', which hides
    the reason we returned -ENOMEM in the first place -- we actually used
    more memory than `->limit_pages':

    alloced_pages = zs_get_total_pages(meta->mem_pool);
    if (zram->limit_pages && alloced_pages > zram->limit_pages) {
        zs_free(meta->mem_pool, handle);
        ret = -ENOMEM;
        goto out;
    }

    update_used_max(zram, alloced_pages);

    This is misleading: the user sees -ENOMEM, checks `->limit_pages',
    then checks `->stats.max_used_pages' -- which still holds the value
    from BEFORE zram crossed `->limit_pages' -- and observes:
    `->stats.max_used_pages' < `->limit_pages'

    Move update_used_max() before the `->limit_pages' check, so that the
    user instead sees:
    `->stats.max_used_pages' > `->limit_pages'
    should the overcommit and -ENOMEM happen (see the reordered fragment
    below).

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey SENOZHATSKY
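
    The reordered fragment, for comparison with the one quoted above:

    alloced_pages = zs_get_total_pages(meta->mem_pool);
    /* record the peak first, so a subsequent -ENOMEM is explainable */
    update_used_max(zram, alloced_pages);

    if (zram->limit_pages && alloced_pages > zram->limit_pages) {
        zs_free(meta->mem_pool, handle);
        ret = -ENOMEM;
        goto out;
    }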
     
  • When the user supplies an unsupported compression algorithm, keep the
    previously selected one (known to be supported) or the default one (if
    the compression algorithm hasn't been changed yet). A sketch of the
    resulting store handler follows this entry.

    Note that previously this operation (i.e. setting an invalid algorithm)
    would result in no algorithm being selected, which means that this
    represents a small change in the default behaviour.

    Minchan said:

    For initializing zram, we need to set up 3 optional parameters in advance.

    1. the number of compression streams
    2. memory limitation
    3. compression algorithm

    Even if the user passes a completely wrong value for parameters 1 and
    2, it's okay, because they have default values and zram will be
    initialized with those defaults (of course, when the user passes a
    wrong value via *echo*, sysfs returns -EINVAL so the user notices).

    But 3 is not consistent with the other optional parameters. IOW, if
    the user passes a wrong value for parameter 3, zram's initialization
    fails, unlike with the other optional parameters.

    So this patch makes them consistent.

    Signed-off-by: Luis Henriques
    Acked-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis Henriques
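
    A simplified sketch of the resulting store handler (newline handling
    and some details of the actual function are abridged):

    static ssize_t comp_algorithm_store(struct device *dev,
                                        struct device_attribute *attr,
                                        const char *buf, size_t len)
    {
        struct zram *zram = dev_to_zram(dev);

        /* unsupported name: bail out, keeping the previous (or
         * default) algorithm selected */
        if (!zcomp_available_algorithm(buf))
            return -EINVAL;

        down_write(&zram->init_lock);
        if (init_done(zram)) {
            up_write(&zram->init_lock);
            pr_info("Can't change algorithm for initialized device\n");
            return -EBUSY;
        }
        strlcpy(zram->compressor, buf, sizeof(zram->compressor));
        up_write(&zram->init_lock);

        return len;
    }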