08 Apr, 2014

40 commits

  • zswap used zsmalloc before and now uses zbud, but some comments still say
    it uses zsmalloc. Fix these trivial problems.

    Signed-off-by: SeongJae Park
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • Signed-off-by: SeongJae Park
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • zram is a RAM-based block device that can be used as a backend for a
    filesystem. When a filesystem deletes a file, it normally doesn't touch
    the data blocks of that file; it only updates the file's metadata. This
    behavior is fine on a disk-based block device, but is a problem on a
    RAM-based block device, since we can't free the memory used for the data
    blocks. To overcome this disadvantage, there is the REQ_DISCARD
    functionality: if the block device supports REQ_DISCARD and the
    filesystem is mounted with the discard option, the filesystem sends
    REQ_DISCARD to the block device whenever data blocks are discarded. All
    we have to do is handle this request.

    This patch sets QUEUE_FLAG_DISCARD on the queue and handles REQ_DISCARD
    requests, so memory held by zram for discarded blocks can be freed.
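
    A minimal sketch of that pattern, assuming the 3.14-era block layer API;
    the zram_bio_discard() helper and the index/offset arguments are
    illustrative, not quoted from the patch:

    /* at device creation: advertise discard support on the request queue */
    zram->disk->queue->limits.discard_granularity = PAGE_SIZE;
    blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
    queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);

    /* in the bio handling path: free the pages covered by the request */
    if (unlikely(bio->bi_rw & REQ_DISCARD)) {
            zram_bio_discard(zram, index, offset, bio);
            bio_endio(bio, 0);
            return;
    }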

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • sysfs.txt documentation lists the following requirements:

    - The buffer will always be PAGE_SIZE bytes in length. On i386, this
    is 4096.

    - show() methods should return the number of bytes printed into the
    buffer. This is the return value of scnprintf().

    - show() should always use scnprintf().

    Use scnprintf() in show() functions.
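
    For example, a show() function following these rules looks roughly like
    this (dev_to_zram() is the existing zram helper that maps the device back
    to the driver data; the attribute shown is just an illustration):

    static ssize_t disksize_show(struct device *dev,
                                 struct device_attribute *attr, char *buf)
    {
            struct zram *zram = dev_to_zram(dev);

            return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
    }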

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • When zcomp is initialised with a single stream, max_comp_streams cannot
    be changed without a zram reset, but the current interface doesn't report
    any error to the user and even changes max_comp_streams's value without
    any effect, which is very confusing.

    This patch prevents changing max_comp_streams when zcomp was initialised
    as a single-stream zcomp, and reports the error back to the user (e.g. to
    echo).
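
    A simplified sketch of how such a store() handler can report the failure
    (helper names follow the zram driver of that time, but the exact checks
    and error value here are illustrative):

    down_write(&zram->init_lock);
    if (init_done(zram) && !zcomp_set_max_streams(zram->comp, num)) {
            pr_info("Cannot change max compression streams\n");
            up_write(&zram->init_lock);
            return -EINVAL;         /* reported back to the writer (echo) */
    }
    zram->max_comp_streams = num;
    up_write(&zram->init_lock);
    return len;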

    [akpm@linux-foundation.org: don't return with the lock held, per Sergey]
    [fengguang.wu@intel.com: fix coccinelle warnings]
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Fengguang Wu
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Instead of returning just NULL, return an ERR_PTR from zcomp_create() if
    creation of the compression backend has failed: ERR_PTR(-EINVAL) for an
    unsupported compression algorithm request, ERR_PTR(-ENOMEM) for an
    allocation (zcomp or compression stream) error.

    Check the value returned from zcomp_create() with IS_ERR() in
    disksize_store() and set the return code to PTR_ERR().

    Change suggested by Jerome Marchand.
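
    The resulting error handling follows the usual ERR_PTR pattern; a
    simplified sketch (argument list and goto label are illustrative):

    comp = zcomp_create(zram->compressor, zram->max_comp_streams);
    if (IS_ERR(comp)) {
            pr_info("Cannot initialise %s compressing backend\n",
                    zram->compressor);
            err = PTR_ERR(comp);
            goto out_free_meta;
    }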

    [akpm@linux-foundation.org: clean up error recovery flow]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Jerome Marchand
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • While fixing the lockdep spew of ->init_lock reported by Sasha Levin [1],
    Minchan Kim noted [2] that it is better to move compression backend
    allocation (which uses GFP_KERNEL) out of the ->init_lock lock, the same
    way as with zram_meta_alloc(), in order to prevent the same lockdep spew.

    [1] https://lkml.org/lkml/2014/2/27/337
    [2] https://lkml.org/lkml/2014/3/3/32
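
    The change follows the usual pattern of doing GFP_KERNEL allocations
    before taking the lock; a simplified sketch of disksize_store() under
    that assumption:

    /* allocate outside of ->init_lock, as with zram_meta_alloc() */
    comp = zcomp_create(zram->compressor, zram->max_comp_streams);
    meta = zram_meta_alloc(disksize);

    down_write(&zram->init_lock);
    /* publish meta and comp only under the lock */
    zram->meta = meta;
    zram->comp = comp;
    zram->disksize = disksize;
    up_write(&zram->init_lock);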

    Signed-off-by: Sergey Senozhatsky
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Sasha Levin
    Acked-by: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce an LZ4 compression backend and make it available for selection.
    LZ4 support is optional and requires the user to set the
    ZRAM_LZ4_COMPRESS config option. The default compression backend is LZO.

    TEST

    (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
    ext4 file system, 3 compression streams)

    iozone -t 3 -R -r 16K -s 60M -I +Z

    Test                     LZO           LZ4
    ----------------------------------------------
    Initial write      1642744.62    1317005.09
    Rewrite            2498980.88    1800645.16
    Read               3957026.38    5877043.75
    Re-read            3950997.38    5861847.00
    Reverse Read       2937114.56    5047384.00
    Stride read        2948163.19    4929587.38
    Random read        3292692.69    4880793.62
    Mixed workload     1545602.62    3502940.38
    Random write       2448039.75    1758786.25
    Pwrite             1670051.03    1338329.69
    Pread              2530682.00    5097177.62
    Fwrite             3232085.62    3275942.56
    Fread              6306880.25    6645271.12

    So on my system LZ4 is slower in write-only tests, while it performs
    better in read-only and mixed (reads + writes) tests.

    Official LZ4 benchmarks available here http://code.google.com/p/lz4/
    (linux kernel uses revision r90).

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Add and document the `comp_algorithm' device attribute. This attribute
    shows the supported compression algorithms and the currently selected
    one:

    cat /sys/block/zram0/comp_algorithm
    [lzo] lz4

    and allows changing the selected compression algorithm:
    echo lzo > /sys/block/zram0/comp_algorithm

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patch allows changing max_comp_streams on an initialised zcomp.

    Introduce a zcomp set_max_streams() knob, with
    zcomp_strm_multi_set_max_streams() and zcomp_strm_single_set_max_streams()
    callbacks to change the streams limit for zcomp_strm_multi and
    zcomp_strm_single, respectively. set_max_streams for a single-stream
    zcomp does nothing.

    If the user has lowered the limit, then zcomp_strm_multi_set_max_streams()
    attempts to immediately free extra streams (as many as it can, depending
    on idle stream availability).

    Note that this patch does not allow changing the stream `policy' from
    single to multi stream (or vice versa) on an already initialised
    compression backend.
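
    A simplified sketch of the multi-stream callback; field names follow the
    description in this series but are illustrative:

    static bool zcomp_strm_multi_set_max_streams(struct zcomp *comp, int num_strm)
    {
            struct zcomp_strm_multi *zs = comp->stream;
            struct zcomp_strm *zstrm;

            spin_lock(&zs->strm_lock);
            zs->max_strm = num_strm;
            /* if the limit was lowered, free as many idle streams as we can */
            while (zs->avail_strm > num_strm && !list_empty(&zs->idle_strm)) {
                    zstrm = list_entry(zs->idle_strm.next,
                                       struct zcomp_strm, list);
                    list_del(&zstrm->list);
                    zcomp_strm_free(comp, zstrm);
                    zs->avail_strm--;
            }
            spin_unlock(&zs->strm_lock);
            return true;
    }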

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The existing zram (zcomp) implementation has only one compression stream
    (buffer and algorithm private part), so in order to prevent data
    corruption only one write (compress operation) can use this compression
    stream, forcing all concurrent write operations to wait for the stream
    lock to be released. This patch changes zcomp to keep a list of
    compression streams of user-defined size (via a sysfs device attr). Each
    write operation still exclusively holds a compression stream; the
    difference is that we can have N write operations (depending on the size
    of the streams list) executing in parallel. See the TEST section later
    in the commit message for performance data.

    Introduce struct zcomp_strm_multi and a set of functions to manage
    zcomp_strm stream access. zcomp_strm_multi has a list of idle
    zcomp_strm structs, spinlock to protect idle list and wait queue, making
    it possible to perform parallel compressions.

    The following set of functions added:
    - zcomp_strm_multi_find()/zcomp_strm_multi_release()
    find and release a compression stream, implement required locking
    - zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
    create and destroy zcomp_strm_multi

    zcomp ->strm_find() and ->strm_release() callbacks are set during
    initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
    correspondingly.

    Each time zcomp issues a zcomp_strm_multi_find() call, the following set
    of operations is performed:

    - spin lock strm_lock
    - if the idle list is not empty, remove a zcomp_strm from the idle list,
    spin unlock and return the zcomp stream pointer to the caller
    - if the idle list is empty, the current task adds itself to the wait
    queue; it will be woken up by a zcomp_strm_multi_release() caller.

    zcomp_strm_multi_release():
    - spin lock strm_lock
    - add the zcomp stream to the idle list
    - spin unlock, wake up a sleeper
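
    A simplified sketch of this find/release scheme (the real code can also
    allocate a new stream when below the limit; field names here are
    illustrative):

    static struct zcomp_strm *zcomp_strm_multi_find(struct zcomp *comp)
    {
            struct zcomp_strm_multi *zs = comp->stream;
            struct zcomp_strm *zstrm;

            while (1) {
                    spin_lock(&zs->strm_lock);
                    if (!list_empty(&zs->idle_strm)) {
                            zstrm = list_entry(zs->idle_strm.next,
                                               struct zcomp_strm, list);
                            list_del(&zstrm->list);
                            spin_unlock(&zs->strm_lock);
                            return zstrm;
                    }
                    spin_unlock(&zs->strm_lock);
                    /* sleep until a stream is released to the idle list */
                    wait_event(zs->strm_wait, !list_empty(&zs->idle_strm));
            }
    }

    static void zcomp_strm_multi_release(struct zcomp *comp,
                                         struct zcomp_strm *zstrm)
    {
            struct zcomp_strm_multi *zs = comp->stream;

            spin_lock(&zs->strm_lock);
            list_add(&zstrm->list, &zs->idle_strm);
            spin_unlock(&zs->strm_lock);
            wake_up(&zs->strm_wait);
    }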

    Minchan Kim reported that the spinlock-based locking scheme demonstrates
    a severe performance regression for the single compression stream case
    compared to the mutex-based one (see https://lkml.org/lkml/2014/2/18/16).

                      base                   spinlock                mutex

    ==Initial write
    records:          5                      5                       5
    avg:              1642424.35             699610.40               1655583.71
    std:              39890.95(2.43%)        232014.19(33.16%)       52293.96
    max:              1690170.94             1163473.45              1697164.75
    min:              1568669.52             573429.88               1553410.23

    ==Rewrite
    records:          5                      5                       5
    avg:              1611775.39             501406.64               1684419.11
    std:              17144.58(1.06%)        15354.41(3.06%)         18367.42
    max:              1641800.95             531356.78               1706445.84
    min:              1593515.27             488817.78               1655335.73

    When only one compression stream is available, a mutex with spin-on-owner
    tends to perform much better than frequent wait_event()/wake_up(). This
    is why the single stream case is implemented specially, with mutex
    locking.

    Introduce and document zram device attribute max_comp_streams. This
    attr shows and stores current zcomp's max number of zcomp streams
    (max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter.
    `max_strm' limits the number of zcomp_strm structs in compression
    backend's idle list (max_comp_streams).

    max_comp_streams is used during initialisation as follows:
    -- passing max_strm equal to 1 to zcomp_create() will initialise zcomp
    using the single compression stream zcomp_strm_single (mutex-based
    locking).
    -- passing max_strm greater than 1 to zcomp_create() will initialise
    zcomp using the multi compression stream zcomp_strm_multi (spinlock-based
    locking).

    The default max_comp_streams value is 1, meaning that zram will be
    initialised with a single stream.

    A later patch will introduce a configuration knob to change
    max_comp_streams on an already initialised and used zcomp.

    TEST
    iozone -t 3 -R -r 16K -s 60M -I +Z

    test              base        1 strm (mutex)    3 strm (spinlock)
    -----------------------------------------------------------------------
    Initial write      589286.78      583518.39           718011.05
    Rewrite            604837.97      596776.38          1515125.72
    Random write       584120.11      595714.58          1388850.25
    Pwrite             535731.17      541117.38           739295.27
    Fwrite            1418083.88     1478612.72          1484927.06

    Usage example:
    set max_comp_streams to 4
    echo 4 > /sys/block/zram0/max_comp_streams

    show current max_comp_streams (default value is 1).
    cat /sys/block/zram0/max_comp_streams

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This is a preparation patch for adding multi stream support to zcomp.

    Introduce struct zcomp_strm_single and a set of functions to manage
    zcomp_strm stream access. zcomp_strm_single implements a single
    compression stream, the same way as the current zcomp implementation.
    This moves zcomp_strm stream control and locking out of zcomp, so the
    compression backend core is not aware of the required locking.

    Single and multi streams require different locking schemes. Minchan Kim
    reported that the spinlock-based locking scheme (which is used in the
    multi stream implementation) demonstrates a severe performance regression
    for the single compression stream case, compared to the mutex-based one;
    see https://lkml.org/lkml/2014/2/18/16

    The following set of functions added:
    - zcomp_strm_single_find()/zcomp_strm_single_release()
    find and release a compression stream, implement required locking
    - zcomp_strm_single_create()/zcomp_strm_single_destroy()
    create and destroy zcomp_strm_single

    New ->strm_find() and ->strm_release() callbacks added to zcomp, which are
    set to zcomp_strm_single_find() and zcomp_strm_single_release() during
    initialisation. Instead of direct locking and zcomp_strm access from
    zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
    and ->strm_release() correspondingly.
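
    In the single-stream case the callbacks reduce to plain mutex locking; a
    simplified sketch (field names are illustrative):

    static struct zcomp_strm *zcomp_strm_single_find(struct zcomp *comp)
    {
            struct zcomp_strm_single *zs = comp->stream;

            mutex_lock(&zs->strm_lock);
            return zs->zstrm;
    }

    static void zcomp_strm_single_release(struct zcomp *comp,
                                          struct zcomp_strm *zstrm)
    {
            struct zcomp_strm_single *zs = comp->stream;

            mutex_unlock(&zs->strm_lock);
    }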

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not perform direct LZO compress/decompress calls, initialise
    and use zcomp LZO backend (single compression stream) instead.

    [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • ZRAM performs direct LZO compression algorithm calls, making LZO the one
    and only option. While LZO generally performs well, the LZ4 algorithm
    tends to have faster decompression (see http://code.google.com/p/lz4/
    for the full report):

    Name          Ratio    C.speed    D.speed
                            (MB/s)     (MB/s)
    LZ4 (r101)    2.084      422       1820
    LZO 2.06      2.106      414        600

    Thus, users who have mostly-read (decompress) usage scenarios or mixed
    workloads (writes with a relatively high number of read ops) will benefit
    from using the LZ4 compression backend.

    Introduce compressing backend abstraction zcomp in order to support
    multiple compression algorithms with the following set of operations:

    .create
    .destroy
    .compress
    .decompress

    Schematically zram write() usually contains the following steps:
    0) preparation (decompression of partial IO, etc.)
    1) lock buffer_lock mutex (protects meta compress buffers)
    2) compress (using meta compress buffers)
    3) alloc and map zs_pool object
    4) copy compressed data (from meta compress buffers) to object allocated by 3)
    5) free previous pool page, assign a new one
    6) unlock buffer_lock mutex

    As we can see, the compression buffers must remain untouched from 1) to
    4), because otherwise a concurrent write() could overwrite the data. At
    the same time, zram_meta must be aware of a) the specific compression
    algorithm's memory requirements and b) the locking necessary to protect
    the compression buffers. To remove requirement a), a new struct
    zcomp_strm is introduced, which contains a compress/decompress `buffer'
    and the compression algorithm's `private' part. Struct zcomp, in turn,
    implements zcomp_strm stream handling and locking, removing requirement
    b) from zram meta. zcomp ->create() and ->destroy(), respectively,
    allocate and deallocate the algorithm-specific zcomp_strm `private' part.

    Every zcomp has a zcomp stream and a mutex to protect its compression
    stream. Stream usage semantics remain the same -- only one write can
    hold the stream lock and use its buffers. zcomp_strm_find() turns the
    caller into the exclusive user of a stream (holding the stream mutex
    until zram releases the stream), and zcomp_strm_release() makes the
    zcomp stream available again (unlocks the stream mutex). Hence no
    concurrent write (compression) operations are possible at the moment.

    iozone -t 3 -R -r 16K -s 60M -I +Z

    test              base          patched
    --------------------------------------------------
    Initial write      597992.91     591660.58
    Rewrite            609674.34     616054.97
    Read              2404771.75    2452909.12
    Re-read           2459216.81    2470074.44
    Reverse Read      1652769.66    1589128.66
    Stride read       2202441.81    2202173.31
    Random read       2236311.47    2276565.31
    Mixed workload    1423760.41    1709760.06
    Random write       579584.08     615933.86
    Pwrite             597550.02     594933.70
    Pread             1703672.53    1718126.72
    Fwrite            1330497.06    1461054.00
    Fread             3922851.00    3957242.62

    Usage examples:

    comp = zcomp_create(NAME) /* NAME e.g. "lzo" */

    which initialises compressing backend if requested algorithm is supported.

    Compress:
    zstrm = zcomp_strm_find(comp)
    zcomp_compress(comp, zstrm, src, &dst_len)
    [..] /* copy compressed data */
    zcomp_strm_release(comp, zstrm)

    Decompress:
    zcomp_decompress(comp, src, src_len, dst);

    Free the compressing backend and its zcomp stream:
    zcomp_destroy(comp)

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Allocate a new `zram_meta' in disksize_store() only for an uninitialised
    zram device, saving a number of allocations and deallocations if
    disksize_store() is called on a device that is currently in use. At the
    same time, the zram_meta stack variable is not necessary, because we can
    set ->meta directly. There is also no need to set the QUEUE_FLAG_NONROT
    flag on every disksize_store(); set it once during device creation.

    [minchan@kernel.org: handle zram->meta alloc fail case]
    [minchan@kernel.org: prevent lockdep spew of init_lock]
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Document `failed_reads' and `failed_writes' device attributes.
    Remove info about `discard' - there is no such zram attr.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Move zram warning about disksize and size of memory correlation to zram
    documentation.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • struct table `count' member is not used.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • zram accounted for, but did not report, the number of failed read and
    write queries. Make these stats available as the failed_reads and
    failed_writes attrs.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce a ZRAM_ATTR_RO macro that generates a device_attribute and a
    default show() function for the existing atomic64_t zram stats.
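
    A sketch of what such a macro can look like (dev_to_zram() is the
    existing zram helper; the exact attribute wiring is illustrative):

    #define ZRAM_ATTR_RO(name)                                          \
    static ssize_t name##_show(struct device *d,                        \
                               struct device_attribute *attr, char *b)  \
    {                                                                   \
            struct zram *zram = dev_to_zram(d);                         \
            return scnprintf(b, PAGE_SIZE, "%llu\n",                    \
                             (u64)atomic64_read(&zram->stats.name));    \
    }                                                                   \
    static struct device_attribute dev_attr_##name =                    \
            __ATTR(name, S_IRUGO, name##_show, NULL);

    /* usage: */
    ZRAM_ATTR_RO(failed_reads);
    ZRAM_ATTR_RO(failed_writes);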

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This is a preparation patch for stats code duplication removal.

    1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.

    2) The `compr_size' and `pages_zero' struct zram_stats members did not
    follow the existing device attr naming scheme: zram_stats.ATTR has an
    ATTR_show() function. Rename them:

    -- compr_size -> compr_data_size
    -- pages_zero -> zero_pages

    Minchan Kim's note:
    If we really have trouble with atomic stat operation, we could
    change it with percpu_counter so that it could solve atomic overhead and
    unnecessary memory space by introducing unsigned long instead of 64bit
    atomic_t.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Remove the `good' and `bad' compressed sub-request stats. An RW request
    may cause a number of RW sub-requests. zram used to account `good'
    compressed sub-requests (with a compressed size less than 50% of the
    original size) and `bad' compressed sub-requests (with a compressed size
    greater than 75% of the original size), leaving sub-requests with a
    compressed size between 50% and 75% of the original size unaccounted and
    unreported. zram already accounts each sub-request's compressed size, so
    we can calculate the real device compression ratio.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Do not pass the rw argument down the __zram_make_request() ->
    zram_bvec_rw() chain; decode it in zram_bvec_rw() instead. Besides, this
    is the place where we distinguish READ and WRITE bio data directions, so
    account zram RW stats here instead of in __zram_make_request(). This
    also allows accounting the real number of zram READ/WRITE operations,
    not just requests (a single RW request may cause a number of zram RW ops
    with separate locking, compression/decompression, etc).

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Introduce an init_done() helper function which allows us to drop the
    `init_done' struct zram member. init_done() uses the fact that
    ->init_done == 1 is equivalent to ->meta != NULL.
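
    The helper is essentially (a sketch):

    static inline bool init_done(struct zram *zram)
    {
            return zram->meta != NULL;
    }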

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Acked-by: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • A new dump_page() routine was recently added, and marked
    EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
    macro, and so the end result is that non-GPL code can no longer call
    get_page() and a few other routines.

    This only happens if the kernel was compiled with CONFIG_DEBUG_VM.

    Change dump_page() to be EXPORT_SYMBOL.

    Longer explanation:

    Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
    using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL)
    drivers on Fedora. Fedora is semi-unique, in that it sets
    CONFIG_DEBUG_VM.

    Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(),
    via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
    new, open source, "UVM-Lite" kernel module, I originally chose to use
    the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
    because get_page() has always seemed to be very clearly intended for use
    by non-GPL, driver code.

    So I'm hoping that making get_page() widely accessible again will not be
    too controversial. We did check with Fedora first, and they responded
    (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
    try to get upstream changed, before asking Fedora to change. Their
    reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
    Fedora to help catch mm bugs.

    Signed-off-by: John Hubbard
    Cc: Sasha Levin
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However
    LAST_CPUPID_WIDTH itself can be 0. (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
    set). In such a case LAST_CPUPID_MASK turns out to be 0.

    But with recent commit 1ae71d0319 ("mm: numa: bugfix for
    LAST_CPUPID_NOT_IN_PAGE_FLAGS"), if LAST_CPUPID_MASK is 0,
    page_cpupid_xchg_last() and page_cpupid_reset_last() cause
    page->_last_cpupid to be set to 0.

    This causes a performance regression. It's almost as if numa_balancing
    were off.

    Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
    LAST_CPUPID_WIDTH.
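
    That is, the mask definition becomes (roughly):

    #define LAST_CPUPID_MASK  ((1UL << LAST_CPUPID_SHIFT) - 1)

    so the mask stays non-zero even when LAST_CPUPID_WIDTH is 0 (the
    LAST_CPUPID_NOT_IN_PAGE_FLAGS case).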

    Some performance numbers and perf stats with and without the fix.

    (3.14-rc6)
    ----------
    numa01

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

    12,27,462 cs [100.00%]
    2,41,957 migrations [100.00%]
    1,68,01,713 faults [100.00%]
    7,99,35,29,041 cache-misses
    98,808 migrate:mm_migrate_pages [100.00%]

    1407.690148814 seconds time elapsed

    numa02

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

    63,065 cs [100.00%]
    14,364 migrations [100.00%]
    2,08,118 faults [100.00%]
    25,32,59,404 cache-misses
    12 migrate:mm_migrate_pages [100.00%]

    63.840827219 seconds time elapsed

    (3.14-rc6 with fix)
    -------------------
    numa01

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':

    9,68,911 cs [100.00%]
    1,01,414 migrations [100.00%]
    88,38,697 faults [100.00%]
    4,42,92,51,042 cache-misses
    4,25,060 migrate:mm_migrate_pages [100.00%]

    685.965331189 seconds time elapsed

    numa02

    Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':

    17,543 cs [100.00%]
    2,962 migrations [100.00%]
    1,17,843 faults [100.00%]
    11,80,61,644 cache-misses
    12,358 migrate:mm_migrate_pages [100.00%]

    20.380132343 seconds time elapsed

    Signed-off-by: Srikar Dronamraju
    Cc: Liu Ping Fan
    Reviewed-by: Aneesh Kumar K.V
    Cc: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
  • s/MADV_NODUMP/MADV_DONTDUMP/

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit() with a single function call.

    Move ra_submit() to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on the different versions.

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • When I decrease the value of nr_hugepages in procfs a lot, a softlockup
    happens. It is because there is no chance of a context switch during
    this process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, so the softlockup doesn't happen during
    that process. So it is necessary to add a context switch to the freeing
    process, just as in the allocation process, to avoid the softlockup.
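
    One way to get that context switch while the freeing loop holds
    hugetlb_lock is cond_resched_lock(); a sketch of the idea (an assumed
    detail, not quoted from the patch):

    while (min_count < persistent_huge_pages(h)) {
            if (!free_pool_huge_page(h, nodes_allowed, 0))
                    break;
            cond_resched_lock(&hugetlb_lock);
    }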

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare a machine equipped with 12 TB of memory now. However, I
    confirmed that the number of hugepages being freed was directly
    proportional to the time required.

    I measured the required time on a smaller machine. It showed that
    130-145 hugepages were freed per millisecond.

    Number of freed            Required time    Freeing rate
    hugepages                  (msec)           (pages/msec)
    ------------------------------------------------------------
    10,000 pages == 20GB        70 - 74          135-142
    30,000 pages == 60GB       208 - 229         131-144

    This means that, at this rate, freeing 6 TB of hugepages will trigger a
    softlockup with the default threshold of 20 sec.

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • Replace ((phys_addr_t)(x) << PAGE_SHIFT) by pfn macro.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • This is a small cleanup.

    Signed-off-by: Emil Medve
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Emil Medve
     
  • There's only one caller of set_page_dirty_balance() and that will call it
    with page_mkwrite == 0.

    The page_mkwrite argument has been unused since commit b827e496c893
    ("mm: close page_mkwrite races").

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
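
    The try_to_unmap_cluster() hunk looks roughly like this (sketch):

    if (page == check_page) {
            /* we already hold the lock on check_page */
            mlock_vma_page(page);
            ret = SWAP_MLOCK;
    } else if (trylock_page(page)) {
            /*
             * If we can lock the page, perform mlock; otherwise leave
             * the page alone, it will be encountered again later.
             */
            mlock_vma_page(page);
            unlock_page(page);
    }
    continue;       /* don't unmap */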

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some callsites pass a memcg directly, some callsites pass an mm that
    then has to be translated to a memcg. This makes for a terrible
    function interface.

    Just push the mm-to-memcg translation into the respective callsites and
    always pass a memcg to mem_cgroup_try_charge().

    [mhocko@suse.cz: add charge mm helper]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
    which came without a memcg. The only reason seems to be a tiny
    optimization when css_tryget is not called if the charge can be consumed
    from the stock. Nevertheless css_tryget is very cheap since it has been
    reworked to use per-cpu counting so this optimization doesn't give us
    anything these days.

    So let's drop the code duplication so that the code is more readable.

    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
    owner is exiting, just return root_mem_cgroup. This makes sense for all
    callsites and gets rid of some of them having to fallback manually.

    [fengguang.wu@intel.com: fix warnings]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Fengguang Wu
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Users pass either a mm that has been established under task lock, or use
    a verified current->mm, which means the task can't be exiting.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only page cache charges can happen without an mm context, so push this
    special case out of the inner core and into the cache charge function.

    An ancient comment explains that the mm can also be NULL in case the
    task is currently being migrated, but that is no longer actually the
    case, so just remove it.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner