13 Feb, 2015

40 commits

  • The file uses nothing from init.h, and also doesn't need the full module.h
    machinery; export.h is sufficient. The latter requires the user to ensure
    compiler.h is included, so do that explicitly instead of relying on some
    other header pulling it in.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • This patch makes hexdump return the number of bytes placed in the buffer,
    excluding the trailing NUL. In the case of overflow it returns the number
    of bytes needed to produce the entire dump. Thus, it mimics snprintf().

    This will be useful for users that would like to repeat with a bigger
    buffer.
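
    The contract matches snprintf(), so callers can size the buffer by asking
    first and retrying. A minimal userspace sketch of that pattern follows;
    hex_to_buf() and dump_all() are made-up stand-ins for hex_dump_to_buffer()
    (groupsize 1, no ASCII part), not the kernel's actual code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for hex_dump_to_buffer(): writes "xx xx ..." into buf,
 * returns the length the complete dump needs (excluding the trailing
 * NUL), writing only the chunks that fit -- the snprintf() contract. */
static int hex_to_buf(const unsigned char *src, size_t len,
                      char *buf, size_t buflen)
{
    size_t pos = 0;                 /* length of the complete dump */

    if (buflen)
        buf[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        char tmp[4];
        int n = snprintf(tmp, sizeof(tmp), i ? " %02x" : "%02x", src[i]);

        if (pos + (size_t)n < buflen)       /* chunk + NUL fit? */
            memcpy(buf + pos, tmp, (size_t)n + 1);
        pos += (size_t)n;
    }
    return (int)pos;
}

/* Retry pattern enabled by the new return value: ask for the size
 * with a zero-length buffer, then allocate exactly what is needed. */
static char *dump_all(const unsigned char *src, size_t len)
{
    int need = hex_to_buf(src, len, NULL, 0);
    char *buf = malloc((size_t)need + 1);

    if (buf)
        hex_to_buf(src, len, buf, (size_t)need + 1);
    return buf;
}
```

    The caller frees the returned string; a truncated first attempt can be
    detected by comparing the return value against the buffer size, exactly
    as with snprintf().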

    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Instead of doing the calculations separately in each groupsize case,
    let's do them beforehand. While there, change the switch to an
    if-else-if construction.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • In the current implementation we have a floating ascii column in the tail
    of the dump.

    For example, for a row size of 16, the ascii column starts at the
    offsets shown in the following table:

    group size \ length     8    12    16
    1                      50    50    50
    2                      22    32    42
    4                      20    29    38
    8                      19     -    36

    This patch puts it at the same offset regardless of the number of bytes
    dumped.

    The change is safe since all current users, which use ASCII part of the
    dump, rely on the group size equal to 1. The patch doesn't change
    behaviour for such group size (see the table above).

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Test different scenarios of the function calls located in lib/hexdump.c.

    Currently only hex_dump_to_buffer() is tested, and test data is provided
    only for little-endian CPUs.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Since chunk->end_addr is (chunk->start_addr + size - 1), the end address
    to compare should be (start + size - 1).
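
    Since end_addr is inclusive, the containment check looks like this
    userspace sketch (the field names mirror the kernel's gen_pool chunk,
    but the struct and helper here are illustrative, not the actual code):

```c
#include <stdbool.h>
#include <stddef.h>

/* chunk->end_addr is inclusive: start_addr + size - 1.  So when
 * testing whether [start, start + size) lies inside a chunk, the
 * last address to compare is start + size - 1, not start + size. */
struct chunk {
    unsigned long start_addr;
    unsigned long end_addr;     /* inclusive */
};

static bool addr_in_chunk(const struct chunk *c,
                          unsigned long start, size_t size)
{
    unsigned long end = start + size - 1;   /* the fix */

    return start >= c->start_addr && end <= c->end_addr;
}
```

    With the old off-by-one (comparing start + size), a region that exactly
    fills a chunk would wrongly be rejected.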

    Signed-off-by: Toshi Kikuchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kikuchi
     
  • Now that all in-tree users of strnicmp have been converted to
    strncasecmp, the wrapper can be removed.

    Signed-off-by: Rasmus Villemoes
    Cc: David Howells
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Also, rename bits to nbits. Both changes for consistency with other
    bitmap_* functions.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Make the return value and the ord and nbits parameters of
    bitmap_ord_to_pos unsigned.

    Also, simplify the implementation and as a side effect make the result
    fully defined, returning nbits for ord >= weight, in analogy with what
    find_{first,next}_bit does. This is a better sentinel than the former
    ("unofficial") 0. No current users are affected by this change.
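
    A single-word userspace model of the new semantics (the kernel's
    bitmap_ord_to_pos works on arrays of unsigned long; this sketch only
    shows the contract, including the nbits sentinel):

```c
/* Return the position of the ord-th set bit in word, or nbits when
 * ord >= weight -- the same sentinel find_first_bit()/find_next_bit()
 * use for "nothing found". */
static unsigned int ord_to_pos(unsigned long word,
                               unsigned int ord, unsigned int nbits)
{
    for (unsigned int pos = 0; pos < nbits; pos++)
        if (((word >> pos) & 1) && ord-- == 0)
            return pos;
    return nbits;           /* sentinel: ord >= bitmap weight */
}
```

    For word 0b10110 (set bits at 1, 2, 4), ord 0 maps to position 1 and
    any ord >= 3 yields nbits.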

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The ordinal of a set bit is simply the number of set bits before it;
    counting those doesn't need to be done one bit at a time. While at it,
    update the parameters to unsigned int.

    It is not completely unthinkable that gcc would see pos as compile-time
    constant 0 in one of the uses of bitmap_pos_to_ord. Since the static
    inline frontend bitmap_weight doesn't handle nbits==0 correctly (it would
    behave exactly as if nbits==BITS_PER_LONG), use __bitmap_weight.

    Alternatively, the last line could be spelled bitmap_weight(buf, pos+1)-1,
    but this is simpler.
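
    The counting reduces to one popcount. Single-word userspace sketch (the
    kernel version calls __bitmap_weight() over an array; the mask below
    assumes pos is less than the word width):

```c
/* Ordinal of the set bit at pos == number of set bits strictly
 * before pos == weight of the low pos bits of the word. */
static int pos_to_ord(unsigned long word, unsigned int pos)
{
    if (!((word >> pos) & 1))
        return -1;                          /* bit at pos not set */
    return __builtin_popcountl(word & ((1UL << pos) - 1));
}
```

    For word 0b10110, the bit at position 4 has two set bits before it, so
    its ordinal is 2.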

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Change the sz and nbits parameters of bitmap_fold to unsigned int for
    consistency with other bitmap_* functions, and to save another few bytes
    in the generated code.

    [akpm@linux-foundation.org: fix kerneldoc]
    Signed-off-by: Rasmus Villemoes
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Change the nbits parameter of bitmap_onto to unsigned int for consistency
    with other bitmap_* functions.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Since the various bitmap_* functions now take an unsigned int as nbits
    parameter, it makes sense to also update the various wrappers, even though
    they're marked as obsolete.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Since the various bitmap_* functions now take an unsigned int as nbits
    parameter, it makes sense to also update the various wrappers.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • For consistency with the other bitmap_* functions, also make the nbits
    parameter of bitmap_zero, bitmap_fill and bitmap_copy unsigned.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • string_get_size() was documented to return an error, but in fact always
    returned 0. Since the output always fits in 9 bytes, just document that
    and let callers do what they do now: pass a small stack buffer and ignore
    the return value.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The remainder from do_div is always a u32, and after size has been reduced
    to be below 1000 (or 1024), it certainly fits in u32. So both remainder
    and sf_cap can be made u32s, the format specifiers can be simplified (%lld
    wasn't the right thing to use for _unsigned_ long long anyway), and we can
    replace a do_div with an ordinary 32/32 bit division.

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • While commit 3c9f3681d0b4 ("[SCSI] lib: add generic helper to print
    sizes rounded to the correct SI range") says that Z and Y are included
    in preparation for 128 bit computers, they just waste .text currently.
    If and when we get u128, string_get_size needs updating anyway (and ISO
    needs to come up with four more prefixes).

    Also there's no need to include and test for the NULL sentinel; once we
    reach "E" size is at most 18. [The test is also wrong; it should be
    units_str[units][i+1]; if we've reached NULL we're already doomed.]

    Signed-off-by: Rasmus Villemoes
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • All callers of skip_atoi have already checked that the first character
    is a digit. In this case, gcc generates simpler code for a do-while
    loop.
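
    The resulting helper has the shape below (userspace copy of the
    lib/vsprintf.c helper for illustration):

```c
#include <ctype.h>

/* Callers guarantee isdigit(**s) on entry, so a do-while needs no
 * initial test and gcc emits slightly simpler code. */
static int skip_atoi(const char **s)
{
    int i = 0;

    do {
        i = i * 10 + *((*s)++) - '0';
    } while (isdigit((unsigned char)**s));

    return i;
}
```

    The pointer is advanced past the digits as a side effect, leaving *s at
    the first non-digit character.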

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • On 64 bit, size may very well be huge even if bit 31 happens to be 0.
    Somehow it doesn't feel right that one can pass a 5 GiB buffer but not a
    3 GiB one. So cap at INT_MAX as was probably the intention all along.
    This is also the made-up value passed by sprintf and vsprintf.

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • It seems a little simpler to consume the p from a %p specifier in
    format_decode, just as it is done for the surrounding %c, %s and %% cases.

    While there, delete a redundant and misplaced comment.

    Signed-off-by: Rasmus Villemoes
    Cc: Jiri Kosina
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Neaten the MODULE_PARM_DESC message.
    Use 30 seconds in the comment for the zap console locks timeout.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so
    this only helps people who only test-compile using 3.3 (compiler-gcc3.h
    barks at anything older than that). Besides, there are almost no
    occurrences of __FUNCTION__ left in the tree.

    [akpm@linux-foundation.org: convert remaining __FUNCTION__ references]
    Signed-off-by: Rasmus Villemoes
    Cc: Michal Nazarewicz
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • On POWER8 virtualised kernels the VTB register can be read to have a view
    of time that only increases while the guest is running. This will prevent
    guests from seeing time jump if a guest is paused for significant amounts
    of time.

    On POWER7 and below virtualised kernels, stolen time is subtracted from
    local_clock as a best-effort approximation. This will not eliminate
    spurious warnings in the case of a suspended guest but may reduce the
    occurrence in the case of softlockups due to host overcommit.

    Bare metal kernels should avoid reading the VTB as KVM does not restore
    sane values when not executing; the approximation is fine as host kernels
    won't observe any stolen time.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cyril Bur
    Cc: Michael Ellerman
    Cc: Andrew Jones
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: chai wen
    Cc: Fabian Frederick
    Cc: Aaron Tomlin
    Cc: Ben Zhang
    Cc: Martin Schwidefsky
    Cc: John Stultz
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Bur
     
  • When the hypervisor pauses a virtualised kernel the kernel will observe a
    jump in timebase, this can cause spurious messages from the softlockup
    detector.

    Whilst these messages are harmless, they are accompanied by a stack
    trace which causes undue concern; more problematically, the stack trace
    in the guest has nothing to do with the observed problem and can only be
    misleading.

    Furthermore, on POWER8 this is completely avoidable with the introduction
    of the Virtual Time Base (VTB) register.

    This patch (of 2):

    This permits the use of arch-specific clocks for which virtualised kernels
    can use their notion of 'running' time, not the elapsed wall time which
    will include host execution time.

    Signed-off-by: Cyril Bur
    Cc: Michael Ellerman
    Cc: Andrew Jones
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: chai wen
    Cc: Fabian Frederick
    Cc: Aaron Tomlin
    Cc: Ben Zhang
    Cc: Martin Schwidefsky
    Cc: John Stultz
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Bur
     
  • Everybody uses unsigned long for pgoff_t, and no one ever overrode the
    definition of pgoff_t. Keep it that way, and remove the option of
    overriding it.

    Signed-off-by: Geert Uytterhoeven
    Cc: Randy Dunlap
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Have git ignore the Debian directory created when running:
    make tar-pkg / targz-pkg / tarbz2-pkg / tarxz-pkg

    Signed-off-by: Andrey Skvortsov
    Cc: Michal Marek
    Cc: Greg Kroah-Hartman
    Cc: Boaz Harrosh
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Skvortsov
     
  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Instead of a custom approach, let's use string_escape_str() to escape a
    given string (task_name in this case).

    Signed-off-by: Andy Shevchenko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • The output of /proc/$pid/numa_maps is in terms of number of pages like
    anon=22 or dirty=54. Here's some output:

    7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
    7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
    7fff8d425000 default stack anon=50 dirty=50 N0=50

    Looks like we have a stack and a couple of anonymous hugetlbfs areas
    which all appear to use the same amount of memory. They don't.

    The 'bigfile' uses 1GB pages and takes up ~50GB of space. The
    anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
    uses normal 4k pages. You can go over to smaps to figure out what the
    page size _really_ is with KernelPageSize or MMUPageSize. But, I think
    this is a pretty nasty and counterintuitive interface as it stands.

    This patch introduces a 'kernelpagesize_kB' line element to the
    /proc/<pid>/numa_maps report file in order to help identify the size of
    pages that are backing memory areas mapped by a given task. This is
    especially useful to help differentiate between HUGE and GIGANTIC page
    backed VMAs.

    This patch is based on Dave Hansen's proposal and reviewers' follow-ups
    taken from the following discussion threads:
    * https://lkml.org/lkml/2011/9/21/454
    * https://lkml.org/lkml/2014/12/20/66

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Add a small section to proc.txt doc in order to document its
    /proc/pid/numa_maps interface. It does not introduce any functional
    changes, just documentation.

    Signed-off-by: Rafael Aquini
    Cc: Johannes Weiner
    Cc: Dave Hansen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Use the PDE() helper to get proc_dir_entry instead of coding it directly.

    Signed-off-by: Alexander Kuleshov
    Acked-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Peak resident size of a process can be reset back to the process's
    current rss value by writing "5" to /proc/pid/clear_refs. The driving
    use-case for this would be getting the peak RSS value, which can be
    retrieved from the VmHWM field in /proc/pid/status, per benchmark
    iteration or test scenario.
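
    In userspace the per-iteration measurement could look like the sketch
    below (requires a kernel with this patch applied; error handling kept
    minimal, helper names are made up):

```c
#include <stdio.h>

/* Read the peak RSS (VmHWM, in kB) of the current process from
 * /proc/self/status. */
static long read_vmhwm_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "VmHWM: %ld", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* Reset the peak back to the current RSS: write "5" to clear_refs. */
static int reset_peak_rss(void)
{
    FILE *f = fopen("/proc/self/clear_refs", "w");

    if (!f)
        return -1;
    fputs("5", f);
    return fclose(f);
}
```

    A benchmark harness would call reset_peak_rss() before each iteration
    and read_vmhwm_kb() after it.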

    [akpm@linux-foundation.org: clarify behaviour in documentation]
    Signed-off-by: Petr Cermak
    Cc: Bjorn Helgaas
    Cc: Primiano Tucci
    Cc: Petr Cermak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Cermak
     
  • Remove the function search_one_table() that is not used anywhere.

    This was partially found by using a static code analysis program called
    cppcheck.

    Signed-off-by: Rickard Strandqvist
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rickard Strandqvist
     
  • Keeping zsmalloc fragmentation at a low level is our target. But now
    we still need to add debug code to zsmalloc to get the quantitative
    data.

    This patch adds a new configuration option CONFIG_ZSMALLOC_STAT to
    enable statistics collection for developers. Currently only the object
    statistics in each class are collected. Users can get the information
    via debugfs.

    cat /sys/kernel/debug/zsmalloc/zram0/...

    For example:

    After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
    class  size  obj_allocated  obj_used  pages_used
        0    32              0         0           0
        1    48            256        12           3
        2    64             64        14           1
        3    80             51         7           1
        4    96            128         5           3
        5   112             73         5           2
        6   128             32         4           1
        7   144              0         0           0
        8   160              0         0           0
        9   176              0         0           0
       10   192              0         0           0
       11   208              0         0           0
       12   224              0         0           0
       13   240              0         0           0
       14   256             16         1           1
       15   272             15         9           1
       16   288              0         0           0
       17   304              0         0           0
       18   320              0         0           0
       19   336              0         0           0
       20   352              0         0           0
       21   368              0         0           0
       22   384              0         0           0
       23   400              0         0           0
       24   416              0         0           0
       25   432              0         0           0
       26   448              0         0           0
       27   464              0         0           0
       28   480              0         0           0
       29   496             33         1           4
       30   512              0         0           0
       31   528              0         0           0
       32   544              0         0           0
       33   560              0         0           0
       34   576              0         0           0
       35   592              0         0           0
       36   608              0         0           0
       37   624              0         0           0
       38   640              0         0           0
       40   672              0         0           0
       42   704              0         0           0
       43   720             17         1           3
       44   736              0         0           0
       46   768              0         0           0
       49   816              0         0           0
       51   848              0         0           0
       52   864             14         1           3
       54   896              0         0           0
       57   944             13         1           3
       58   960              0         0           0
       62  1024              4         1           1
       66  1088             15         2           4
       67  1104              0         0           0
       71  1168              0         0           0
       74  1216              0         0           0
       76  1248              0         0           0
       83  1360              3         1           1
       91  1488             11         1           4
       94  1536              0         0           0
      100  1632              5         1           2
      107  1744              0         0           0
      111  1808              9         1           4
      126  2048              4         4           2
      144  2336              7         3           4
      151  2448              0         0           0
      168  2720             15        15          10
      190  3072             28        27          21
      202  3264              0         0           0
      254  4096          36209     36209       36209

    Total            37022     36326       36288

    We can calculate the overall fragmentation from the last line:
    Total 37022 36326 36288
    (37022 - 36326) / 37022 = 1.87%
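
    The same calculation as a small helper (illustrative only; the name is
    made up):

```c
/* Fragmentation from the Total line: the share of allocated objects
 * that are not in use, as a percentage. */
static double frag_percent(unsigned long allocated, unsigned long used)
{
    return 100.0 * (double)(allocated - used) / (double)allocated;
}
```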

    Also, by analysing the objects allocated in every class we know why we
    got such low fragmentation: most of the allocated objects are in class
    254, and a class 254 zspage contains only 1 page. So no fragmentation
    will be introduced by allocating objs in class 254.

    And in the future, we can collect other zsmalloc statistics as needed
    and analyse them.

    Signed-off-by: Ganesh Mahendran
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Currently the underlying allocators of zpool, zsmalloc and zbud, do not
    know who creates them. There is no way for zsmalloc/zbud to find out
    which caller they belong to.

    Now we want to add statistics collection in zsmalloc, and we need to
    name the debugfs dir for each pool created. The way suggested by
    Minchan Kim is to use a name passed by the caller (such as zram) to
    create the zsmalloc pool:

    /sys/kernel/debug/zsmalloc/zram0

    This patch adds an argument `name' to zs_create_pool() and other related
    functions.

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • `struct zram' contains both `struct gendisk' and `struct request_queue'.
    The latter can be deleted, because zram->disk carries a ->queue pointer
    and ->queue carries a zram pointer:

    create_device()
        zram->queue->queuedata = zram
        zram->disk->queue = zram->queue
        zram->disk->private_data = zram

    so zram->queue is not needed; we can access all necessary data anyway.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • An admin could reset zram while I/O is in flight, so we have used
    zram->init_lock as a read-side lock in the I/O path to prevent sudden
    freeing of the zram meta.

    However, the init_lock is really troublesome. We can't call
    zram_meta_alloc under init_lock due to a lockdep splat, because
    zram_rw_page is in the reclaim path and holds it as a read lock while
    other places in process context hold it as a write lock. So we have
    done the allocation outside the lock to avoid the lockdep warning, but
    that's not good for readability and, finally, I met another lockdep
    splat between init_lock and cpu_hotplug from kmem_cache_destroy while
    working on zsmalloc compaction. :(

    Yes, the ideal is to remove the horrible init_lock from zram's rw path.
    This patch removes it there and instead adds an atomic refcount for
    meta lifetime management and a completion to free the meta in process
    context. It's important to free the meta in process context because
    some of the resource destruction needs a mutex lock, which could be
    held if we released the resource in reclaim context, so it's a
    deadlock, again.

    As a bonus, we can remove the init_done check in the rw path because
    zram_meta_get will take over that role.
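
    The shape of that refcount-plus-completion scheme, as a userspace
    sketch with made-up names (meta_get/meta_put/meta_reset); the kernel
    side uses atomics plus a struct completion and sleeps properly instead
    of the polling loop shown here:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct meta {
    atomic_int refcount;        /* starts at 1: the device's own ref */
    atomic_bool released;       /* stands in for a struct completion */
};

/* I/O path: take a reference unless teardown already dropped the
 * last one (this replaces taking init_lock for reading). */
static bool meta_get(struct meta *m)
{
    int old = atomic_load(&m->refcount);

    while (old > 0)
        if (atomic_compare_exchange_weak(&m->refcount, &old, old + 1))
            return true;
    return false;               /* device is being torn down */
}

/* Drop a reference; the last put signals the waiter. */
static void meta_put(struct meta *m)
{
    if (atomic_fetch_sub(&m->refcount, 1) == 1)
        atomic_store(&m->released, true);   /* complete(&done) */
}

/* Reset path, process context: drop the initial ref and wait for the
 * last I/O ref to go away; only then free the meta, where sleeping
 * locks taken by the destructors are allowed. */
static void meta_reset(struct meta *m)
{
    meta_put(m);
    while (!atomic_load(&m->released))
        ;                                   /* wait_for_completion() */
    /* ...free meta resources here... */
}
```

    After meta_reset() returns, meta_get() can no longer succeed, so no
    I/O path can see freed meta.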

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • bd_holders is increased only when a user opens the device file as
    FMODE_EXCL, so if something opens zram0 as !FMODE_EXCL and requests I/O
    while another user resets zram0, we can see the following warning.

    zram0: detected capacity change from 0 to 64424509440
    Buffer I/O error on dev zram0, logical block 180823, lost async page write
    Buffer I/O error on dev zram0, logical block 180824, lost async page write
    Buffer I/O error on dev zram0, logical block 180825, lost async page write
    Buffer I/O error on dev zram0, logical block 180826, lost async page write
    Buffer I/O error on dev zram0, logical block 180827, lost async page write
    Buffer I/O error on dev zram0, logical block 180828, lost async page write
    Buffer I/O error on dev zram0, logical block 180829, lost async page write
    Buffer I/O error on dev zram0, logical block 180830, lost async page write
    Buffer I/O error on dev zram0, logical block 180831, lost async page write
    Buffer I/O error on dev zram0, logical block 180832, lost async page write
    ------------[ cut here ]------------
    WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
    Modules linked in:
    CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x45/0x57
    warn_slowpath_common+0x8a/0xc0
    warn_slowpath_null+0x1a/0x20
    __blkdev_put+0x1d7/0x210
    blkdev_put+0x50/0x130
    blkdev_close+0x25/0x30
    __fput+0xdf/0x1e0
    ____fput+0xe/0x10
    task_work_run+0xa7/0xe0
    do_notify_resume+0x49/0x60
    int_signal+0x12/0x17
    ---[ end trace 274fbbc5664827d2 ]---

    The warning comes from bdev_write_inode in the blkdev_put path.

    static void bdev_write_inode(struct inode *inode)
    {
            spin_lock(&inode->i_lock);
            while (inode->i_state & I_DIRTY) {
                    spin_unlock(&inode->i_lock);
                    WARN_ON_ONCE(write_inode_now(inode, true));
                    spin_lock(&inode->i_lock);
            }
            spin_unlock(&inode->i_lock);
    }

    The reason is that the dd process encounters I/O failures due to the
    sudden block device disappearance, so filemap_check_errors in
    __writeback_single_inode returns -EIO.

    If we check bd_openers instead of bd_holders, we can address the
    problem. brd already uses bd_openers rather than bd_holders, so
    although I'm not an expert on the block layer, it seems to be the
    better choice.

    I can reproduce the warning with the simple script below. In addition,
    I added msleep(2000) below set_capacity(zram->disk, 0) after applying
    your patch to make the window huge (kudos to Ganesh!).

    script:

    echo $((60<<30)) > /sys/block/zram0/disksize
    setsid dd if=/dev/zero of=/dev/zram0 &
    sleep 1
    setsid echo 1 > /sys/block/zram0/reset

    Signed-off-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Ganesh Mahendran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We need to return the set_capacity(disk, 0) call from reset_store()
    back to zram_reset_device(), a catch by Ganesh Mahendran. Potentially,
    we can race set_capacity() calls from the init and reset paths.

    The problem is that zram_reset_device() is also called from
    zram_exit(), which performs operations in a misleadingly reversed
    order: we first create_device() and then init it, while zram_exit()
    performs destroy_device() first and then does zram_reset_device().
    This is done to remove the sysfs group before we reset the device, so
    the device reset/destruction cannot be raced by a sysfs attr write
    (e.g. disksize).

    Apart from that, destroy_device() releases zram->disk (but we still
    have the ->disk pointer), so we cannot access zram->disk in the later
    zram_reset_device() call, which may cause additional errors in the
    future.

    So, this patch reworks and cleans up the destroy path.

    1) remove several unneeded goto labels in zram_init()

    2) factor out zram_init() error path and zram_exit() into
    destroy_devices() function, which takes the number of devices to
    destroy as its argument.

    3) remove sysfs group in destroy_devices() first, so we can reorder
    operations -- reset device (as expected) goes before disk destroy and
    queue cleanup. So we can always access ->disk in zram_reset_device().

    4) and, finally, return set_capacity() back under ->init_lock.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky