07 Aug, 2014

40 commits

  • Use kernel.h definition.

    Signed-off-by: Fabian Frederick
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Complement commit 68aecfb97978 ("lib/string_helpers.c: make arrays
    static") by making the arrays const -- not only pointing to const
    strings. This moves them out of the data section to the r/o data
    section:

    text    data    bss     dec     hex  filename
    1150     176      0    1326     52e  lib/string_helpers.old.o
    1326       0      0    1326     52e  lib/string_helpers.new.o
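
    As a minimal illustration of the difference (hypothetical array names,
    not taken from the patch):

    /* Array of pointers to const strings -- the pointers themselves are
     * writable, so the array lands in the .data section: */
    static const char *units_old[] = { "B", "KiB", "MiB" };

    /* Const array of pointers to const strings -- everything is read-only
     * and ends up in .rodata: */
    static const char *const units_new[] = { "B", "KiB", "MiB" };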

    Signed-off-by: Mathias Krause
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • For modern filesystems such as btrfs, t/p/e size-level operations are
    common. Add t/p/e size-unit parsing to memparse().
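
    A hedged usage sketch of the new suffixes (the wrapper below is
    invented; memparse() itself lives in lib/cmdline.c):

    static unsigned long long parse_size_arg(const char *arg)
    {
            char *end;

            /* "2T" now parses to 2 * 2^40 bytes; previously only the
             * k/m/g suffixes were understood. */
            return memparse(arg, &end);
    }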

    Signed-off-by: Gui Hecheng
    Acked-by: David Rientjes
    Reviewed-by: Satoru Takeuchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gui Hecheng
     
  • The function may be useful for other drivers, so export it. (Suggested
    by Tejun Heo.)

    Note that I inverted the return value of glob_match; returning true on
    match seemed to make more sense.
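
    A brief usage sketch, assuming the pattern-first argument order (the
    helper and the quirk pattern are invented):

    /* With the inverted return value, glob_match() returns true on match. */
    static bool model_is_blacklisted(const char *model)
    {
            return glob_match("WDC*", model);
    }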

    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • This was useful during development, and is retained for future
    regression testing.

    GCC appears to have no way to place string literals in a particular
    section; adding __initconst to a char pointer leaves the string itself
    in the default string section, where it will not be thrown away after
    module load.

    Thus all string constants are kept in explicitly declared and named
    arrays. Sorry this makes printk a bit harder to read. At least the
    tests are more compact.
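
    A hedged sketch of the pattern described (identifiers invented for
    illustration):

    /* A bare string literal can't be given a section, but a named array
     * initialized with it can, so it is discarded after init: */
    static const char test_pat[] __initconst = "a*[bc]d?";
    static const char test_str[] __initconst = "axxbdy";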

    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • This is a helper function from drivers/ata/libata_core.c, where it is
    used to blacklist particular device models. It's being moved to lib/ so
    other drivers may use it for the same purpose.

    This implementation is non-recursive, so it is safe for the kernel stack.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • Clean up unused `if 0'-ed functions, which have been dead since 2006
    (commits 87c2ce3b9305 ("lib/zlib*: cleanups") by Adrian Bunk and
    4f3865fb57a0 ("zlib_inflate: Upgrade library code to a recent version")
    by Richard Purdie):

    - zlib_deflateSetDictionary
    - zlib_deflateParams
    - zlib_deflateCopy
    - zlib_inflateSync
    - zlib_syncsearch
    - zlib_inflateSetDictionary
    - zlib_inflatePrime

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The name was modified from hlist_add_after() to hlist_add_behind() when
    adjusting the order of arguments to match that of klist_add_after().
    This is necessary so that old code using the previous argument order
    fails to compile instead of silently misbehaving.

    Make klist follow this naming scheme for consistency.

    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.
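
    For illustration, assuming the calling convention described above (the
    wrapper and node names are hypothetical):

    static void insert_after(struct hlist_node *pos, struct hlist_node *new_node)
    {
            /* Old: hlist_add_after(pos, new_node) -- position first.
             * New: the new node comes first, like other list add helpers. */
            hlist_add_behind(new_node, pos);
    }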

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • The argument names for hlist_add_after() are poorly chosen because they
    look the same as the ones for hlist_add_before() but have to be used
    differently.

    hlist_add_after_rcu() has made a better choice.

    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • Fix coccinelle warnings.

    Signed-off-by: Neil Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Zhang
     
  • We need interrupts disabled when calling console_trylock_for_printk()
    only so that the cpu id we pass to can_use_console() remains valid (for
    other things console_sem provides all the exclusion we need and
    deadlocks on console_sem due to interrupts are impossible because we use
    down_trylock()). However if we are rescheduled, we are guaranteed to
    run on an online cpu so we can easily just get the cpu id in
    can_use_console().

    We can lose a bit of performance when we enable interrupts in
    vprintk_emit() and then disable them again in console_unlock() but OTOH
    it can somewhat reduce interrupt latency caused by console_unlock().

    We differ from (reverted) commit 939f04bec1a4 in that we avoid calling
    console_unlock() from vprintk_emit() with lockdep enabled as that has
    unveiled quite some bugs leading to system freezes during boot (e.g.
    https://lkml.org/lkml/2014/5/30/242,
    https://lkml.org/lkml/2014/6/28/521).

    Signed-off-by: Jan Kara
    Tested-by: Andreas Bombe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Some small cleanups to kernel/printk/printk.c. None of them should
    cause any change in behavior.

    - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
    - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
    definition; delete it.
    - Pull an assignment out of a conditional expression in console_setup().
    - Use isdigit() in console_setup() rather than open coding it.
    - In update_console_cmdline(), drop a NUL-termination assignment;
    the strlcpy() call that precedes it guarantees it's not needed.
    - Simplify some logic in printk_timed_ratelimit().
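
    Two of these changes, sketched side by side in a simplified stand-in
    for console_setup() (not the actual code):

    static void parse_console_opt(char *str)
    {
            char *options;

            /* assignment pulled out of the conditional expression */
            options = strchr(str, ',');
            if (options)
                    *(options++) = '\0';

            /* isdigit() instead of an open-coded '0'..'9' comparison */
            if (isdigit(str[0]))
                    pr_info("console index given\n");
    }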

    Signed-off-by: Alex Elder
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
    global values.
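
    For example (CONFIG_PRINTK is real; the variable name is illustrative):

    /* Instead of:
     *     #ifdef CONFIG_PRINTK
     *     static int printk_enabled = 1;
     *     #else
     *     static int printk_enabled;
     *     #endif
     * the value can be set in plain C:
     */
    static int printk_enabled = IS_ENABLED(CONFIG_PRINTK);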

    Signed-off-by: Alex Elder
    Acked-by: Borislav Petkov
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Fix a few comments that don't accurately describe their corresponding
    code, and fix some minor typographical errors.

    Signed-off-by: Alex Elder
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Commit a8fe19ebfbfd ("kernel/printk: use symbolic defines for console
    loglevels") makes consistent use of symbolic values for printk() log
    levels.

    The naming scheme used is different from the one used for
    DEFAULT_MESSAGE_LOGLEVEL though. Change that symbol name to be
    MESSAGE_LOGLEVEL_DEFAULT for consistency. And because the value of that
    symbol comes from a similarly-named config option, rename
    CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.

    Signed-off-by: Alex Elder
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Petr Mladek
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
    only needs to know whether there's any data available to read (and not
    its size). These callers only check for non-zero return. As a
    shortcut, do_syslog() returns the difference between what has been
    logged and what has been "seen."

    The comments say that the "count of records" should be returned but it's
    not. Instead it returns (log_next_idx - syslog_idx), which is a
    difference between buffer offsets--and the result could be negative.

    The behavior is the same (it'll be zero or not in the same cases), but
    the count of records is more meaningful and it matches what the comments
    say. So change the code to return that.

    Signed-off-by: Alex Elder
    Cc: Petr Mladek
    Cc: Jan Kara
    Cc: Joe Perches
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • The default size of the ring buffer is too small for machines with a
    large number of CPUs under heavy load. What ends up happening when
    debugging is that the ring buffer wraps around and chews up old
    messages, making debugging impossible unless the size is passed as a
    kernel parameter. An idle system upon boot up will on average spew out
    only about one or two extra lines, but where this really matters is
    under heavy load, and that will vary widely depending on the system and
    environment.

    There are mechanisms to help increase the kernel ring buffer for tracing
    through debugfs, and those interfaces even allow growing the kernel ring
    buffer per CPU. We also have a static value which can be passed upon
    boot. Relying on debugfs, however, is not ideal for production, and the
    value passed upon bootup can only be used *after* an issue has crept
    up. Instead of being reactive, this adds a proactive
    measure which lets you scale the amount of contributions you'd expect to
    the kernel ring buffer under load by each CPU in the worst case
    scenario.

    We use num_possible_cpus() to avoid the complexities that dynamically
    changing the ring buffer size at run time would introduce;
    num_possible_cpus() gives us the upper limit on the possible number of
    CPUs, so we avoid having to deal with hotplugging CPUs on and off.
    This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
    which is used to specify the maximum amount of contributions to the
    kernel ring buffer in the worst case before the kernel ring buffer flips
    over; the size is specified as a power of 2. The total amount of
    contributions made by the other CPUs must be greater than half of the
    default kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) to trigger
    an increase upon bootup. The kernel ring buffer is increased to the
    next power of two that would fit the required minimum kernel ring buffer
    size plus the additional CPU contribution. For example if LOG_BUF_SHIFT
    is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
    in order to trigger an increase of the kernel ring buffer. With a
    LOG_CPU_MAX_BUF_SHIFT of 12 (4 KB) you'd require over 64 possible CPUs
    to trigger an increase. If you had 128 possible CPUs the minimum
    required kernel ring buffer size bumps to:

    ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB

    Since we require the ring buffer to be a power of two the new required
    size would be 1024 KB.

    These CPU contributions are ignored when the "log_buf_len" kernel
    parameter is used as it forces the exact size of the ring buffer to an
    expected power of two value.
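
    A standalone worked sketch of the sizing rule above (plain userspace C
    with a local roundup helper, not the in-kernel code):

    #include <stdio.h>

    static unsigned long roundup_pow_of_two(unsigned long x)
    {
            unsigned long r = 1;

            while (r < x)
                    r <<= 1;
            return r;
    }

    int main(void)
    {
            unsigned long log_buf_len = 1UL << 18;  /* LOG_BUF_SHIFT = 18         */
            unsigned long per_cpu     = 1UL << 12;  /* LOG_CPU_MAX_BUF_SHIFT = 12 */
            unsigned long cpus        = 128;
            unsigned long cpu_extra   = (cpus - 1) * per_cpu;

            /* Grow only when the CPU contribution exceeds half the default. */
            if (cpu_extra > log_buf_len / 2)
                    log_buf_len = roundup_pow_of_two(log_buf_len + cpu_extra);

            printf("%lu KB\n", log_buf_len / 1024); /* prints "1024 KB" */
            return 0;
    }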

    [pmladek@suse.cz: fix build]
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Petr Mladek
    Tested-by: Davidlohr Bueso
    Tested-by: Petr Mladek
    Reviewed-by: Davidlohr Bueso
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Signed-off-by: Luis R. Rodriguez
    Suggested-by: Davidlohr Bueso
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • In practice, sizing the kernel ring buffer as a power of 2 is purely
    historical and not a requirement, especially now that we have LOG_ALIGN
    and use it for both static and dynamic allocations. It could have
    helped with implicit alignment back in the day: even the dynamically
    sized ring buffer was guaranteed to be aligned so long as
    CONFIG_LOG_BUF_SHIFT was set to produce an architecture-aligned
    __LOG_BUF_LEN, since log_buf_len=n would be allowed only if it was
    > __LOG_BUF_LEN and we always ended up rounding log_buf_len=n up to the
    next power of 2 with roundup_pow_of_two(), so any such multiple of 2
    should also be architecture aligned. These assumptions of course relied
    heavily on CONFIG_LOG_BUF_SHIFT producing an aligned value, but users
    can always change this.

    We now have precise alignment requirements set for the log buffer size
    for both static and dynamic allocations, but let's keep the old
    practice of using powers of 2 for its size, which gives easily
    predictable, scalable values and helps the allocators for dynamic
    allocations. We'll reuse this later, so move it into a helper.

    Signed-off-by: Luis R. Rodriguez
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • We have to consider alignment for the ring buffer both for the default
    static size and for the dynamic allocation made when the log_buf_len=n
    kernel parameter is passed to set the size to something larger than the
    default set by the architecture through CONFIG_LOG_BUF_SHIFT.

    The default static kernel ring buffer can be aligned properly if
    architectures set CONFIG_LOG_BUF_SHIFT properly; however, since we
    provide ranges for the size, even a sensible, aligned
    CONFIG_LOG_BUF_SHIFT value can be reduced to a non-aligned one. Commit
    6ebb017de9 ("printk: Fix alignment of buf causing crash on ARM EABI")
    by Andrew Lunn ensures the static buffer is always aligned, with the
    alignment decided by the compiler using __alignof__(struct log).

    When log_buf_len=n is used we allocate the ring buffer dynamically.
    Dynamic allocation varies: for the early allocation called before
    setup_arch(), memblock_virt_alloc() requests page alignment, while for
    the default kernel allocation memblock_virt_alloc_nopanic() requests no
    special alignment, which in turn ends up aligning the allocation to
    SMP_CACHE_BYTES, i.e. L1 cache aligned.

    Since we already know the required alignment for the kernel ring
    buffer, we can do better and explicitly request LOG_ALIGN alignment.
    This patch does that, to be safe and to make the dynamic allocation
    alignment explicit.

    Signed-off-by: Luis R. Rodriguez
    Tested-by: Petr Mladek
    Acked-by: Petr Mladek
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Signed-off-by: Geoff Levand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoff Levand
     
  • The DEFINE_SIMPLE_ATTRIBUTE macro should not end in a ';'. Fix the one
    use in the kernel tree that did not have a semicolon.
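
    For example (foo_get/foo_set are placeholders):

    /* Each use must now supply its own trailing semicolon: */
    DEFINE_SIMPLE_ATTRIBUTE(fops_foo, foo_get, foo_set, "%llu\n");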

    Signed-off-by: Joe Perches
    Acked-by: Guenter Roeck
    Acked-by: Luca Tettamanti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • We have been chasing a memory corruption bug, which turned out to be
    caused by very old gcc (4.3.4), which happily turned conditional load
    into a non-conditional one, and that broke correctness (the condition
    was met only if lock was held) and corrupted memory.

    This particular problem with that particular code did not happen when
    newer gccs were used. I've brought this up with our gcc folks, as I
    wanted to make sure that this can't really happen again, and it turns
    out it actually can.

    Quoting Martin Jambor :
    "More current GCCs are more careful when it comes to replacing a
    conditional load with a non-conditional one, most notably they check
    that a store happens in each iteration of _a_ loop but they assume
    loops are executed. They also perform a simple check whether the
    store cannot trap which currently passes only for non-const
    variables. A simple testcase demonstrating it on an x86_64 is for
    example the following:

    $ cat cond_store.c

    /* headers needed for mprotect(), error() and errno */
    #include <sys/mman.h>
    #include <errno.h>
    #include <error.h>

    int g_1 = 1;

    int g_2[1024] __attribute__((section ("safe_section"), aligned (4096)));

    int c = 4;

    int __attribute__ ((noinline))
    foo (void)
    {
      int l;
      for (l = 0; (l != 4); l++) {
        if (g_1)
          return l;
        for (g_2[0] = 0; (g_2[0] >= 26); ++g_2[0])
          ;
      }
      return 2;
    }

    int main (int argc, char* argv[])
    {
      if (mprotect (g_2, sizeof(g_2), PROT_READ) == -1)
        {
          int e = errno;
          error (e, e, "mprotect error %i", e);
        }
      foo ();
      __builtin_printf("OK\n");
      return 0;
    }
    /* EOF */
    $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=0
    $ ./a.out
    OK
    $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=1
    $ ./a.out
    Segmentation fault

    The testcase fails the same at least with 4.9, 4.8 and 4.7. Therefore
    I would suggest building kernels with this parameter set to zero. I
    also agree with Jikos that the default should be changed for -O2. I
    have run most of the SPEC 2k6 CPU benchmarks (gamess and dealII
    failed, at -O2, not sure why) compiled with and without this option
    and did not see any real difference between respective run-times"

    Hopefully the default will be changed in newer gccs, but let's force it
    for kernel builds so that we are on the safe side even when older gccs
    are used.

    The code in question was an out-of-tree printk-in-NMI (yeah, surprise
    surprise, once again) patch written by Petr Mladek; let me quote his
    comment from our internal bugzilla:

    "I have spent few days investigating inconsistent state of kernel ring buffer.
    It went out that it was caused by speculative store generated by
    gcc-4.3.4.

    The problem is in assembly generated for make_free_space(). The functions is
    called the following way:

    + vprintk_emit();
    + log = MAIN_LOG; // with logbuf_lock
    or
    log = NMI_LOG; // with nmi_logbuf_lock
    cont_add(log, ...);
    + cont_flush(log, ...);
    + log_store(log, ...);
    + log_make_free_space(log, ...);

    If called with log = NMI_LOG then only the nmi_log_* global variables
    are safe to modify, but the generated code also stores into the
    (main_)log_* global variables:

    :
    55 push %rbp
    89 f6 mov %esi,%esi

    48 8b 05 03 99 51 01 mov 0x1519903(%rip),%rax # ffffffff82620868
    44 8b 1d ec 98 51 01 mov 0x15198ec(%rip),%r11d # ffffffff82620858
    8b 35 36 60 14 01 mov 0x1146036(%rip),%esi # ffffffff8224cfa8
    44 8b 35 33 60 14 01 mov 0x1146033(%rip),%r14d # ffffffff8224cfac
    4c 8b 2d d0 98 51 01 mov 0x15198d0(%rip),%r13 # ffffffff82620850
    4c 8b 25 11 61 14 01 mov 0x1146111(%rip),%r12 # ffffffff8224d098
    49 89 c2 mov %rax,%r10
    48 21 c2 and %rax,%rdx
    48 8b 1d 0c 99 55 01 mov 0x155990c(%rip),%rbx # ffffffff826608a0
    49 c1 ea 20 shr $0x20,%r10
    48 89 55 d0 mov %rdx,-0x30(%rbp)
    44 29 de sub %r11d,%esi
    45 29 d6 sub %r10d,%r14d
    4c 8b 0d 97 98 51 01 mov 0x1519897(%rip),%r9 # ffffffff82620840
    eb 7e jmp ffffffff81107029
    [...]
    85 ff test %edi,%edi # edi = 1 for NMI_LOG
    4c 89 e8 mov %r13,%rax
    4c 89 ca mov %r9,%rdx
    74 0a je ffffffff8110703d
    8b 15 27 98 51 01 mov 0x1519827(%rip),%edx # ffffffff82620860
    48 8b 45 d0 mov -0x30(%rbp),%rax
    48 39 c2 cmp %rax,%rdx # end of loop
    0f 84 da 00 00 00 je ffffffff81107120
    [...]
    85 ff test %edi,%edi # edi = 1 for NMI_LOG
    4c 89 0d 17 97 51 01 mov %r9,0x1519717(%rip) # ffffffff82620840
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    KABOOOM
    74 35 je ffffffff81107160

    It stores log_first_seq when edi == NMI_LOG. These instructions are
    also used when edi == MAIN_LOG, but the store is done speculatively
    before the condition is decided. It is unsafe because we do not have
    "logbuf_lock" in NMI context and some other process might modify
    "log_first_seq" in parallel"

    I believe that the best course of action is both:

    - building the kernel (and anything multi-threaded, I guess) with that
      optimization turned off
    - persuading the gcc folks to change the default for future releases

    Signed-off-by: Jiri Kosina
    Cc: Martin Jambor
    Cc: Petr Mladek
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Marek Polacek
    Cc: Jakub Jelinek
    Cc: Steven Noonan
    Cc: Richard Biener
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • Change zswap to use the zpool api instead of directly using zbud. Add a
    boot-time param to allow selecting which zpool implementation to use,
    with zbud as the default.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Update zbud and zsmalloc to implement the zpool api.

    [fengguang.wu@intel.com: make functions static]
    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add zpool api.

    zpool provides an interface for memory storage, typically of compressed
    memory. Users can select what backend to use; currently the only
    implementations are zbud, a low density implementation with up to two
    compressed pages per storage page, and zsmalloc, a higher density
    implementation with multiple compressed pages per storage page.
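
    A rough consumer sketch based on the description above (signatures
    approximated; consult include/linux/zpool.h for the real API):

    static int zpool_store_example(const void *src, size_t len)
    {
            struct zpool *pool;
            unsigned long handle;
            void *dst;

            pool = zpool_create_pool("zbud", GFP_KERNEL, NULL);
            if (!pool)
                    return -ENOMEM;

            if (zpool_malloc(pool, len, GFP_KERNEL, &handle)) {
                    zpool_destroy_pool(pool);
                    return -ENOMEM;
            }

            dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);
            memcpy(dst, src, len);
            zpool_unmap_handle(pool, handle);

            zpool_free(pool, handle);
            zpool_destroy_pool(pool);
            return 0;
    }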

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Change the type of the zbud_alloc() size param from unsigned int to
    size_t.

    Technically, this should not make any difference, as the zbud
    implementation already restricts the size to well within either type's
    limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
    size_t, this brings the size parameter type in line with zsmalloc/zpool.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Currently, we use a rwlock tb_lock to protect concurrent access to the
    whole zram meta table. However, according to the actual access model,
    there is only a small chance for an upper-layer user to access the same
    table[index], so the current lock granularity is too big.

    The idea of optimization is to change the lock granularity from whole
    meta table to per table entry (table -> table[index]), so that we can
    protect concurrent access to the same table[index], meanwhile allow the
    maximum concurrency.

    With this in mind, several kinds of locks which could be used as a
    per-entry lock were tested and compared:

    Test environment:
    x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
    kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

    iozone test:
    iozone -t 4 -R -r 16K -s 200M -I +Z
    (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)

    Test            base       CAS        spinlock   rwlock     bit_spinlock
    -------------------------------------------------------------------------
    Initial write   1381094    1425435    1422860    1423075    1421521
    Rewrite         1529479    1641199    1668762    1672855    1654910
    Read            8468009    11324979   11305569   11117273   10997202
    Re-read         8467476    11260914   11248059   11145336   10906486
    Reverse Read    6821393    8106334    8282174    8279195    8109186
    Stride read     7191093    8994306    9153982    8961224    9004434
    Random read     7156353    8957932    9167098    8980465    8940476
    Mixed workload  4172747    5680814    5927825    5489578    5972253
    Random write    1483044    1605588    1594329    1600453    1596010
    Pwrite          1276644    1303108    1311612    1314228    1300960
    Pread           4324337    4632869    4618386    4457870    4500166

    To increase the chance of accessing the same table[index] concurrently,
    set zram to a small disksize (10MB) and let the threads run with a
    large loop count.

    fio test:
    fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
    --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
    --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
    --name=seq-read --rw=read --stonewall --name=seq-readwrite
    --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
    (10MB zram raw block device, take the average of 10 tests, KB/s)

    Test        base       CAS        spinlock   rwlock     bit_spinlock
    ---------------------------------------------------------------------
    seq-write    933789     999357    1003298     995961    1001958
    seq-read    5634130    6577930    6380861    6243912    6230006
    seq-rw      1405687    1638117    1640256    1633903    1634459
    rand-rw     1386119    1614664    1617211    1609267    1612471

    All the optimization methods show higher performance than the base;
    however, it is hard to say which method is the most appropriate.

    On the other hand, zram is mostly used on small embedded systems, so we
    don't want to increase the memory footprint.

    This patch picks the bit_spinlock method, packing the object size and
    page flags into an unsigned long table.value so as not to increase the
    memory overhead on either 32-bit or 64-bit systems.

    Finally, even though the different kinds of locks perform differently,
    we can ignore the difference: if zram is used as a swap device, the
    swap subsystem prevents concurrent access to the same swap slot; if
    zram is used as a block device with a filesystem on it, the upper
    filesystem and the page cache mostly prevent concurrent access to the
    same block. So we can ignore the performance differences among the
    locks.
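
    A simplified sketch of the chosen approach (constants and struct layout
    are illustrative, not the exact zram definitions):

    #include <linux/bit_spinlock.h>

    #define ENTRY_SIZE_BITS  24                /* low bits hold the object size */
    #define ENTRY_LOCK_BIT   ENTRY_SIZE_BITS   /* one flag bit doubles as a lock */

    struct table_entry {
            unsigned long handle;
            unsigned long value;               /* size | flags, incl. lock bit */
    };

    static void entry_lock(struct table_entry *e)
    {
            bit_spin_lock(ENTRY_LOCK_BIT, &e->value);
    }

    static void entry_unlock(struct table_entry *e)
    {
            bit_spin_unlock(ENTRY_LOCK_BIT, &e->value);
    }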

    Acked-by: Sergey Senozhatsky
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Weijie Yang
    Signed-off-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Some architectures (e.g., hexagon and PowerPC) could use a PAGE_SHIFT
    of 16 or more. In these cases u16 is not sufficiently large to
    represent a compressed page's size, so use size_t.

    Signed-off-by: Minchan Kim
    Reported-by: Weijie Yang
    Acked-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Drop SECTOR_SIZE define, because it's not used.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Andrew Morton has recently noted that `struct table' actually
    represents a table entry and, thus, should be renamed. Rename it to
    `zram_table_entry'.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • User-visible effect:
    Architectures that choose this method of maintaining cache coherency
    (MIPS and xtensa currently) are able to use high memory on cores with
    aliasing data cache. Without this fix such architectures can not use
    high memory (in case of xtensa it means that at most 128 MBytes of
    physical memory is available).

    The problem:
    VIPT cache with way size larger than MMU page size may suffer from
    aliasing problem: a single physical address accessed via different
    virtual addresses may end up in multiple locations in the cache.
    Virtual mappings of a physical address that always get cached in
    different cache locations are said to have different colors. L1 caching
    hardware usually doesn't handle this situation leaving it up to
    software. Software must avoid this situation as it leads to data
    corruption.

    What can be done:
    One way to handle this is to flush and invalidate data cache every time
    page mapping changes color. The other way is to always map physical
    page at a virtual address with the same color. Low memory pages already
    have this property. Giving architecture a way to control color of high
    memory page mapping allows reusing of existing low memory cache alias
    handling code.

    How this is done with this patch:
    Provide hooks that allow architectures with aliasing cache to align
    mapping address of high pages according to their color. Such
    architectures may enforce similar coloring of low- and high-memory page
    mappings and reuse existing cache management functions to support
    highmem.

    This code is based on the implementation of similar feature for MIPS by
    Leonid Yegoshin.
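
    As a generic illustration of what a "color" is (the numbers and names
    below are examples, not the patch's hook API):

    #define WAY_SIZE   (32 * 1024)              /* e.g. a 32 KB cache way */
    #define PAGE_SZ    4096
    #define NR_COLORS  (WAY_SIZE / PAGE_SZ)     /* 8 possible colors      */

    /* Two virtual addresses alias in the cache only if they have the same
     * color, i.e. the same cache index bits above the page offset. */
    static unsigned long page_color(unsigned long vaddr)
    {
            return (vaddr / PAGE_SZ) % NR_COLORS;
    }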

    Signed-off-by: Max Filippov
    Cc: Leonid Yegoshin
    Cc: Chris Zankel
    Cc: Marc Gauthier
    Cc: David Rientjes
    Cc: Steven Hill
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected by
    several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't
    expect the release callback function to remove itself from the list.
    Therefore, the callback function in the kernel module can't release the
    mmu_notifier_object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback, by adding a new
    mmu_notifier_call_srcu function that receives a function ptr and calls
    that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to first call a new function called
    mmu_notifier_unregister_no_release that, as its name implies,
    unregisters a notifier without calling its notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.
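
    A hedged sketch of how a module might use the two new helpers (the
    struct layout and callback names are invented):

    struct my_object {
            struct mmu_notifier mn;
            struct rcu_head rcu;
    };

    static void my_object_free(struct rcu_head *rcu)
    {
            kfree(container_of(rcu, struct my_object, rcu));
    }

    static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
            struct my_object *obj = container_of(mn, struct my_object, mn);

            /* Unregister without re-invoking ->release(), then defer the
             * actual free until after the SRCU grace period. */
            mmu_notifier_unregister_no_release(&obj->mn, mm);
            mmu_notifier_call_srcu(&obj->rcu, my_object_free);
    }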

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • __kmap_atomic_idx is a per-CPU variable. Each CPU can use KM_TYPE_NR
    entries from FIXMAP, i.e. from 0 to KM_TYPE_NR - 1. Allowing
    __kmap_atomic_idx to overshoot to KM_TYPE_NR can mess up the next CPU's
    0th entry, which is a bug. Hence BUG_ON if __kmap_atomic_idx >=
    KM_TYPE_NR.

    Fix the off-by-one in this test.

    Signed-off-by: Chintan Pandya
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chintan Pandya
     
  • Charge reclaim and OOM currently use the charge batch variable, but
    batching is already disabled at that point. To simplify the charge
    logic, the batch variable is reset to the original request size when
    reclaim is entered, so it's functionally equivalent, but misleading.

    Switch reclaim/OOM to nr_pages, which is the original request size.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The rarely-executed memory-allocation-failed callback path generates a
    WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably
    it's supposed to warn on failures.

    Signed-off-by: Sasha Levin
    Cc: Christoph Lameter
    Cc: Gilad Ben-Yossef
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
    should report that the VMA's virtual pages are soft-dirty until
    VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
    /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag
    for virtual addresses that fall in PTE holes (i.e., virtual addresses
    that don't have a PMD, PUD, or PGD allocated yet).

    To observe this bug, use mmap to create a VMA large enough such that
    there's a good chance that the VMA will occupy an unused PMD, then test
    the soft-dirty bit on its pages. In practice, I found that a VMA that
    covered a PMD's worth of address space was big enough.

    This patch adds the necessary VMA lookup to the PTE hole callback in
    /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
    VM_SOFTDIRTY flag.
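
    A userspace sketch of the observation described above (pagemap bit 55
    is the soft-dirty bit; error handling trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            /* Map enough to make it likely the VMA covers an unused PMD. */
            char *p = mmap(NULL, 1024 * psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            int fd = open("/proc/self/pagemap", O_RDONLY);
            uint64_t entry = 0;

            pread(fd, &entry, sizeof(entry),
                  ((uintptr_t)p / psz) * sizeof(entry));
            printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
            return 0;
    }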

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov