09 May, 2017

40 commits

  • As we are reusing crashkernel parameter instead of fadump_reserve_mem
    parameter to specify the memory to reserve for fadump's crash kernel,
    update the documentation accordingly.

    Link: http://lkml.kernel.org/r/149035347559.6881.14224829694291758581.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Acked-by: Michael Ellerman
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Mahesh Salgaonkar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     
  • fadump supports specifying the memory to reserve for fadump's crash
    kernel with the fadump_reserve_mem kernel parameter. This parameter
    currently supports passing only a fixed memory size, as in
    fadump_reserve_mem=<size>. This patch aims to add support for other
    syntaxes, like the range-based memory size
    <range1>:<size1>[,<range2>:<size2>,<range3>:<size3>,...], which allows
    using the same parameter to boot the kernel with different system RAM
    sizes.
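
    For illustration (the sizes here are made up), a range-based
    specification of the kind documented for the crashkernel parameter
    looks like:

    crashkernel=512M-2G:64M,2G-:128M

    i.e. reserve 64M when system RAM is between 512M and 2G, and 128M when
    it is 2G or more.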

    As the crashkernel parameter already supports the above-mentioned
    syntaxes, this patch deprecates the fadump_reserve_mem parameter and
    reuses the crashkernel parameter instead to specify memory for fadump's
    crash kernel reservation as well. If an offset is provided in the
    crashkernel parameter, it is ignored in the fadump case, as fadump
    reserves memory at the end of RAM.

    The advantages of using the crashkernel parameter instead of the
    fadump_reserve_mem parameter are one less kernel parameter overall, code
    reuse, and support for multiple syntaxes to specify memory.

    Suggested-by: Dave Young
    Link: http://lkml.kernel.org/r/149035346749.6881.911095631212975718.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Reviewed-by: Mahesh Salgaonkar
    Acked-by: Michael Ellerman
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     
  • Now that crashkernel parameter parsing and vmcoreinfo-related code has
    been moved under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE, remove
    the dependency on CONFIG_KEXEC for CONFIG_FA_DUMP. While here, get rid of
    the definitions of the fadump_append_elf_note() & fadump_final_note()
    functions and reuse the similar functions compiled under
    CONFIG_CRASH_CORE.

    Link: http://lkml.kernel.org/r/149035343956.6881.1536459326017709354.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Reviewed-by: Mahesh Salgaonkar
    Acked-by: Michael Ellerman
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     
  • Get rid of multiple definitions of the append_elf_note() & final_note()
    functions. Reuse these functions compiled under CONFIG_CRASH_CORE. Also,
    define Elf_Word and use it instead of the generic u32 or the more
    specific Elf64_Word.

    Link: http://lkml.kernel.org/r/149035342324.6881.11667840929850361402.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Acked-by: Dave Young
    Acked-by: Tony Luck
    Cc: Fenghua Yu
    Cc: Eric Biederman
    Cc: Mahesh Salgaonkar
    Cc: Vivek Goyal
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     
  • Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
    reuse crashkernel parameter for fadump", v4.

    Traditionally, kdump is used to save vmcore in case of a crash. Some
    architectures like powerpc can save vmcore using architecture specific
    support instead of kexec/kdump mechanism. Such architecture specific
    support also needs to reserve memory, to be used by dump capture kernel.
    The crashkernel parameter can be reused, for memory reservation, by such
    architecture-specific infrastructure.

    This patchset removes the dependency on CONFIG_KEXEC for the crashkernel
    parameter and vmcoreinfo-related code, as they can be reused without
    kexec support. Also, the crashkernel parameter is reused instead of
    fadump_reserve_mem to reserve memory for fadump.

    The first patch moves crashkernel parameter parsing and vmcoreinfo
    related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE. The
    second patch reuses the definitions of append_elf_note() & final_note()
    functions under CONFIG_CRASH_CORE in IA64 arch code. The third patch
    removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
    in powerpc. The next patch reuses the crashkernel parameter for reserving
    memory for fadump, instead of the fadump_reserve_mem parameter. This has
    the advantage of making all the syntaxes the crashkernel parameter
    supports available for fadump as well. The last patch updates the fadump
    kernel documentation about the use of the crashkernel parameter.

    This patch (of 5):

    Traditionally, kdump is used to save vmcore in case of a crash. Some
    architectures like powerpc can save vmcore using architecture specific
    support instead of kexec/kdump mechanism. Such architecture specific
    support also needs to reserve memory, to be used by dump capture kernel.
    The crashkernel parameter can be reused, for memory reservation, by such
    architecture-specific infrastructure.

    But currently, code related to vmcoreinfo and parsing of crashkernel
    parameter is built under CONFIG_KEXEC_CORE. This patch introduces
    CONFIG_CRASH_CORE and moves the above mentioned code under this config,
    allowing code reuse without dependency on CONFIG_KEXEC. There is no
    functional change with this patch.

    Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Acked-by: Dave Young
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Eric Biederman
    Cc: Mahesh Salgaonkar
    Cc: Vivek Goyal
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     
  • Bit searching functions accept "unsigned long" indices, but
    "nr_cpumask_bits" is "int", which is signed, so unavoidable sign
    extensions occur on x86_64. Those MOVSX instructions are the #1 source
    of MOVSX bloat by number of uses across the whole kernel.

    Change "nr_cpumask_bits" to unsigned, this number can't be negative
    after all. It allows to do implicit zero-extension on x86_64 without
    MOVSX.
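
    The effect can be seen in a reduced sketch (illustrative only, not the
    actual kernel change):

    unsigned long find_first_bit(const unsigned long *addr, unsigned long size);

    static int nr_bits_signed = 64;            /* old: signed limit         */
    static unsigned int nr_bits_unsigned = 64; /* new: unsigned, same range */

    unsigned long demo(const unsigned long *mask)
    {
            /* the signed limit needs a MOVSX (sign extension) on x86_64
             * before the 64-bit size argument can be formed; the unsigned
             * one is zero-extended for free by the ordinary 32-bit move */
            return find_first_bit(mask, nr_bits_signed) +
                   find_first_bit(mask, nr_bits_unsigned);
    }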

    Change signed comparisons into unsigned comparisons where necessary.

    Other uses look fine because the value is either an argument passed to a
    function or the comparison is already unsigned.

    Net win on allyesconfig type of kernel: ~2.8 KB (!)

    add/remove: 0/0 grow/shrink: 8/725 up/down: 93/-2926 (-2833)
    function old new delta
    xen_exit_mmap 691 735 +44
    qstat_read 426 440 +14
    __cpufreq_cooling_register 1678 1687 +9
    trace_rb_cpu_prepare 447 455 +8
    vermagic 54 60 +6
    nfp_driver_version 54 60 +6
    rcu_torture_stats_print 1147 1151 +4
    find_next_push_cpu 267 269 +2
    xen_irq_resume 961 960 -1
    ...
    init_vp_index 946 906 -40
    od_set_powersave_bias 328 281 -47
    power_cpu_exit 193 139 -54
    arch_show_interrupts 3538 3484 -54
    select_idle_sibling 1558 1471 -87
    Total: Before=158358910, After=158356077, chg -0.00%

    The same arguments apply to "nr_cpu_ids", but I haven't yet found enough
    courage to delve into this issue (and a proper fix may require a new type
    "cpu_t", which is a whole separate story).

    Link: http://lkml.kernel.org/r/20170309205322.GA1728@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Rusty Russell
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • With virtually mapped stacks, kernel stacks are allocated via vmalloc.

    In the current implementation, two stacks per CPU can be cached when
    tasks are freed, and the cached stacks are used again in task
    duplication. But the cached stacks may remain unfreed even when the CPU
    is offline. By adding a CPU hotplug callback to free the cached stacks
    when a CPU goes offline, the pages of the cached stacks are not wasted.
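
    A minimal sketch of such a hotplug callback registration (state, names
    and callback body here are illustrative, not necessarily what the patch
    uses):

    static int free_vm_stack_cache(unsigned int cpu)
    {
            /* vfree() this CPU's cached stacks and clear the cache slots */
            return 0;
    }

    /* from init code: the teardown callback runs when a CPU goes offline */
    cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "fork:vm_stack_cache",
                      NULL, free_vm_stack_cache);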

    Link: http://lkml.kernel.org/r/1487076043-17802-1-git-send-email-hoeun.ryu@gmail.com
    Signed-off-by: Hoeun Ryu
    Reviewed-by: Thomas Gleixner
    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Mateusz Guzik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hoeun Ryu
     
  • Prepare to mark sensitive kernel structures for randomization by making
    sure they're using designated initializers. These were identified
    during allyesconfig builds of x86, arm, and arm64, with most initializer
    fixes extracted from grsecurity.
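
    A minimal before/after sketch (hypothetical struct and handlers):

    struct foo_ops {
            int (*open)(void);
            int (*close)(void);
    };

    static int foo_open(void)  { return 0; }
    static int foo_close(void) { return 0; }

    /* positional: silently breaks if the field order ever changes */
    static const struct foo_ops a = { foo_open, foo_close };

    /* designated: robust to reordering, hence needed for randomization */
    static const struct foo_ops b = { .open = foo_open, .close = foo_close };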

    Link: http://lkml.kernel.org/r/20170329210419.GA40066@beast
    Signed-off-by: Kees Cook
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The current SUSPECT_CODE_INDENT test does not recognize several
    code style defects where code following a logical test is
    inappropriately indented.

    Before this patch, for code like:

    if (foo)
    bar();

    checkpatch would not emit a warning.

    Improve the test to warn when code after a logical test has the same
    indentation as the logical test.

    Perform the same indentation test for "else" blocks too.

    Link: http://lkml.kernel.org/r/df2374b68c4a68af2b7ef08afe486584811f610a.1493683942.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The current test works only for a single patch context as it is done in
    the foreach ($rawlines) loop that precedes the loop where the actual
    $context_function variable is used.

    Move the setting of $context_function into the foreach (@lines) loop,
    where it is useful for each patch context.

    Link: http://lkml.kernel.org/r/6c675a31c74fbfad4fc45b9f462303d60ca2a283.1493486091.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When using checkpatch on out-of-tree code, it may occur that some
    project-specific types are used, which will cause spurious warnings.
    Add the --typedefsfile option as a way to extend the known types and
    deal with this issue.

    This was developed for OP-TEE [1]. We run a Travis job on all pull
    requests [2], and checkpatch is part of that. The typical false warning
    we get on a regular basis is with some pointers to functions returning
    TEE_Result [3], which is a typedef from the GlobalPlatform APIs. We
    consider it acceptable to use GP types in the OP-TEE core
    implementation, which is why this patch would be helpful for us.

    [1] https://github.com/OP-TEE/optee_os
    [2] https://travis-ci.org/OP-TEE/optee_os/builds
    [3] https://travis-ci.org/OP-TEE/optee_os/builds/193355335#L1733
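
    Usage is then along these lines (file name illustrative; the file is
    assumed to list one additional type per line):

    $ echo "TEE_Result" > my_types.txt
    $ ./scripts/checkpatch.pl --typedefsfile=my_types.txt --file core/foo.c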

    Link: http://lkml.kernel.org/r/ba1124d6dfa599bb0dd1d8919dd45dd09ce541a4.1492702192.git.jerome.forissier@linaro.org
    Signed-off-by: Jerome Forissier
    Cc: Joe Perches
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Forissier
     
  • Find multi-line uses of k.alloc by using the $stat variable and not the
    $line variable. This can still --fix only the single-line variant,
    though.

    Link: http://lkml.kernel.org/r/3f4b23d37cd4c7d8628eefc25afe83ba8fb3ab55.1493167076.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently, checkpatch.pl does not recognize git's default revert commit
    message and will complain about the hash format. Add a special check for
    the revert commit message line to fix this.
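
    For reference, git's default revert message has this shape (subject and
    hash are placeholders):

    Revert "some original subject"

    This reverts commit 1234567890abcdef1234567890abcdef12345678.

    The bare 40-character hash on the "This reverts commit" line is what
    previously triggered the hash-format complaint.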

    Link: http://lkml.kernel.org/r/20170411191532.74381-1-wvw@google.com
    Signed-off-by: Wei Wang
    Acked-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Wang
     
  • Try to make the conversion of embedded function names to "%s: ", __func__
    a bit clearer.
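
    For example (hypothetical function and message), the suggested
    conversion is from

    pr_err("foo_probe: request failed\n");

    to

    pr_err("%s: request failed\n", __func__);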

    Add a bit more information to the comment describing the test too.

    Link: http://lkml.kernel.org/r/38f5d32f0aec1cd98cb9ceeedd6a736cc9a802db.1491759835.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The logic currently misses macros that start with an if statement.

    e.g.: #define foo(bar) if (bar) baz;

    Add a test for macro content that starts with if.
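
    The hazard with such macros (sketch) shows up under an outer if/else:

    #define foo(bar) if (bar) baz;

    if (cond)
            foo(x);
    else            /* no longer pairs with "if (cond)": compile error here,
                     * or silent misbinding if the macro lacked the ';' */
            other();

    The usual remedy is to wrap the macro body in do { ... } while (0).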

    Link: http://lkml.kernel.org/r/a9d41aafe1673889caf1a9850208fb7fd74107a0.1491783914.git.joe@perches.com
    Signed-off-by: Joe Perches
    Reported-by: Andreas Mohr
    Original-patch-by: Alfonso Lima
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Many structs are generally used const and there is a known list of these
    structs.

    struct definitions should not generally be declared const.

    Add a test for the lack of an open brace immediately after the struct
    name, so that struct definitions are not flagged.

    This avoids the false positive "struct foo should normally be const"
    message, but only when the open brace is on the same line as the
    definition.
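
    In other words (file_operations being one of the structs on checkpatch's
    known-should-be-const list), an instance declaration like

    static struct file_operations foo_fops = {
            .open = foo_open,
    };

    should still get the "should normally be const" suggestion, while the
    definition of the type itself,

    struct file_operations {
            ...
    };

    should not be flagged.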

    Link: http://lkml.kernel.org/r/0dce709150d712e66f1b90b03827634b53b28085.1491845946.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Arthur Brainville
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Allow a leading space on an otherwise blank line in the email headers,
    as it can be part of a line-wrapped SpamAssassin multiple-line string or
    any other valid RFC 2822/5322 email header.

    The line with a space causes checkpatch to erroneously think that it's in
    the content body, as opposed to the headers, and thus flag a mail header
    as an unwrapped long comment line.

    Link: http://lkml.kernel.org/r/d75a9f0b78b3488078429f4037d9fff3bdfa3b78.1490247180.git.joe@perches.com
    Signed-off-by: Joe Perches
    Reported-by: Darren Hart (VMware)
    Tested-by: Darren Hart (VMware)
    Reviewed-by: Darren Hart (VMware)
    Original-patch-by: John 'Warthog9' Hawley (VMware)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The existing behavior relies on patch context to identify function
    declarations. Add the ability to find function declarations when there
    is an open brace in column 1.

    This finds function declarations only in specific single line forms
    where the function name is on a single line like:

    int foo(args...)
    {

    and

    int
    foo(args...)
    {

    It does not recognize function declarations like:

    int foo(int bar,
    int baz)
    {

    Link: http://lkml.kernel.org/r/738d74bbbe1a06b80f11ed504818107c68903095.1488155636.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • %pK was at least once misused as %pk in an out-of-tree module. This
    led to some security concerns. Add the ability to track single- and
    multiple-line statements for misuses of %p.

    [akpm@linux-foundation.org: add helpful comment into lib/vsprintf.c]
    [akpm@linux-foundation.org: text tweak]
    Link: http://lkml.kernel.org/r/163a690510e636a23187c0dc9caa09ddac6d4cde.1488228427.git.joe@perches.com
    Signed-off-by: Joe Perches
    Acked-by: Kees Cook
    Acked-by: William Roberts
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Config EXPERIMENTAL was removed from the kernel in 2013 (see commit
    3d374d09f16f: "final removal of CONFIG_EXPERIMENTAL"), so there is no
    reason to do these checks now.

    Link: http://lkml.kernel.org/r/1488234097-20119-1-git-send-email-ruslan.bilovol@gmail.com
    Signed-off-by: Ruslan Bilovol
    Acked-by: Kees Cook
    Acked-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ruslan Bilovol
     
  • If you modify the target asm, we currently do not force recompilation
    of the firmware files. The target asm is in firmware/Makefile, so peg
    this file as a dependency to require re-compilation of the firmware
    targets when the asm changes.

    Link: http://lkml.kernel.org/r/20170123150727.4883-1-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Masahiro Yamada
    Cc: Michal Marek
    Cc: Ming Lei
    Cc: Greg Kroah-Hartman
    Cc: Tom Gundersen
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Extract the linked list sorting test code into its own source file, so
    that it can be compiled either as a loadable module or built into the
    kernel.

    Link: http://lkml.kernel.org/r/1488287219-15832-4-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Andy Shevchenko
    Cc: Arnd Bergmann
    Cc: Paul Gortmaker
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Allow the array-based sort test code to be compiled either as a loadable
    module or built into the kernel.

    Link: http://lkml.kernel.org/r/1488287219-15832-3-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Andy Shevchenko
    Cc: Arnd Bergmann
    Cc: Paul Gortmaker
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Patch series "lib: add module support to sort tests".

    This patch series allows the array-based and linked list sort test code
    to be compiled either as loadable modules or built into the kernel.

    It's very valuable to have modular tests, so you can run them just by
    insmodding the test modules, instead of needing a separate kernel that
    runs them at boot.

    This patch (of 3):

    This reverts commit 8893f519330bb073a49c5b4676fce4be6f1be15d.

    It's very valuable to have modular tests, so you can run them just by
    insmodding the test modules, instead of needing a separate kernel that
    runs them at boot.

    Link: http://lkml.kernel.org/r/1488287219-15832-2-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Andy Shevchenko
    Cc: Arnd Bergmann
    Cc: Paul Gortmaker
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • c2port_device_register() never returns NULL; it uses error pointers.
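
    Callers therefore need an IS_ERR()/PTR_ERR() check rather than a NULL
    test, roughly (sketch, arguments elided):

    c2dev = c2port_device_register(...);
    if (IS_ERR(c2dev))
            return PTR_ERR(c2dev);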

    Link: http://lkml.kernel.org/r/20170412083321.GC3250@mwanda
    Fixes: 65131cd52b9e ("c2port: add c2port support for Eurotech Duramar 2150")
    Signed-off-by: Dan Carpenter
    Acked-by: Rodolfo Giometti
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • The "DIV_ROUND_UP(size, PAGE_SIZE)" operation can overflow if "size" is
    more than ULLONG_MAX - PAGE_SIZE.
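
    That is because the macro is defined as

    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    so for a u64 "size" near ULLONG_MAX the "+ PAGE_SIZE - 1" addition wraps
    around and the division yields a tiny page count instead of failing.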

    Link: http://lkml.kernel.org/r/20170322111950.GA11279@mwanda
    Signed-off-by: Dan Carpenter
    Cc: Jorgen Hansen
    Cc: Masahiro Yamada
    Cc: Michal Hocko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • When I was running my testcase, which may block hundreds of threads on fs
    locks, I got a lockup due to output from debug_show_all_locks() added by
    commit b2d4c2edb2e4 ("locking/hung_task: Show all locks").

    For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state
    and 500 out of 1000 threads hold some lock, debug_show_all_locks() called
    from the for_each_process_thread() loop will report the locks held by 500
    threads 1000 times. This is too much noise.

    In order to make sure rcu_lock_break() is called frequently, we should
    avoid calling debug_show_all_locks() from the for_each_process_thread()
    loop, because debug_show_all_locks() effectively runs a
    for_each_process_thread() loop of its own. Let's defer calling
    debug_show_all_locks() until just before panic() or after leaving the
    for_each_process_thread() loop.

    Link: http://lkml.kernel.org/r/1489296834-60436-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Vegard Nossum
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Add a top-level Makefile help target for Userspace tools.

    Also make each help "heading" end with a colon ':'.

    Link: http://lkml.kernel.org/r/55c986ff-3966-3e47-2984-7349da2cce51@infradead.org
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • jiffies_64 is defined in kernel/time/timer.c with
    ____cacheline_aligned_in_smp; however, this macro is not part of the
    declaration of jiffies and jiffies_64 in jiffies.h.

    As a result clang generates the following warning:

    kernel/time/timer.c:57:26: error: section does not match previous declaration [-Werror,-Wsection]
    __visible u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;
    ^
    include/linux/cache.h:39:36: note: expanded from macro '__cacheline_aligned_in_smp'
    ^
    include/linux/cache.h:34:4: note: expanded from macro '__cacheline_aligned'
    __section__(".data..cacheline_aligned")))
    ^
    include/linux/jiffies.h:77:12: note: previous attribute is here
    extern u64 __jiffy_data jiffies_64;
    ^
    include/linux/jiffies.h:70:38: note: expanded from macro '__jiffy_data'

    Link: http://lkml.kernel.org/r/20170403190200.70273-1-mka@chromium.org
    Signed-off-by: Matthias Kaehlcke
    Cc: "Jason A . Donenfeld"
    Cc: Grant Grundler
    Cc: Michael Davidson
    Cc: Greg Hackmann
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • Moving from get_user_pages() to get_user_pages_unlocked() simplifies the
    code and takes advantage of VM_FAULT_RETRY functionality when faulting
    in pages.
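
    The simplification is roughly of this shape (sketch; flags, counts and
    names are illustrative):

    down_read(&current->mm->mmap_sem);
    ret = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
    up_read(&current->mm->mmap_sem);

    becomes a single call that manages mmap_sem itself and can drop it to
    service VM_FAULT_RETRY:

    ret = get_user_pages_unlocked(addr, 1, &page, FOLL_WRITE);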

    Link: http://lkml.kernel.org/r/20161101194332.23961-1-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Cc: Michal Hocko
    Cc: Paolo Bonzini
    Cc: Kumar Gala
    Cc: Mihai Caraman
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • do_proc_dointvec_jiffies_conv() uses LONG_MAX/HZ as the max value to
    avoid overflow. But *valp is actually of int type, so overflow still
    occurs.

    For example,

    echo 2147483647 > ./sys/net/ipv4/tcp_keepalive_time

    Then,

    cat ./sys/net/ipv4/tcp_keepalive_time

    The output is "-1", which is not expected.

    Now use INT_MAX/HZ as the max value instead of LONG_MAX/HZ to fix it.
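
    The write path, simplified (sketch, not the exact kernel code), converts
    seconds to jiffies and stores the result in an int:

    if (secs > LONG_MAX / HZ)   /* old bound: practically never trips */
            return 1;
    *valp = secs * HZ;          /* int store: 2147483647 * HZ wraps   */

    Bounding against INT_MAX / HZ instead rejects values that cannot be
    represented after the multiplication.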

    Link: http://lkml.kernel.org/r/1490109532-9228-1-git-send-email-fgao@ikuai8.com
    Signed-off-by: Gao Feng
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Eric Dumazet
    Cc: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao Feng
     
  • Coccinelle emits this warning:

    WARNING: casting value returned by memory allocation function to (struct proc_inode *) is useless.

    Remove unnecessary cast.
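
    The change is of this form (sketch based on the struct named in the
    warning):

    ei = (struct proc_inode *)kmem_cache_alloc(proc_inode_cachep, GFP_KERNEL);

    becomes

    ei = kmem_cache_alloc(proc_inode_cachep, GFP_KERNEL);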

    Link: http://lkml.kernel.org/r/1487745720-16967-1-git-send-email-me@tobin.cc
    Signed-off-by: Tobin C. Harding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C. Harding
     
  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished even
    though the high-order page already exists. Note that it might be
    possible that the high-order page formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblock of the same migratetype, where it cannot result in
    a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal of
    migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.

    In testing based on a 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The migrate scanner in async compaction is currently limited to
    MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce
    latency, based on the assumption that non-MOVABLE pageblocks are
    unlikely to contain movable pages.

    However, with the exception of THPs, most high-order allocations are
    not movable. Should the async compaction succeed, this increases the
    chance that the non-MOVABLE allocations will fall back to a MOVABLE
    pageblock, making the long-term fragmentation worse.

    This patch attempts to help the situation by changing async direct
    compaction so that the migrate scanner only scans the pageblocks of the
    requested migratetype. If it's a non-MOVABLE type and there are such
    pageblocks that do contain movable pages, chances are that the
    allocation can succeed within one of such pageblocks, removing the need
    for a fallback. If that fails, the subsequent sync attempt will ignore
    this restriction.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 30%. The number of movable allocations falling back is reduced by
    12%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation patch. We are going to need migratetype at lower layers
    than compact_zone() and compact_finished().

    Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation for making the decisions more complex and depending on
    compact_control flags. No functional change.

    Link: http://lkml.kernel.org/r/20170307131545.28577-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When stealing pages from pageblock of a different migratetype, we count
    how many free pages were stolen, and change the pageblock's migratetype
    if more than half of the pageblock was free. This might be too
    conservative, as there might be other pages that are not free, but were
    allocated with the same migratetype as our allocation requested.

    While we cannot determine the migratetype of allocated pages precisely
    (at least without the page_owner functionality enabled), we can count
    pages that compaction would try to isolate for migration - those are
    either on LRU or __PageMovable(). The rest can be assumed to be
    MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
    distinguish. This counting can be done as part of free page stealing
    with little additional overhead.

    The page stealing code is changed so that it considers free pages plus
    pages of the "good" migratetype for the decision whether to change
    pageblock's migratetype.

    The result should be more accurate migratetype of pageblocks wrt the
    actual pages in the pageblocks, when stealing from semi-occupied
    pageblocks. This should help the efficiency of page grouping by
    mobility.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 47%. The number of movable allocations falling back to other
    pageblocks are increased by 55%, but these events don't cause permanent
    fragmentation, so the tradeoff should be positive. Later patches also
    offset the movable fallback increase to some extent.

    [akpm@linux-foundation.org: merge fix]
    Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The __rmqueue_fallback() function is called when there's no free page of
    requested migratetype, and we need to steal from a different one.

    There are various heuristics to make this event infrequent and reduce
    permanent fragmentation. The main one is to try stealing from a
    pageblock that has the most free pages, and possibly steal them all at
    once and convert the whole pageblock. Precise searching for such
    pageblock would be expensive, so instead the heuristic walks the free
    lists from MAX_ORDER down to the requested order and assumes that the
    block with the highest-order free page is likely to also have the most
    free pages in total.

    Chances are that, together with the highest-order page, we also steal
    pages of lower orders from the same block. But then we still split the
    highest-order page. This is wasteful and can contribute to
    fragmentation instead of avoiding it.

    This patch thus changes __rmqueue_fallback() to just steal the page(s)
    and put them on the freelist of the requested migratetype, and only
    report whether it was successful. Then we pick (and eventually split)
    the smallest page with __rmqueue_smallest(). This all happens under
    zone lock, so nobody can steal it from us in the process. This should
    reduce fragmentation due to fallbacks. At worst we are only stealing a
    single highest-order page and waste some cycles by moving it between
    lists and then removing it, but fallback is not exactly hot path so that
    should not be a concern. As a side benefit the patch removes some
    duplicate code by reusing __rmqueue_smallest().

    [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
    Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
    Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When detecting whether compaction has succeeded in forming a high-order
    page, __compact_finished() employs a watermark check, followed by its own
    search for a suitable page in the freelists. This is not ideal for two
    reasons:

    - The watermark check also searches high-order freelists, but has a
    less strict criterion wrt fallback. It's therefore redundant and a waste
    of cycles. This was different in the past, when the high-order watermark
    check attempted to apply reserves to high-order pages.

    - The watermark check might actually fail due to lack of order-0 pages.
    Compaction can't help with that, so there's no point in continuing
    because of that. It's possible that the high-order page still exists,
    and in that case compaction terminates.

    This patch therefore removes the watermark check. This should save some
    cycles and terminate compaction sooner in some cases.

    Link: http://lkml.kernel.org/r/20170307131545.28577-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "try to reduce fragmenting fallbacks", v3.

    Last year, Johannes Weiner reported a regression in page mobility
    grouping [1] and, while the exact cause was not found, I've come up with
    some ways to improve it by reducing the number of allocations falling
    back to a different migratetype and causing permanent fragmentation.

    The series was tested with mmtests stress-highalloc modified to do
    GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
    balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
    wasn't woken up) on UMA machine with 4GB memory. There were 5 repeats
    of each run, as the extfrag stats are quite volatile (note the stats
    below are sums, not averages, as it was less perl hacking for me).

    Success rates are the same, already high due to the low allocation order
    used, so I'm not including them.

    Compaction stats:
    (the patches are stacked, and I haven't measured the non-functional-changes
    patches separately)

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Compaction stalls 22449 24680 24846 19765 22059 17480
    Compaction success 12971 14836 14608 10475 11632 8757
    Compaction failures 9477 9843 10238 9290 10426 8722
    Page migrate success 3109022 3370438 3312164 1695105 1608435 2111379
    Page migrate failure 911588 1149065 1028264 1112675 1077251 1026367
    Compaction pages isolated 7242983 8015530 7782467 4629063 4402787 5377665
    Compaction migrate scanned 980838938 987367943 957690188 917647238 947155598 1018922197
    Compaction free scanned 557926893 598946443 602236894 594024490 541169699 763651731
    Compaction cost 10243 10578 10304 8286 8398 9440

    Compaction stats are mostly within noise until patch 4, which decreases
    the number of compactions, and migrations. Part of that could be due to
    more pageblocks marked as unmovable, and async compaction skipping
    those. This changes a bit with patch 7, but not so much. Patch 8
    increases free scanner stats and migrations, which comes from the
    changed termination criteria. Interestingly, the number of compactions
    decreases - probably the fully compacted pageblock satisfies multiple
    subsequent allocations, so it amortizes.

    Next comes the extfrag tracepoint, where "fragmenting" means that an
    allocation had to fallback to a pageblock of another migratetype which
    wasn't fully free (which is almost all of the fallbacks). I have
    locally added another tracepoint for "Page steal" into
    steal_suitable_fallback() which triggers in situations where we are
    allowed to do move_freepages_block(). If we decide to also do
    set_pageblock_migratetype(), it's "Pages steal with pageblock" with
    break down for which allocation migratetype we are stealing and from
    which fallback migratetype. The last part "due to counting" comes from
    patch 4 and counts the events where the counting of movable pages
    allowed us to change pageblock's migratetype, while the number of free
    pages alone wouldn't be enough to cross the threshold.

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Page alloc extfrag event 10155066 8522968 10164959 15622080 13727068 13140319
    Extfrag fragmenting 10149231 8517025 10159040 15616925 13721391 13134792
    Extfrag fragmenting for unmovable 159504 168500 184177 97835 70625 56948
    Extfrag fragmenting unmovable placed with movable 153613 163549 172693 91740 64099 50917
    Extfrag fragmenting unmovable placed with reclaim. 5891 4951 11484 6095 6526 6031
    Extfrag fragmenting for reclaimable 4738 4829 6345 4822 5640 5378
    Extfrag fragmenting reclaimable placed with movable 1836 1902 1851 1579 1739 1760
    Extfrag fragmenting reclaimable placed with unmov. 2902 2927 4494 3243 3901 3618
    Extfrag fragmenting for movable 9984989 8343696 9968518 15514268 13645126 13072466
    Pages steal 179954 192291 210880 123254 94545 81486
    Pages steal with pageblock 22153 18943 20154 33562 29969 33444
    Pages steal with pageblock for unmovable 14350 12858 13256 20660 19003 20852
    Pages steal with pageblock for unmovable from mov. 12812 11402 11683 19072 17467 19298
    Pages steal with pageblock for unmovable from recl. 1538 1456 1573 1588 1536 1554
    Pages steal with pageblock for movable 7114 5489 5965 11787 10012 11493
    Pages steal with pageblock for movable from unmov. 6885 5291 5541 11179 9525 10885
    Pages steal with pageblock for movable from recl. 229 198 424 608 487 608
    Pages steal with pageblock for reclaimable 689 596 933 1115 954 1099
    Pages steal with pageblock for reclaimable from unmov. 273 219 537 658 547 667
    Pages steal with pageblock for reclaimable from mov. 416 377 396 457 407 432
    Pages steal with pageblock due to counting 11834 10075 7530
    ... for unmovable 8993 7381 4616
    ... for movable 2792 2653 2851
    ... for reclaimable 49 41 63

    What we can see is that "Extfrag fragmenting for unmovable" and "...
    placed with movable" drops with almost each patch, which is good as we
    are polluting less movable pageblocks with unmovable pages.

    The most significant change is patch 4 with movable page counting. On
    the other hand it increases "Extfrag fragmenting for movable" by 50%.
    "Pages steal" drops though, so these movable allocation fallbacks find
    only small free pages and are not allowed to steal whole pageblocks
    back. "Pages steal with pageblock" raises, because the patch increases
    the chances of pageblock migratetype changes to happen. This affects
    all migratetypes.

    The summary is that patch 4 is not a clear win wrt these stats, but I
    believe that the tradeoff it makes is a good one. There's less
    pollution of movable pageblocks by unmovable allocations. There's less
    stealing between pageblocks, and the steals that remain have a higher
    chance of also changing the migratetype of the pageblock itself, so it
    should more faithfully reflect the migratetype of the pages within the
    pageblock.
    The increase of movable allocations falling back to unmovable pageblock
    might look dramatic, but those allocations can be migrated by compaction
    when needed, and other patches in the series (7-9) improve that aspect.

    Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
    also reduce the impact on movable fallbacks from patch 4.

    [1] https://www.spinics.net/lists/linux-mm/msg114237.html

    This patch (of 8):

    Currently there are (mostly by accident) no holes in struct
    compact_control (on x86_64), but we are going to add more bool flags, so
    place them all together at the end of the structure. While at it, just
    order all fields from largest to smallest.

    Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka