09 Sep, 2017

40 commits

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
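
    A minimal sketch of the quick checks this enables, using a hypothetical
    node layout (the real code lives in the INTERVAL_TREE_DEFINE() template
    and keys off the augmented subtree-last value plus the cached leftmost
    node):

    /* hypothetical node: interval [start, last] plus the augmented maximum */
    struct it_node {
            struct rb_node rb;
            unsigned long start, last, subtree_last;
    };

    static struct it_node *it_iter_first(struct rb_root_cached *root,
                                         unsigned long start, unsigned long last)
    {
            struct it_node *node, *leftmost;

            if (!root->rb_root.rb_node)
                    return NULL;

            /* [start, last] overlaps [b0, b1] iff start <= b1 && b0 <= last */
            node = rb_entry(root->rb_root.rb_node, struct it_node, rb);
            if (node->subtree_last < start)         /* everything ends before us */
                    return NULL;

            leftmost = rb_entry(rb_first_cached(root), struct it_node, rb);
            if (leftmost->start > last)             /* everything starts after us */
                    return NULL;

            /* ... otherwise do the usual subtree-last guided descent ... */
            return node;                            /* placeholder for the real walk */
    }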

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-11-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Jan Kara
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-10-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-9-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-8-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • We can work with a single rb_root_cached root to test both cached and
    non-cached rbtrees. In addition, add a test to measure latencies between
    rb_first() and its fast counterpart.

    Link: http://lkml.kernel.org/r/20170719014603.19029-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This adds a second test dedicated to regular rb-tree testing, as there
    is no need to repeat it for the augmented flavor.

    Link: http://lkml.kernel.org/r/20170719014603.19029-6-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Allows for more flexible debugging.

    Link: http://lkml.kernel.org/r/20170719014603.19029-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • While overall the code is very nicely commented, it might not be
    immediately obvious from the diagrams what is going on. Add a very
    brief summary of each case. Opposite cases where the node is the left
    child are left untouched.

    Link: http://lkml.kernel.org/r/20170719014603.19029-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The only times the nil-parent (root node) condition is true are when the
    node is the first in the tree, or when case 1 rebalancing (done while
    fixing rbtree rule #4) has made the node the root. Such conditions do
    not apply most of the time:

    (i) The common case in an rbtree is to have more than a single node,
    so this is only true for the first rb_insert().

    (ii) While there is a chance only one first rotation is needed, cases
    where the node's uncle is black (cases 2,3) are more common as we can
    have the following scenarios during the rotation looping:

    case1 only, case1+1, case2+3, case1+2+3, case3 only, etc.

    This patch, therefore, adds an unlikely() optimization to this
    conditional. When profiling with CONFIG_PROFILE_ANNOTATED_BRANCHES, a
    kernel build shows that the incorrect rate is less than 15%, and
    insert-mostly workloads tend over time to have an incorrect rate below
    2%.
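
    For reference, the branch in question ends up looking roughly like this
    inside __rb_insert()'s rebalancing loop (a simplified excerpt, not the
    full function):

    if (unlikely(!parent)) {
            /*
             * The inserted node is root: either it is the first node, or
             * case 1 rebalancing recursed all the way up to the root.
             */
            rb_set_parent_color(node, NULL, RB_BLACK);
            break;
    }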

    Link: http://lkml.kernel.org/r/20170719014603.19029-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "rbtree: Cache leftmost node internally", v4.

    A series extending rbtrees to internally cache the leftmost node so that
    we can have a fast overlap-check optimization for all interval tree
    users[1]. The benefits of this series are:

    (i) Unify users that do internal leftmost node caching.
    (ii) Optimize all interval tree users.
    (iii) Convert at least two new users (epoll and procfs) to the new interface.

    This patch (of 16):

    Red-black tree semantics imply that nodes with smaller or greater (or
    equal for duplicates) keys are always to the left and right,
    respectively. For the kernel this is extremely evident when considering
    our rb_first() semantics. Enabling lookups for the smallest node in the
    tree in O(1) can save a good chunk of cycles by not having to walk down
    the tree each time. To this end there are a few core users that
    explicitly do this, such as the scheduler and rtmutexes. There is also
    the desire for interval trees to have this optimization, allowing faster
    overlap checking.

    This patch introduces a new 'struct rb_root_cached', which is just the
    root with a cached pointer to the leftmost node. The reason a new
    structure was added, rather than extending the regular rb_root, is that
    this lets users choose between memory footprint and actual tree
    performance. The new wrappers on top of the regular rb_root calls are:

    - rb_first_cached(cached_root) -- which is a fast replacement
    for rb_first.

    - rb_insert_color_cached(node, cached_root, new)

    - rb_erase_cached(node, cached_root)

    In addition, augmented cached interfaces are also added for basic
    insertion and deletion operations, which becomes important for the
    interval tree changes.

    With the exception of the inserts, which add a bool for updating the
    new leftmost, the interfaces are kept the same. To this end, porting rb
    users to the cached version is trivial, and keeping the current rbtree
    semantics for users that don't care about the optimization comes at zero
    overhead.
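
    A brief usage sketch of the new interface (the node type and key below
    are hypothetical; the caller tells the insert whether the new node
    became the leftmost):

    struct my_node {
            struct rb_node rb;
            unsigned long key;
    };

    static struct rb_root_cached my_root = RB_ROOT_CACHED;

    static void my_insert(struct my_node *new)
    {
            struct rb_node **link = &my_root.rb_root.rb_node, *parent = NULL;
            bool leftmost = true;

            while (*link) {
                    struct my_node *cur = rb_entry(*link, struct my_node, rb);

                    parent = *link;
                    if (new->key < cur->key) {
                            link = &parent->rb_left;
                    } else {
                            link = &parent->rb_right;
                            leftmost = false;       /* went right at least once */
                    }
            }

            rb_link_node(&new->rb, parent, link);
            rb_insert_color_cached(&new->rb, &my_root, leftmost);
    }

    static void my_erase(struct my_node *node)
    {
            /* keeps the cache valid; rb_first_cached(&my_root) stays O(1) */
            rb_erase_cached(&node->rb, &my_root);
    }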

    Link: http://lkml.kernel.org/r/20170719014603.19029-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Jan Kara
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • GENMASK(_ULL) performs a left-shift of ~0UL(L), which technically
    results in an integer overflow. clang raises a warning if the overflow
    occurs in a preprocessor expression. Clear the low-order bits through a
    subtraction instead of the left-shift to avoid the overflow.

    (akpm: no change in .text size in my testing)
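
    For reference, the UL flavor ends up roughly as follows (the ULL variant
    is analogous):

    /* (~0UL) - (1UL << (l)) + 1 clears the low l bits without shifting ~0UL */
    #define GENMASK(h, l) \
            (((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))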

    Link: http://lkml.kernel.org/r/20170803212020.24939-1-mka@chromium.org
    Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • We have seen some generic code use the config parameter
    CONFIG_CPU_BIG_ENDIAN to decide the endianness.

    Here are a few examples:
    include/asm-generic/qrwlock.h
    drivers/of/base.c
    drivers/of/fdt.c
    drivers/tty/serial/earlycon.c
    drivers/tty/serial/serial_core.c

    Display a warning if CPU_BIG_ENDIAN is not defined on a big endian
    architecture, and also warn if it is defined on a little endian
    architecture.
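
    A minimal sketch of the kind of consistency check described
    (illustrative only, not the exact hunk that was merged):

    #if defined(__BIG_ENDIAN) && !defined(CONFIG_CPU_BIG_ENDIAN)
    #warning inconsistent configuration, needs CONFIG_CPU_BIG_ENDIAN
    #endif

    #if defined(__LITTLE_ENDIAN) && defined(CONFIG_CPU_BIG_ENDIAN)
    #warning inconsistent configuration, CONFIG_CPU_BIG_ENDIAN is set on a little endian build
    #endif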

    Here is our original discussion
    https://lkml.org/lkml/2017/5/24/620

    Link: http://lkml.kernel.org/r/1499358861-179979-4-git-send-email-babu.moger@oracle.com
    Signed-off-by: Babu Moger
    Suggested-by: Arnd Bergmann
    Acked-by: Geert Uytterhoeven
    Cc: "James E.J. Bottomley"
    Cc: Alexander Viro
    Cc: David S. Miller
    Cc: Greg KH
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Jonas Bonn
    Cc: Max Filippov
    Cc: Michael Ellerman (powerpc)
    Cc: Michal Simek
    Cc: Peter Zijlstra
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Babu Moger
     
  • The microblaze architecture can be configured for either little or big
    endian formats. Add a choice option for the user to select the correct
    endian format (defaulting to big endian).

    Also update the Makefile so the toolchain compiles for the format it is
    configured for.

    Link: http://lkml.kernel.org/r/1499358861-179979-3-git-send-email-babu.moger@oracle.com
    Signed-off-by: Babu Moger
    Signed-off-by: Arnd Bergmann
    Cc: Michal Simek
    Cc: "James E.J. Bottomley"
    Cc: Alexander Viro
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Greg KH
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Jonas Bonn
    Cc: Max Filippov
    Cc: Michael Ellerman (powerpc)
    Cc: Peter Zijlstra
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Babu Moger
     
  • Patch series "Define CPU_BIG_ENDIAN or warn for inconsistencies", v3.

    While working on enabling queued rwlocks on SPARC, we found the
    following code in include/asm-generic/qrwlock.h, which uses
    CONFIG_CPU_BIG_ENDIAN to clear a byte:

    static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
    {
            return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
    }

    The problem is that many of the fixed big endian architectures don't
    define CPU_BIG_ENDIAN, so the wrong byte gets cleared.

    Define CPU_BIG_ENDIAN for all the fixed big endian architectures to fix
    it.

    We also found a few more references to this config parameter in:
    drivers/of/base.c
    drivers/of/fdt.c
    drivers/tty/serial/earlycon.c
    drivers/tty/serial/serial_core.c

    Be aware that this may cause regressions if someone has already worked
    around problems in the above code; if so, remove the work-around.

    Here is our original discussion
    https://lkml.org/lkml/2017/5/24/620

    Link: http://lkml.kernel.org/r/1499358861-179979-2-git-send-email-babu.moger@oracle.com
    Signed-off-by: Babu Moger
    Suggested-by: Arnd Bergmann
    Acked-by: Geert Uytterhoeven
    Acked-by: David S. Miller
    Acked-by: Stafford Horne
    Cc: Yoshinori Sato
    Cc: Jonas Bonn
    Cc: Stefan Kristiansson
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Alexander Viro
    Cc: Michal Simek
    Cc: Michael Ellerman (powerpc)
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Max Filippov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Babu Moger
     
  • First, the number of CPUs can't be a negative number.

    Second, different signedness leads to suboptimal code in the following
    cases:

    1)
    kmalloc(nr_cpu_ids * sizeof(X));

    "int" has to be sign extended to size_t.

    2)
    while (loff_t *pos < nr_cpu_ids)

    MOVSXD is 1 byte longer than the same MOV.

    Other cases exist as well. Basically the compiler is told that
    nr_cpu_ids can't be negative, which can't be deduced if it is "int".
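
    The change itself essentially boils down to the declaration (sketch):

    extern unsigned int nr_cpu_ids;     /* was: extern int nr_cpu_ids; */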

    Code savings on allyesconfig kernel: -3KB

    add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
    function old new delta
    coretemp_cpu_online 450 512 +62
    rcu_init_one 1234 1272 +38
    pci_device_probe 374 399 +25

    ...

    pgdat_reclaimable_pages 628 556 -72
    select_fallback_rq 446 369 -77
    task_numa_find_cpu 1923 1807 -116

    Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Where possible, call memset16(), memmove() or memcpy() instead of using
    open-coded loops. I don't like the calling convention that uses a byte
    count instead of a count of u16s, but it's a little late to change that.
    Reduces code size of fbcon.o by almost 400 bytes on my laptop build.

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/20170720184539.31609-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: Sam Ravnborg
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • memset32() can be used to initialise these three arrays. Minor code
    footprint reduction.

    Link: http://lkml.kernel.org/r/20170720184539.31609-8-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: "H. Peter Anvin"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • zram was the motivation for creating memset_l(). Minchan Kim sees a 7%
    performance improvement on x86 with 100MB of non-zero deduplicatable
    data:

    perf stat -r 10 dd if=/dev/zram0 of=/dev/null

    vanilla: 0.232050465 seconds time elapsed ( +- 0.51% )
    memset_l: 0.217219387 seconds time elapsed ( +- 0.07% )
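
    The zram fill path then looks roughly like this (a sketch of the helper,
    assuming a length that is a multiple of sizeof(unsigned long)):

    static void zram_fill_page(void *ptr, unsigned long len, unsigned long value)
    {
            WARN_ON_ONCE(!IS_ALIGNED(len, sizeof(unsigned long)));
            memset_l(ptr, value, len / sizeof(unsigned long));
    }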

    Link: http://lkml.kernel.org/r/20170720184539.31609-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Tested-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Alpha already had an optimised fill-memory-with-16-bit-quantity
    assembler routine called memsetw(). It has a slightly different calling
    convention from memset16() in that it takes a byte count, not a count of
    words. That's the same convention used by ARM's __memset routines, so
    rename Alpha's routine to match and add a memset16() wrapper around it.
    Then convert Alpha's scr_memsetw() to call memset16() instead of
    memsetw().
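
    Conceptually, the wrapper only converts the element count into the byte
    count the assembler routine expects (simplified sketch):

    extern void *__memset16(void *s, unsigned short c, size_t n);  /* n in bytes */

    static inline void *memset16(uint16_t *p, uint16_t v, size_t n) /* n in u16s */
    {
            return __memset16(p, v, n * 2);
    }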

    Link: http://lkml.kernel.org/r/20170720184539.31609-6-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Reuse the existing optimised memset implementation to implement an
    optimised memset32 and memset64.

    Link: http://lkml.kernel.org/r/20170720184539.31609-5-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Russell King
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • These are single instructions on x86. There's no 64-bit instruction for
    x86-32, but we don't yet have any user for memset64() on 32-bit
    architectures, so don't bother to implement it.
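
    A sketch of the 32-bit variant as a single string instruction
    (simplified from the x86 string headers; the 16- and 64-bit ones use
    stosw/stosq):

    static inline void *memset32(uint32_t *s, uint32_t v, size_t n)
    {
            long d0, d1;

            asm volatile("rep stosl"
                         : "=&c" (d0), "=&D" (d1)
                         : "a" (v), "1" (s), "0" (n)
                         : "memory");
            return s;
    }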

    Link: http://lkml.kernel.org/r/20170720184539.31609-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • [akpm@linux-foundation.org: minor tweaks]
    Link: http://lkml.kernel.org/r/20170720184539.31609-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Patch series "Multibyte memset variations", v4.

    A relatively common idiom we're missing is a function to fill an area of
    memory with a pattern which is larger than a single byte. I first
    noticed this with a zram patch which wanted to fill a page with an
    'unsigned long' value. There turn out to be quite a few places in the
    kernel which can benefit from using an optimised function rather than a
    loop; sometimes text size, sometimes speed, and sometimes both. The
    optimised PowerPC version (not included here) improves performance by
    about 30% on POWER8 on just the raw memset_l().

    Most of the extra lines of code come from the three testcases I added.

    This patch (of 8):

    memset16(), memset32() and memset64() are like memset(), but allow the
    caller to fill the destination with a value larger than a single byte.
    memset_l() and memset_p() allow the caller to use unsigned long and
    pointer values respectively.
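
    The generic fallback is a simple loop; a sketch of the 32-bit variant
    (the 16- and 64-bit versions are analogous, and memset_l()/memset_p()
    simply pick the width matching unsigned long / pointers):

    void *memset32(uint32_t *s, uint32_t v, size_t count)
    {
            uint32_t *xs = s;

            while (count--)
                    *xs++ = v;
            return s;
    }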

    Link: http://lkml.kernel.org/r/20170720184539.31609-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: "Martin K. Petersen"
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This macro is useful to avoid link errors on 32-bit systems.

    We have the same definition in two drivers, so move it to
    include/linux/kernel.h.

    While we are here, refactor DIV_ROUND_UP_ULL() by using
    DIV_ROUND_DOWN_ULL().
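
    For reference, the shared definitions end up roughly as follows
    (do_div() divides a 64-bit dividend in place):

    #define DIV_ROUND_DOWN_ULL(ll, d) \
            ({ unsigned long long _tmp = (ll); do_div(_tmp, d); _tmp; })

    #define DIV_ROUND_UP_ULL(ll, d)  DIV_ROUND_DOWN_ULL((ll) + (d) - 1, (d))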

    Link: http://lkml.kernel.org/r/1500945156-12907-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Mark Brown
    Cc: Cyrille Pitchen
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Cc: Liam Girdwood
    Cc: Boris Brezillon
    Cc: Marek Vasut
    Cc: Brian Norris
    Cc: Richard Weinberger
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • If there are large numbers of hugepages to iterate while reading
    /proc/pid/smaps, the page walk never does cond_resched(). On archs
    without split pmd locks, there can be significant and observable
    contention on mm->page_table_lock, which causes lengthy delays without
    rescheduling.

    Always reschedule in smaps_pte_range() if necessary since the pagewalk
    iteration can be expensive.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708211405520.131071@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Save some code from ~320 invocations, all of which clear the last
    argument.

    add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
    function old new delta
    proc_create - 17 +17
    __ksymtab_proc_create - 16 +16
    __kstrtab_proc_create - 12 +12
    yam_init_driver 301 298 -3

    ...

    cifs_proc_init 249 228 -21
    via_fb_pci_probe 2304 2280 -24
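
    The uninlined helper then simply forwards a NULL data argument (sketch):

    struct proc_dir_entry *proc_create(const char *name, umode_t mode,
                                       struct proc_dir_entry *parent,
                                       const struct file_operations *proc_fops)
    {
            return proc_create_data(name, mode, parent, proc_fops, NULL);
    }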

    Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks")
    removed the last user of the priv parameter in is_stack(), so the
    argument is redundant. Drop it.

    [arnd@arndb.de: remove unused variable]
    Link: http://lkml.kernel.org/r/20170801120150.1520051-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170728075833.7241-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • VMA and its address bounds checks are too late in this function. They
    must have been verified earlier in the page fault sequence. Hence just
    remove them.

    Link: http://lkml.kernel.org/r/20170901130137.7617-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Free frontswap_map if an error is encountered before enable_swap_info().

    Signed-off-by: David Rientjes
    Reviewed-by: "Huang, Ying"
    Cc: Darrick J. Wong
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If initializing a small swap file fails because the swap file has a
    problem (holes, etc.) then we need to free the cluster info as part of
    cleanup. Unfortunately a previous patch changed the code to use kvzalloc
    but did not change all the vfree calls to use kvfree.

    Found by running generic/357 from xfstests.
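
    The rule being restored, as an illustrative fragment (nr_clusters is a
    made-up name): anything allocated with kvzalloc() may be kmalloc-backed,
    so it must be released with kvfree(), never vfree():

    struct swap_cluster_info *ci;

    ci = kvzalloc(nr_clusters * sizeof(*ci), GFP_KERNEL);   /* kmalloc or vmalloc backed */
    if (!ci)
            return -ENOMEM;
    /* ... on the error path ... */
    kvfree(ci);                                             /* not vfree() */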

    Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia
    Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: "Huang, Ying"
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • We are erroneously initializing alloc_flags before gfp_allowed_mask is
    applied. This could cause problems after pm_restrict_gfp_mask() is
    called during a suspend operation. Apply gfp_allowed_mask before
    initializing alloc_flags so that the first allocation attempt uses the
    correct flags.

    Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
    Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • KCMP's KCMP_EPOLL_TFD mode was merged in commit 0791e3644e5ef2 ("kcmp:
    add KCMP_EPOLL_TFD mode to compare epoll target files"), but we've had
    no selftest for it yet (except on the criu development list). Thus add
    one.
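
    Roughly what the selftest exercises (hedged sketch; epfd, tfd, fd1, pid1
    and pid2 are illustrative, <linux/kcmp.h> provides the slot layout):

    struct kcmp_epoll_slot slot = {
            .efd  = epfd,   /* epoll fd in pid2 */
            .tfd  = tfd,    /* fd number registered in that epoll instance */
            .toff = 0,      /* which occurrence, if added more than once */
    };

    /* 0 means fd1 in pid1 and the described epoll target share the same file */
    ret = syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD, fd1, (unsigned long)&slot);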

    Link: http://lkml.kernel.org/r/20170901151620.GK1898@uranus.lan
    Signed-off-by: Cyrill Gorcunov
    Cc: Andrey Vagin
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • online_mem_sections() accidentally marks online only the first section
    in the given range. This is a typo which hasn't been noticed because I
    haven't tested large 2GB blocks previously. All users of
    pfn_to_online_page would get confused about the rest of the pfn range
    in the block.

    All we need to fix this is to use the iterator (pfn) rather than
    start_pfn.
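
    The fix boils down to resolving the section from the loop iterator
    (simplified sketch of the corrected loop):

    for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
            unsigned long section_nr = pfn_to_section_nr(pfn);     /* was: start_pfn */

            /* mark each section in the range, not the first one repeatedly */
            __nr_to_section(section_nr)->section_mem_map |= SECTION_IS_ONLINE;
    }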

    Link: http://lkml.kernel.org/r/20170904112210.3401-1-mhocko@kernel.org
    Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Anshuman Khandual
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Seen while reading the code: in handle_mm_fault(), when
    arch_vma_access_permitted() fails, the matching call to
    mem_cgroup_oom_disable() is never made.

    To fix that, move the call to mem_cgroup_oom_enable() to after the
    arch_vma_access_permitted() check, as we should not have entered the
    memcg OOM domain in the first place.
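
    A sketch of the described ordering (simplified from handle_mm_fault()):

    if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
                                   flags & FAULT_FLAG_INSTRUCTION,
                                   flags & FAULT_FLAG_REMOTE))
            return VM_FAULT_SIGSEGV;

    /* enter the userland memcg OOM domain only once the access is permitted */
    if (flags & FAULT_FLAG_USER)
            mem_cgroup_oom_enable();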

    Link: http://lkml.kernel.org/r/1504625439-31313-1-git-send-email-ldufour@linux.vnet.ibm.com
    Fixes: bae473a423f6 ("mm: introduce fault_env")
    Signed-off-by: Laurent Dufour
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • We've noticed a sizeable performance overhead on some hosts with
    significant network traffic when socket memory accounting is enabled.

    Perf top shows that the socket memory uncharging path is hot:
    2.13% [kernel] [k] page_counter_cancel
    1.14% [kernel] [k] __sk_mem_reduce_allocated
    1.14% [kernel] [k] _raw_spin_lock
    0.87% [kernel] [k] _raw_spin_lock_irqsave
    0.84% [kernel] [k] tcp_ack
    0.84% [kernel] [k] ixgbe_poll
    0.83% < workload >
    0.82% [kernel] [k] enqueue_entity
    0.68% [kernel] [k] __fget
    0.68% [kernel] [k] tcp_delack_timer_handler
    0.67% [kernel] [k] __schedule
    0.60% < workload >
    0.59% [kernel] [k] __inet6_lookup_established
    0.55% [kernel] [k] __switch_to
    0.55% [kernel] [k] menu_select
    0.54% libc-2.20.so [.] __memcpy_avx_unaligned

    To address this issue, the existing per-cpu stock infrastructure can be
    used.

    refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
    charge to a per-cpu stock instead of calling atomic
    page_counter_uncharge().

    To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
    will explicitly drain the cached charge, if the cached value exceeds
    CHARGE_BATCH.
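
    The uncharge side then becomes, roughly (sketch):

    /* in mem_cgroup_uncharge_skmem(), simplified */
    refill_stock(memcg, nr_pages);      /* was: page_counter_uncharge(&memcg->memory, nr_pages) */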

    This allows us to significantly optimize the load:
    1.21% [kernel] [k] _raw_spin_lock
    1.01% [kernel] [k] ixgbe_poll
    0.92% [kernel] [k] _raw_spin_lock_irqsave
    0.90% [kernel] [k] enqueue_entity
    0.86% [kernel] [k] tcp_ack
    0.85% < workload >
    0.74% perf-11120.map [.] 0x000000000061bf24
    0.73% [kernel] [k] __schedule
    0.67% [kernel] [k] __fget
    0.63% [kernel] [k] __inet6_lookup_established
    0.62% [kernel] [k] menu_select
    0.59% < workload >
    0.59% [kernel] [k] __switch_to
    0.57% libc-2.20.so [.] __memcpy_avx_unaligned

    Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The fadvise() manpage is silent on fadvise()'s effect on memory-based
    filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs,
    sysfs, kernfs). The current implementation of fadvise is mostly a noop
    for such filesystems, except for FADV_DONTNEED, which will trigger
    expensive remote LRU cache draining. This patch makes the noop of
    fadvise() on such file systems very explicit.

    However this change has two side effects for ramfs and one for tmpfs.
    First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed
    pages of ramfs (allocated through read, readahead & read fault) and
    tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could
    create such clean zero'ed pages for ramfs. This change removes those
    possibilities.

    One of our generic libraries does fadvise(FADV_DONTNEED). Recently we
    observed high latency in fadvise() and noticed that the users had
    started using tmpfs files; the latency was due to expensive remote LRU
    cache draining. For normal tmpfs files (which have data written to
    them), fadvise(FADV_DONTNEED) will always trigger the unneeded remote
    cache draining.

    Link: http://lkml.kernel.org/r/20170818011023.181465-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • zs_stat_inc/dec/get() use enum zs_stat_type for the stat type; however,
    some callers pass an enum fullness_group value. Change the type to int
    to reflect the actual use of the functions and get rid of
    'enum-conversion' warnings.

    Link: http://lkml.kernel.org/r/20170731175000.56538-1-mka@chromium.org
    Signed-off-by: Matthias Kaehlcke
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Doug Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • page_zone_id() is a specialized function to compare the zone for pages
    that are within the section range. If the sections of the pages are
    different, page_zone_id() can differ even if their zone is the same.
    This wrong usage doesn't cause any actual problem since
    __munlock_pagevec_fill() would be called again with the failed index.
    However, it's better to use a more appropriate function here.

    Link: http://lkml.kernel.org/r/1503559211-10259-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To avoid deviation, the per-cpu NUMA stat deltas in vm_numa_stat_diff[]
    are included when a user *reads* the NUMA stats.

    Since the NUMA stats are not read frequently by users, and the kernel
    does not need them to make decisions, making the readers more expensive
    is not a problem.
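
    A sketch of the reader-side folding (assuming a snapshot helper of this
    shape):

    static unsigned long zone_numa_state_snapshot(struct zone *zone,
                                                  enum numa_stat_item item)
    {
            long x = atomic_long_read(&zone->vm_numa_stat[item]);
            int cpu;

            /* fold in the not-yet-flushed per-cpu deltas at read time */
            for_each_online_cpu(cpu)
                    x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item];

            return x;
    }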

    Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Cc: Ying Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang