03 Aug, 2016

40 commits

  • The crc32 test function measures the elapsed time in nanoseconds, but
    uses 'struct timespec' for that. We want to remove timespec from the
    kernel for y2038 compatibility, and ktime_get_ns() also helps make the
    code simpler here.

    It is also slightly better to use monotonic time, as we are only
    interested in the time difference.

    Link: http://lkml.kernel.org/r/20160617143932.3289626-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: "David S . Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • When a large enough area in the iommu bitmap is found but would span a
    boundary we continue the search starting from the next bit position.
    For large allocations this can lead to several useless invocations of
    bitmap_find_next_zero_area() and iommu_is_span_boundary().

    Continue the search from the start of the next segment (which is the
    next bit position such that we'll not cross the same segment boundary
    again).
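
    The idea can be sketched in plain C as follows. This is a simplified
    model, not the kernel code: the map is a bool array rather than a
    bitmap, find_zero_area() and alloc_area() are hypothetical names, and
    nr is assumed to be at least 1.

```c
#include <stdbool.h>
#include <stddef.h>

/* First index >= start where nr consecutive entries of map[0..size)
 * are free, or size if there is no such run. */
size_t find_zero_area(const bool *map, size_t size, size_t start, size_t nr)
{
    for (size_t i = start; i + nr <= size; i++) {
        size_t run = 0;

        while (run < nr && !map[i + run])
            run++;
        if (run == nr)
            return i;
        i += run;   /* map[i + run] is in use; resume just past it */
    }
    return size;
}

/* Allocate nr entries that do not cross a multiple of boundary.
 * Returns the start index, or -1 if nothing fits. */
long alloc_area(bool *map, size_t size, size_t nr, size_t boundary)
{
    size_t start = 0;

    for (;;) {
        size_t idx = find_zero_area(map, size, start, nr);

        if (idx >= size)
            return -1;
        if (idx / boundary == (idx + nr - 1) / boundary) {
            for (size_t i = 0; i < nr; i++)
                map[idx + i] = true;
            return (long)idx;
        }
        /* The run spans a boundary: jump to the start of the next
         * segment instead of retrying at idx + 1. */
        start = (idx / boundary + 1) * boundary;
    }
}
```

    Retrying at idx + 1 would find runs that cross the same boundary again
    and again; jumping to the next segment start avoids those wasted calls.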

    Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1606081910070.3211@schleppi
    Signed-off-by: Sebastian Ott
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Ott
     
  • If a VCS is used, check whether it tracks the specified file, so
    that the -f option becomes optional.

    Link: http://lkml.kernel.org/r/7c86a8df0d48770c45778a43b6b3e4627b2a90ee.1469746395.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Add a "printk.devkmsg" kernel command line parameter which controls how
    userspace writes into /dev/kmsg. It has three options:

    * ratelimit - ratelimit logging from userspace.
    * on - unlimited logging from userspace
    * off - logging from userspace gets ignored

    The default setting is to ratelimit the messages written to it.

    This changes the kernel default setting of "on" to "ratelimit" and we do
    that because we want to keep userspace spamming /dev/kmsg to sane
    levels. This is especially relevant when a small kernel log buffer wraps
    around and messages get lost. So the ratelimiting setting should be a
    sane setting where kernel messages should have a bit higher chance of
    survival from all the spamming.

    It additionally does not limit logging to /dev/kmsg while the system is
    booting if we haven't disabled it on the command line.

    Furthermore, we can control the logging from a lower priority sysctl
    interface - kernel.printk_devkmsg.

    That interface will succeed only if printk.devkmsg *hasn't* been
    supplied on the command line. If it has, then printk.devkmsg is a
    one-time setting which remains for the duration of the system lifetime.
    This "locking" of the setting is to prevent userspace from changing the
    logging on us through sysctl(2).
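
    The precedence between the two interfaces can be modelled roughly as
    below. This is only a sketch of the described behaviour; the struct and
    function names are invented for the example and are not the kernel's.

```c
#include <stdbool.h>
#include <string.h>

enum devkmsg_mode { DEVKMSG_RATELIMIT, DEVKMSG_ON, DEVKMSG_OFF };

struct devkmsg_setting {
    enum devkmsg_mode mode;
    bool locked;    /* set once the command line has chosen a mode */
};

/* Parse "ratelimit", "on" or "off"; returns false for anything else. */
bool parse_devkmsg(const char *s, enum devkmsg_mode *out)
{
    if (strcmp(s, "ratelimit") == 0)
        *out = DEVKMSG_RATELIMIT;
    else if (strcmp(s, "on") == 0)
        *out = DEVKMSG_ON;
    else if (strcmp(s, "off") == 0)
        *out = DEVKMSG_OFF;
    else
        return false;
    return true;
}

/* Command-line path: applies the mode and locks it for good. */
bool set_from_cmdline(struct devkmsg_setting *st, const char *s)
{
    if (!parse_devkmsg(s, &st->mode))
        return false;
    st->locked = true;
    return true;
}

/* sysctl path: lower priority, refused once the setting is locked. */
bool set_from_sysctl(struct devkmsg_setting *st, const char *s)
{
    enum devkmsg_mode mode;

    if (st->locked || !parse_devkmsg(s, &mode))
        return false;
    st->mode = mode;
    return true;
}
```

    Once set_from_cmdline() has succeeded, every later set_from_sysctl()
    call fails and the mode stays as given at boot.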

    This patch is based on previous patches from Linus and Steven.

    [bp@suse.de: fixes]
    Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
    Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Extend the ratelimiting facility to print the amount of suppressed lines
    when it is being released.

    This use case is aimed at short-termed, burst-like users for which we
    want to output the suppressed lines stats only once, after it has been
    disposed of. For an example, see /dev/kmsg usage in a follow-on patch.

    Also, change the printk() line we issue on release to not use
    "callbacks" as it is misleading: we're not suppressing callbacks but
    printk() calls.
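
    A much-reduced model of "report the suppressed count on release" might
    look like this. It assumes a fixed burst with no time window, the names
    are hypothetical, and the message text is only illustrative.

```c
#include <stdio.h>

/* Simplified ratelimit state: allow burst messages, count the rest. */
struct burst_limit {
    int burst;
    int printed;
    int missed;
};

/* Returns 1 if the caller may print, 0 if the message is suppressed. */
int burst_limit_ok(struct burst_limit *rl)
{
    if (rl->printed < rl->burst) {
        rl->printed++;
        return 1;
    }
    rl->missed++;
    return 0;
}

/* On release, write a one-time summary into buf if anything was
 * suppressed, reset the counter, and return the number dropped. */
int burst_limit_release(struct burst_limit *rl, char *buf, size_t len)
{
    int missed = rl->missed;

    if (len)
        buf[0] = '\0';
    if (missed)
        snprintf(buf, len, "%d output lines suppressed", missed);
    rl->missed = 0;
    return missed;
}
```

    A short-lived user calls burst_limit_release() once when it is done,
    so the statistics appear a single time rather than per interval.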

    This has been separated from a previous patch by Linus.

    Link: http://lkml.kernel.org/r/20160716061745.15795-2-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Dave Young
    Cc: Franck Bui
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Move the DRIVER_NAME macro definition before the first usage site and
    fix build error.

    Link: http://lkml.kernel.org/r/20160801163937.GA28119@nazgul.tnic
    Signed-off-by: Borislav Petkov
    Reported-by: kbuild test robot
    Cc: Jean-Christophe Plagniol-Villard
    Cc: Tomi Valkeinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • asm-generic headers are generic implementations for architecture
    specific code and should not be included by common code. Thus use the
    asm/ version of sections.h to get at the linker sections.

    Link: http://lkml.kernel.org/r/1468285008-7331-1-git-send-email-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Messages' levels and console log level are inspected when the actual
    printing occurs, which may provoke console_unlock() and
    console_cont_flush() to waste CPU cycles on every message that has
    loglevel above the current console_loglevel.

    Schematically, console_unlock() does the following:

    console_unlock()
    {
            ...
            for (;;) {
                    ...
                    raw_spin_lock_irqsave(&logbuf_lock, flags);
    skip:
                    msg = log_from_idx(console_idx);

                    if (msg->flags & LOG_NOCONS) {
                            ...
                            goto skip;
                    }

                    level = msg->level;
                    len += msg_print_text();          >> sprintfs
                                                         memcpy,
                                                         etc.

                    if (nr_ext_console_drivers) {
                            ext_len = msg_print_ext_header(); >> scnprintf
                            ext_len += msg_print_ext_body();  >> scnprintfs
                                                                 etc.
                    }
                    ...
                    raw_spin_unlock(&logbuf_lock);

                    call_console_drivers(level, ext_text, ext_len, text, len)
                    {
                            if (level >= console_loglevel &&  >> drop the message
                                            !ignore_loglevel)
                                    return;

                            console->write(...);
                    }

                    local_irq_restore(flags);
            }
            ...
    }

    The thing here is this deferred `level >= console_loglevel' check. We
    are wasting CPU cycles on sprintfs/memcpy/etc. preparing the messages
    that we will eventually drop.

    This waste can be huge when we register a new CON_PRINTBUFFER console,
    for instance. For every such console register_console() resets the

    console_seq, console_idx, console_prev

    and sets an `exclusive console' pointer to replay the log buffer to that
    just-registered console. And there can be a lot of messages to replay,
    in the worst case most of which can be dropped after console_loglevel
    test.

    We know messages' levels long before we call msg_print_text() and
    friends, so we can just move console_loglevel check out of
    call_console_drivers() and format a new message only if we are sure that
    it won't be dropped.

    The patch factors out loglevel check into suppress_message_printing()
    function and tests message->level and console_loglevel before formatting
    functions in console_unlock() and console_cont_flush() are getting
    executed. This improves things not only for exclusive CON_PRINTBUFFER
    consoles, but for every console_unlock() that attempts to print a
    message of level above the console_loglevel.
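
    In miniature, the reordering amounts to testing the level before paying
    for any formatting. The sketch below is a userspace model: the globals
    mirror the names used above, while emit_message() and format_calls are
    hypothetical and exist only to make the saving visible.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel globals named in the text above. */
int console_loglevel = 4;
bool ignore_loglevel = false;
int format_calls;   /* counts how often formatting was actually done */

/* The level test, factored out into a helper. */
bool suppress_message_printing(int level)
{
    return level >= console_loglevel && !ignore_loglevel;
}

/* Test the level first and format only if the message will be shown.
 * Returns the formatted length, or 0 if the message was skipped. */
int emit_message(int level, const char *text, char *out, size_t len)
{
    if (suppress_message_printing(level))
        return 0;
    format_calls++;
    return snprintf(out, len, "<%d>%s", level, text);
}
```

    With console_loglevel at 4, a level-7 message returns before
    snprintf() runs, which is the saving the entry above describes.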

    Link: http://lkml.kernel.org/r/20160627135012.8229-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Petr Mladek
    Cc: Tejun Heo
    Cc: Jan Kara
    Cc: Calvin Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Using functions instead of macros can reduce overall code size by
    eliminating unnecessary "KERN_SOH" prefixes from format strings.

    defconfig x86-64:

    $ size vmlinux*
        text    data     bss      dec    hex filename
    10193570 4331464 1105920 15630954 ee826a vmlinux.new
    10192623 4335560 1105920 15634103 ee8eb7 vmlinux.old

    As the return values are unimportant and unused in the kernel tree, these
    new functions return void.

    Miscellanea:

    - change pr_ macros to call new __pr_ functions
    - change vprintk_nmi and vprintk_default to add LOGLEVEL_ argument

    [akpm@linux-foundation.org: fix LOGLEVEL_INFO, per Joe]
    Link: http://lkml.kernel.org/r/e16cc34479dfefcae37c98b481e6646f0f69efc3.1466718827.git.joe@perches.com
    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • A trivial cosmetic change: interrupt.h header is redundant since commit
    6b898c07cb1d ("console: use might_sleep in console_lock").

    Link: http://lkml.kernel.org/r/20160620132847.21930-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • kernel.h doesn't directly use dynamic debug, so include dynamic_debug.h
    in module.c instead (which used it via kernel.h). printk.h only uses
    it if CONFIG_DYNAMIC_DEBUG is on, so change the inclusion to happen
    only in that case.

    Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
    [luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
    Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
    Signed-off-by: Luis de Bethencourt
    Cc: Rusty Russell
    Cc: Hidehiro Kawai
    Cc: Borislav Petkov
    Cc: Michal Nazarewicz
    Cc: Rasmus Villemoes
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis de Bethencourt
     
  • Change task_work_cancel() to use lockless_dereference(); this is what
    the code really wants but we didn't have this helper when it was
    written.

    Also add the fast-path task->task_works == NULL check: in the likely
    case that this task has no pending works, we can avoid
    spin_lock(task->pi_lock).

    While at it, change other users of ACCESS_ONCE() to use READ_ONCE().
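
    The fast path can be pictured with the single-threaded model below.
    It is a sketch only: the lock is replaced by a counter so the effect
    is observable, READ_ONCE_PTR() is a local stand-in for a once-only
    read, and all type and function names are hypothetical.

```c
#include <stddef.h>

struct work {
    struct work *next;
    int id;
};

struct task {
    struct work *works;   /* pending list head */
    int lock_taken;       /* stand-in for the lock: counts acquisitions */
};

/* Read the pointer exactly once (local stand-in for READ_ONCE()). */
#define READ_ONCE_PTR(p) (*(struct work * volatile *)&(p))

/* Remove and return the first pending work with a matching id. */
struct work *work_cancel(struct task *t, int id)
{
    struct work **pprev = &t->works;
    struct work *w;

    /* Fast path: nothing pending, so skip the lock entirely. */
    if (READ_ONCE_PTR(t->works) == NULL)
        return NULL;

    t->lock_taken++;      /* lock would be taken here */
    while ((w = *pprev) != NULL) {
        if (w->id == id) {
            *pprev = w->next;
            break;
        }
        pprev = &w->next;
    }
    /* lock would be released here */
    return w;
}
```

    For a task with an empty list, work_cancel() returns without touching
    the lock counter at all.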

    Link: http://lkml.kernel.org/r/20160610150042.GA13868@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Andrea Parri
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • For a function that returns a pure boolean value, bool is a slightly
    better return type than int.

    Link: http://lkml.kernel.org/r/1469331815-2026-1-git-send-email-chengang@emindsoft.com.cn
    Signed-off-by: Chen Gang
    Acked-by: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • For one thing, map all non-umlaut spellings to the umlaut one
    (Linus Luessing -> Linus Lüssing).

    For another, maps obsolete email addresses to the current @c0d3.blue
    one.

    Link: http://lkml.kernel.org/r/1467805371-2773-1-git-send-email-linus.luessing@c0d3.blue
    Signed-off-by: Linus Lüssing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Lüssing
     
  • Don't use forward declarations of internal kernel structures in headers
    exported to userspace.

    Move "struct completion;".
    Move "struct task_struct;".

    Link: http://lkml.kernel.org/r/20160713215808.GA22486@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • When alloc_disk(0) is used the ->major number is completely ignored.
    All devices are allocated with a major of BLOCK_EXT_MAJOR.

    So remove registration and deregistration of 'major'.

    Link: http://lkml.kernel.org/r/20160602064318.4403.49955.stgit@noble
    Signed-off-by: NeilBrown
    Cc: Keith Busch
    Cc: Jens Axboe
    Cc: Maxim Levitsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • UML is a bit special since it does not have iomem nor dma. That means a
    lot of drivers will not build if they miss a dependency on HAS_IOMEM.
    s390 used to have the same issues but since it gained PCI support UML is
    the only stranger.

    We are tired of patching dozens of new drivers after every merge window
    just to un-break allmod/yesconfig UML builds. One could argue that a
    decent driver has to know on what it depends and therefore a missing
    HAS_IOMEM dependency is a clear driver bug. But the dependency is not
    obvious and not everyone does UML builds with COMPILE_TEST enabled when
    developing a device driver.

    A possible solution to make these builds succeed on UML would be
    providing stub functions for ioremap() and friends which fail at
    runtime. Another one is simply disabling COMPILE_TEST for UML. Since
    it is the least hassle and does not force us to fake iomem support
    let's do the latter.

    Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
    Signed-off-by: Richard Weinberger
    Acked-by: Arnd Bergmann
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Suppress a bunch of warnings of the form:

    fs/proc/task_mmu.c: In function 'show_smap_vma_flags':
    fs/proc/task_mmu.c:635:22: warning: initialized field overwritten [-Woverride-init]
    [ilog2(VM_READ)] = "rd",
    ^~~~
    fs/proc/task_mmu.c:635:22: note: (near initialization for 'mnemonics[0]')

    They happen because of the way we intentionally build the table, so
    silence the warning when building with 'make W=1'.
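
    For readers unfamiliar with the pattern, here is a standalone example
    of a table built this way. The flag positions and names are invented,
    and the pragma is just one way to quiet the warning in a single file;
    it is not necessarily how the change above silences it.

```c
#define FLAG_READ  0    /* hypothetical bit positions, not the kernel's */
#define FLAG_WRITE 1
#define NR_FLAGS   8

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Woverride-init"
/* Fill every slot with a default, then override the known ones. The
 * second initializer for a slot is what triggers the warning. */
const char *const mnemonics[NR_FLAGS] = {
    [0 ... NR_FLAGS - 1] = "??",    /* GNU C range designator */
    [FLAG_READ]  = "rd",
    [FLAG_WRITE] = "wr",
};
#pragma GCC diagnostic pop

/* Two-letter name for a flag bit, or "??" if it has none. */
const char *flag_name(unsigned int bit)
{
    return bit < NR_FLAGS ? mnemonics[bit] : "??";
}
```

    The override is deliberate here: the range initializer provides the
    fallback and the later entries replace it where a name is known.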

    Link: http://lkml.kernel.org/r/8727.1470022083@turing-police.cc.vt.edu
    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • /proc/stat shows (among lots of other things) the current boottime (i.e.
    number of seconds since boot). While a 32-bit number is sufficient for
    this particular case, we want to get rid of 'struct timespec', which
    suffers from a 32-bit overflow in 2038.

    This changes the code to use a struct timespec64, which is known to be
    safe in all cases.
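
    To make the limit concrete, the sketch below uses a hypothetical
    miniature of a 64-bit-seconds timestamp (struct ts64 is not a kernel
    type) and checks whether a value would still fit a signed 32-bit
    seconds field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of a timestamp with 64-bit seconds. */
struct ts64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* True if the seconds value would still fit a signed 32-bit field. */
bool fits_in_32bit_seconds(const struct ts64 *ts)
{
    return ts->tv_sec >= INT32_MIN && ts->tv_sec <= INT32_MAX;
}
```

    INT32_MAX seconds after the Unix epoch falls in January 2038; one
    second later the value no longer fits, which is the overflow in
    question.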

    Link: http://lkml.kernel.org/r/20160617201247.2292101-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • This was needed before to ensure that ->signal != 0 and do_each_thread()
    is safe, see commit b95c35e76b29b ("oom: fix the unsafe usage of
    badness() in proc_oom_score()") for details.

    Today tsk->signal can't go away and for_each_thread(tsk) is always safe.

    Link: http://lkml.kernel.org/r/20160608211921.GA15508@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Salah Triki and Luis de Bethencourt are taking over maintainership of
    befs.

    Link: http://lkml.kernel.org/r/1469651079-32455-1-git-send-email-luisbg@osg.samsung.com
    Signed-off-by: Luis de Bethencourt
    Acked-by: Greg Kroah-Hartman
    Acked-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis de Bethencourt
     
  • The cgroup documentation path has changed to "cgroup-v1". Update it.

    Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     
  • handle_object_size_mismatch() used %pk to format a kernel pointer with
    pr_err(). This seemed to be a misspelling for %pK, but using this to
    format a kernel pointer does not make much sense here.

    Therefore use %p instead, like in handle_missaligned_access().

    Link: http://lkml.kernel.org/r/20160730083010.11569-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Acked-by: Andrey Ryabinin
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     
  • Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
    to receive rcu freeing function but used generic ipc_rcu_free() instead
    of msg_rcu_free() which does security cleaning.

    Running LTP msgsnd06 with kmemleak gives the following:

    cat /sys/kernel/debug/kmemleak

    unreferenced object 0xffff88003c0a11f8 (size 8):
    comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
    hex dump (first 8 bytes):
    1b 00 00 00 01 00 00 00 ........
    backtrace:
    kmemleak_alloc+0x23/0x40
    kmem_cache_alloc_trace+0xe1/0x180
    selinux_msg_queue_alloc_security+0x3f/0xd0
    security_msg_queue_alloc+0x2e/0x40
    newque+0x4e/0x150
    ipcget+0x159/0x1b0
    SyS_msgget+0x39/0x40
    entry_SYSCALL_64_fastpath+0x13/0x8f

    Manfred Spraul suggested fixing sem.c as well, and Davidlohr Bueso
    suggested using ipc_rcu_free only in case of security allocation
    failure in newary().

    Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
    Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We must call shrink_slab() for each memory cgroup on both global and
    memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
    changed that so that now shrink_slab() is only called with memcg != NULL
    on memcg reclaim. As a result, memcg-aware shrinkers (including
    dentry/inode) are never invoked on global reclaim. Fix that.

    Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
    Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Radix trees may be used not only for storing page cache pages, so
    unconditionally accounting radix tree nodes to the current memory cgroup
    is bad: if a radix tree node is used for storing data shared among
    different cgroups we risk pinning dead memory cgroups forever.

    So let's only account radix tree nodes if it was explicitly requested by
    passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
    account page cache entries, so mark mapping->page_tree so.

    Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup")
    Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If the total amount of memory assigned to quarantine is less than the
    amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
    may overflow. Instead, set it to zero.
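
    The arithmetic hazard is the usual unsigned wrap-around. A minimal
    sketch, with a hypothetical function name and plain size_t values in
    place of the kernel's variables:

```c
#include <stddef.h>

/* Global quarantine budget left after the per-cpu share is taken out.
 * With unsigned arithmetic, total - percpu would wrap around to a huge
 * value when percpu > total, so clamp to zero instead. */
size_t global_quarantine_size(size_t total, size_t percpu)
{
    if (total < percpu)
        return 0;
    return total - percpu;
}
```

    Without the clamp, a small-memory system would end up with an
    enormous quarantine limit rather than none.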

    [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
    Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Signed-off-by: Alexander Potapenko
    Reported-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently we just dump stack in case of double free bug.
    Let's dump all info about the object that we have.

    [aryabinin@virtuozzo.com: change double free message per Alexander]
    Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The state of an object is currently tracked in two places: shadow
    memory and the ->state field in struct kasan_alloc_meta. We can get
    rid of the latter. This will save us a little bit of memory. Also,
    this allows us to move the free stack into struct kasan_alloc_meta
    without increasing memory consumption. So now we should always know
    when the object was last freed. This may be useful for long-delayed
    use-after-free bugs.

    As a side effect this fixes the following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The size of a slab object is already stored in cache->object_size.

    Note that kmalloc() internally rounds up the allocation size, so
    object_size may not be equal to alloc_size, but usually we don't need
    to know the exact size of the allocated object. If we do need that
    information, we can still figure it out from the report. The dump of
    shadow memory allows us to identify the end of the allocated memory,
    and thereby the exact allocation size.

    Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • SLUB doesn't require disabled interrupts to call ___cache_free().

    Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied
    by __GFP_RECLAIM) allocations. So, basically we call it on almost every
    allocation. quarantine_reduce() is sometimes a heavy operation, and
    calling it with interrupts disabled may trigger a hard LOCKUP:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
    irq event stamp: 1411258
    Call Trace:
    dump_stack+0x68/0x96
    watchdog_overflow_callback+0x15b/0x190
    __perf_event_overflow+0x1b1/0x540
    perf_event_overflow+0x14/0x20
    intel_pmu_handle_irq+0x36a/0xad0
    perf_event_nmi_handler+0x2c/0x50
    nmi_handle+0x128/0x480
    default_do_nmi+0xb2/0x210
    do_nmi+0x1aa/0x220
    end_repeat_nmi+0x1a/0x1e
    <> __kernel_text_address+0x86/0xb0
    print_context_stack+0x7b/0x100
    dump_trace+0x12b/0x350
    save_stack_trace+0x2b/0x50
    set_track+0x83/0x140
    free_debug_processing+0x1aa/0x420
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100

    Call quarantine_reduce() iff direct reclaim is allowed.
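
    The shape of the check is roughly the following. The flag bits here
    are invented for illustration and do not match the kernel's GFP
    values; quarantine_reduce_stub() and on_alloc() are hypothetical.

```c
/* Hypothetical flag bits for illustration; not the kernel's values. */
#define F_DIRECT_RECLAIM 0x1u
#define F_KSWAPD_RECLAIM 0x2u
#define F_RECLAIM        (F_DIRECT_RECLAIM | F_KSWAPD_RECLAIM)

int reduce_calls;   /* how often the heavy path was taken */

/* Stand-in for the potentially expensive quarantine shrink. */
void quarantine_reduce_stub(void)
{
    reduce_calls++;
}

/* Do the heavy work only when the caller allows direct reclaim,
 * i.e. when it is in a context that may block. */
void on_alloc(unsigned int flags)
{
    if (flags & F_DIRECT_RECLAIM)
        quarantine_reduce_stub();
}
```

    An allocation that only permits background (kswapd) reclaim no longer
    reaches the heavy path.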

    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Once an object is put into quarantine, we no longer own it, i.e. the
    object could leave the quarantine and be reallocated. So calling
    set_track() after quarantine_put() may corrupt slab objects.

    BUG kmalloc-4096 (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------
    Disabling lock debugging due to kernel taint
    INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b
    ...
    INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc+0x109/0x3e0
    mmap_region+0x53e/0xe40
    do_mmap+0x70f/0xa50
    vm_mmap_pgoff+0x147/0x1b0
    SyS_mmap_pgoff+0x2c7/0x5b0
    SyS_mmap+0x1b/0x30
    do_syscall_64+0x1a0/0x4e0
    return_from_SYSCALL_64+0x0/0x7a
    INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x (null) flags=0x8000000000004080
    INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588
    Redzone ffff8804540de840: bb bb bb bb bb bb bb bb ........
    Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc kkkkkkkk.R....`.

    Similarly, poisoning after the quarantine_put() leads to false positive
    use-after-free reports:

    BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0
    Read of size 8 by task trinity-c0/3036
    CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
    Call Trace:
    dump_stack+0x68/0x96
    kasan_report_error+0x222/0x600
    __asan_report_load8_noabort+0x61/0x70
    anon_vma_interval_tree_insert+0x304/0x430
    anon_vma_chain_link+0x91/0xd0
    anon_vma_clone+0x136/0x3f0
    anon_vma_fork+0x81/0x4c0
    copy_process.part.47+0x2c43/0x5b20
    _do_fork+0x16d/0xbd0
    SyS_clone+0x19/0x20
    do_syscall_64+0x1a0/0x4e0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by putting an object in the quarantine after all other
    operations.

    Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
    Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Reported-by: Vegard Nossum
    Reported-by: Sasha Levin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is just pointless wasting of cycles and
    cache line bouncing during heavy parallel reclaim on large machines.
    The particular machine wasn't very healthy and was most probably
    suffering from a memory leak which just caused the memory reclaim to
    thrash heavily. But bouncing on the lock certainly didn't help...

    Fix this with an optimistic lockless check and bail out early if the
    tree is empty. This is theoretically racy but that shouldn't matter all
    that much. First of all soft limit is a best effort feature and it is
    slowly getting deprecated and its usage should be really scarce.
    Bouncing on a lock without a good reason is surely a much bigger
    problem, especially on large CPU machines.

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
    runs his database load with memory online and offline running in
    parallel. The reason is that huge_pmd_share might detect a shared pmd
    which is currently being migrated and so has a migration pte which is
    !pte_huge.

    There doesn't seem to be any easy way to prevent the race and in
    fact seeing the migration swap entry is not harmful. Both callers of
    huge_pte_alloc are prepared to handle them. copy_hugetlb_page_range
    will copy the swap entry and make it COW if needed. hugetlb_fault will
    back off and so the page fault is retried if the page is still under
    migration and waits for its completion in hugetlb_fault.

    That means that the BUG_ON is wrong and we should update it. Let's
    simply check that all present ptes are pte_huge instead.
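
    The relaxed condition can be expressed as a small predicate. This is a
    model only: struct mini_pte and violates_huge_rule() are hypothetical
    and reduce a pte to the two properties the check looks at.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature pte: just the two properties the check reads. */
struct mini_pte {
    bool present;
    bool huge;
};

/* True when the entry violates "every present pte is huge". A
 * non-present entry (e.g. a migration entry) is tolerated. */
bool violates_huge_rule(const struct mini_pte *pte)
{
    return pte && pte->present && !pte->huge;
}
```

    A migration entry is not present, so it passes; only a present,
    non-huge entry would still trip the assertion.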

    Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • On powerpc servers with large memory (32TB), we observed several soft
    lockups for hugepage under stress tests.

    The call traces are as follows:
    1.
    get_page_from_freelist+0x2d8/0xd50
    __alloc_pages_nodemask+0x180/0xc20
    alloc_fresh_huge_page+0xb0/0x190
    set_max_huge_pages+0x164/0x3b0

    2.
    prep_new_huge_page+0x5c/0x100
    alloc_fresh_huge_page+0xc8/0x190
    set_max_huge_pages+0x164/0x3b0

    This patch fixes such soft lockups. It is safe to call cond_resched()
    there because it is outside the spin_lock/unlock section.

    Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Paul Gortmaker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia He
     
  • Apparently, the tools/testing version dates to a few flags ago, and
    we've sprouted 4 new ones since. Keep in sync with the value in the
    main tree...

    Link: http://lkml.kernel.org/r/23400.1469702675@turing-police.cc.vt.edu
    Signed-off-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Every swapped-in anonymous page starts from the head of the inactive
    LRU list. It should be activated unconditionally when the VM decides to
    reclaim because the page table entry for the page almost always has the
    accessed bit set. Thus, its window for getting a new reference is
    2 * NR_inactive + NR_active, while for other pages it is
    NR_inactive + NR_active.

    It's not fair that it has a better chance of being referenced than
    other newly allocated pages, which start from the head of the active
    LRU list.

    Johannes:

    : The page can still have a valid copy on the swap device, so preferring to
    : reclaim that page over a fresh one could make sense. But as you point
    : out, having it start inactive instead of active actually ends up giving it
    : *more* LRU time, and that seems to be without justification.

    Rik:

    : The reason newly read in swap cache pages start on the inactive list is
    : that we do some amount of read-around, and do not know which pages will
    : get used.
    :
    : However, immediately activating the ones that DO get used, like your patch
    : does, is the right thing to do.

    Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Nadav Amit
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • I ran into this:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
    in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
    2 locks held by trinity-c1/1434:
    #0: (&mm->mmap_sem){......}, at: [] __do_page_fault+0x1ce/0x8f0
    #1: (rcu_read_lock){......}, at: [] filemap_map_pages+0xd6/0xdd0

    CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x65/0x84
    panic+0x185/0x2dd
    ___might_sleep+0x51c/0x600
    __might_sleep+0x90/0x1a0
    __alloc_pages_nodemask+0x5b1/0x2160
    alloc_pages_current+0xcc/0x370
    pte_alloc_one+0x12/0x90
    __pte_alloc+0x1d/0x200
    alloc_set_pte+0xe3e/0x14a0
    filemap_map_pages+0x42b/0xdd0
    handle_mm_fault+0x17d5/0x28b0
    __do_page_fault+0x310/0x8f0
    trace_do_page_fault+0x18d/0x310
    do_async_page_fault+0x27/0xa0
    async_page_fault+0x28/0x30

    The important bit from the above is that filemap_map_pages() is calling
    into the page allocator while holding rcu_read_lock (sleeping is not
    allowed inside RCU read-side critical sections).

    According to Kirill Shutemov, the prefaulting code in do_fault_around()
    is supposed to take care of this, but missing error handling means that
    the allocation failure can go unnoticed.

    We don't need to return VM_FAULT_OOM (or any other error) here, since we
    can just let the normal fault path try again.

    Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Kirill A. Shutemov
    Cc: "Hillf Danton"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum