24 May, 2016

40 commits

  • shmat and shmdt rely on mmap_sem for write. If the waiting task gets
    killed by the oom killer, it blocks oom_reaper from asynchronously
    reclaiming the address space and reduces the chances of timely OOM
    resolution. Wait for the lock in killable mode and return EINTR if the
    task is killed while waiting.
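
    A minimal sketch of the resulting pattern, assuming a syscall path
    that previously took the lock with a plain down_write() (context
    simplified, not the exact ipc/shm.c code):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;  /* fatal signal pending; let the task die */

        /* ... modify the address space ... */

        up_write(&mm->mmap_sem);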

    Signed-off-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • dup_mmap needs to lock the mmap_sem of current's mm for write. If the
    waiting task gets killed by the oom killer, it blocks oom_reaper from
    asynchronously reclaiming the address space and reduces the chances of
    timely OOM resolution. Wait for the lock in killable mode and return
    EINTR if the task is killed while waiting.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • CLEAR_REFS_MM_HIWATER_RSS and CLEAR_REFS_SOFT_DIRTY rely on mmap_sem
    for write. If the waiting task gets killed by the oom killer while
    operating on its own mm, it blocks oom_reaper from asynchronously
    reclaiming the address space and reduces the chances of timely OOM
    resolution. Wait for the lock in killable mode and return EINTR if the
    task is killed while waiting. This also expedites the return to
    userspace and do_exit even if the mm is remote.

    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Acked-by: Vlastimil Babka
    Cc: Petr Cermak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now that all the callers handle vm_brk failure, we can change it to
    wait for mmap_sem in killable mode, so that oom_reaper does not get
    blocked just because vm_brk is stuck behind mmap_sem readers.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • load_elf_library doesn't handle vm_brk failure, although nothing
    indicates it cannot: the function is already allowed to fail due to
    vm_mmap failures. This may not be a problem now, but a later patch will
    make vm_brk killable (i.e. waiting for mmap_sem for write will become
    killable), so the failure will become more probable.
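
    A hedged sketch of the caller-side change, assuming vm_brk still
    returns an address-or-error value checked via IS_ERR_VALUE (variable
    and label names are illustrative, not the exact binfmt_elf code):

        retval = vm_brk(start, len);
        if (IS_ERR_VALUE(retval))
                goto out_free;  /* fail the library load, don't continue */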

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • vm_brk is allowed to fail, but load_aout_binary simply ignores the
    error and happily continues. I haven't noticed any problem from that
    in real life, but later patches will make the failure more likely
    because vm_brk will become killable (i.e. waiting for mmap_sem for
    write will become killable), so we should be more careful now.

    The error handling should be quite straightforward because there are
    already calls to vm_mmap which check the error properly. The only
    notable exception is set_brk, which is called after the beyond_if
    label. But nothing indicates that it cannot be moved above set_binfmt:
    the two do not depend on each other, and this way we fail before
    set_binfmt alters reference counting.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Almost all current users of vm_munmap ignore the return value, so they
    do not handle potential errors. This means that some VMAs might stay
    behind. This patch doesn't try to solve those potential problems.
    Quite the contrary, it adds a new failure mode by using
    down_write_killable in vm_munmap. This should be safer than other
    failure modes, though, because the process is guaranteed to die as
    soon as it leaves the kernel, and exit_mmap will clean up the whole
    address space.

    This will help in OOM conditions, when the oom victim might be stuck
    waiting for mmap_sem for write, which in turn can block oom_reaper,
    which relies on mmap_sem for read to make forward progress and reclaim
    the address space of the victim.

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Alexander Viro
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • All the callers of vm_mmap seem to check for failure already and bail
    out in one way or another on error, which means that we can change it
    to use the killable version of vm_mmap_pgoff and return -EINTR if the
    current task gets killed while waiting for mmap_sem. This also means
    that vm_mmap_pgoff can be killable by default and drop the additional
    parameter.

    This will help in OOM conditions, when the oom victim might be stuck
    waiting for mmap_sem for write, which in turn can block oom_reaper,
    which relies on mmap_sem for read to make forward progress and reclaim
    the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error in the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return
    to userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up to the oom_reaper work [1]. As asynchronous OOM
    killing depends on mmap_sem for read, we would really appreciate it if
    a holder for write did not stand in the way. This patchset changes
    many down_write calls to be killable, to help the cases where the
    writer is blocked waiting for readers to release the lock, and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held in
    shallow syscall paths, where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never reach userspace,
    as the task has a fatal signal pending). Others seem easy as well,
    because the callers already handle fatal errors, bail out, and return
    to userspace, which should be sufficient to handle the failure
    gracefully. I am not familiar with all those code paths, so a deeper
    review is really appreciated.

    As this work touches several areas which are not directly connected, I
    have tried to keep the CC list as small as possible; people who I
    believed would be familiar are CCed only on the specific patches (all
    should have received the cover letter, though).

    This patchset is based on linux-next and depends on
    down_write_killable for rw_semaphores, which was merged into the tip
    locking/rwsem branch and is part of the next tree. I guess it would be
    easiest to route these patches via mmotm because of the dependency on
    the tip tree, but if the respective maintainers prefer another way, I
    have no objections.

    I haven't covered all of the down_write(mm->mmap_sem) instances here:

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones, which take the lock early after entering
    the syscall and do not change any state before doing so.

    Therefore it is very easy to change them to use down_write_killable
    and immediately return with -EINTR. This will allow the waiter to pass
    away without blocking mmap_sem, which might be required for others to
    make forward progress. E.g. the oom reaper will need the lock for
    reading to dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has
    many call sites via vm_mmap. To reduce the risk, keep vm_mmap with the
    original non-killable semantics for now.

    vm_munmap callers do not bother to check the return value, so for
    simplicity open-code the killable variant into the munmap syscall path
    for now.
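
    A rough sketch of that open-coded syscall path (details such as
    profiling hooks omitted; not the exact mm/mmap.c code):

        SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
        {
                int ret;
                struct mm_struct *mm = current->mm;

                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;
                ret = do_munmap(mm, addr, len);
                up_write(&mm->mmap_sem);
                return ret;
        }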

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Add myself as a co-maintainer for scripts/gdb, supporting Jan Kiszka.

    Link: http://lkml.kernel.org/r/fb5d34ce563f33d2f324f26f592b24ded30032ee.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • The recent fixes to lx-dmesg now allow the command to print
    successfully on Python 3; however, the Python interpreter wraps the
    bytes for each line with a b'' marker.

    To remove this, we need to decode the line; .decode() defaults to
    'UTF-8'.

    Link: http://lkml.kernel.org/r/d67ccf93f2479c94cb3399262b9b796e0dbefcf2.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Acked-by: Dom Cote
    Tested-by: Dom Cote
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • When built against Python 3, GDB differs in the return type of its
    read_memory function, causing the lx-dmesg command to fail.

    Now that we have an improved read_u16() we can use the new
    read_memoryview() abstraction to make lx-dmesg return valid data on
    both current Python APIs.

    Tested with Python 3.4 and 2.7.
    Tested with gdb 7.7.

    Link: http://lkml.kernel.org/r/28477b727ff7fe3101fd4e426060e8a68317a639.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Dom Cote
    [kieran@bingham.xyz: Adjusted commit log to better reflect code changes]
    Tested-by: Kieran Bingham (Py2.7,Py3.4,GDB10)
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dom Cote
     
  • Change the read_u16 function so it accepts both 'str' and 'bytes' as
    the type of its argument.

    When calling read_memory() from the gdb API, the format used to return
    the data differs depending on whether gdb was built with Python 2.7 or
    3.X ('str' for 2.7, 'bytes' for 3.X).

    Add a read_memoryview() function to be able to get a 'memoryview'
    object back from read_memory() with both Python 2.7 and 3.X.

    Tested with Python 3.4 and 2.7.
    Tested with gdb 7.7.

    Link: http://lkml.kernel.org/r/73621f564503137a002a639d174e4fb35f73f462.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Dom Cote
    Tested-by: Kieran Bingham (Py2.7,Py3.4,GDB10)
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dom Cote
     
  • The tasks module already provides helpers to find the task struct by
    PID, and the thread_info by task struct; however, these are cumbersome
    to use on the gdb command line.

    Wrap the two together in a single extra helper that allows exploring
    the thread_info from a PID value.

    Link: http://lkml.kernel.org/r/dadc5667f053ec811eb3e3033d99d937fedbc93b.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Provide a worked example for utilising the lx_radix_tree_lookup function

    Link: http://lkml.kernel.org/r/e786008ac5aec4b84198812805b326d718bdeb4b.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Linux makes use of the radix tree data structure to store pointers
    indexed by integer values. This structure is utilised across many
    parts of the kernel, including the IRQ descriptor tables and several
    filesystems.

    This module provides a method to lookup values from a structure given
    its head node.

    Usage:

    The function lx_radix_tree_lookup must be given a symbol of type
    struct radix_tree_root, and an index into that tree.

    The object returned is a generic integer value, and must be cast
    correctly to the type based on the storage in the data structure.

    For example, to print the irq descriptor in the sparse irq_desc_tree at
    index 18, try the following:

    (gdb) print (struct irq_desc)$lx_radix_tree_lookup(irq_desc_tree, 18)

    Link: http://lkml.kernel.org/r/d2028c55e50cf95a9b7f8ca0d11885174b0cc709.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • We won't see more than 2 billion CPUs any time soon, and having cpu_list
    return long makes the output of lx-cpus a bit ugly.

    Link: http://lkml.kernel.org/r/dcb45c3b0a59e0fd321fa56ff7aa398458c689b3.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kiszka
     
  • The Linux kernel provides macros for iterating over the values in the
    cpu_list masks. By providing some commonly used masks, we can mirror
    the kernel's helper macros with easy-to-use generators.

    Link: http://lkml.kernel.org/r/d045c6599771ada1999d49612ee30fd2f9acf17f.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • lx-mounts will identify current mount points based on the 'init_task'
    namespace by default, as we do not yet have a kernel thread list
    implementation to select the currently running thread.

    Optionally, a user can specify a PID to list from that process'
    namespace

    Link: http://lkml.kernel.org/r/e614c7bc32d2350b4ff1627ec761a7148e65bfe6.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Provide iomem_resource and ioports_resource printers and command
    hooks.

    It can be quite interesting to halt the kernel as it boots and watch
    this list as it is being populated.

    It should also be useful in the event that a kernel is not booting, as
    you can identify which memory resources have been registered.

    Link: http://lkml.kernel.org/r/f0a6b9fa9c92af4d7ed2e7343ccc84150e9c6fc5.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Walk the VFS entries, prepending the iname strings to generate a full
    VFS path name from a dentry.

    Link: http://lkml.kernel.org/r/4328fdb2d15ba7f1b21ad21c2eecc38d9cfc4d13.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • If CONFIG_MODULES is not enabled, lx-lsmod tries to find a non-existent
    symbol and generates an unfriendly traceback:

    (gdb) lx-lsmod
    Address Module Size Used by
    Traceback (most recent call last):
    File "scripts/gdb/linux/modules.py", line 75, in invoke
    for module in module_list():
    File "scripts/gdb/linux/modules.py", line 24, in module_list
    module_ptr_type = module_type.get_type().pointer()
    File "scripts/gdb/linux/utils.py", line 28, in get_type
    self._type = gdb.lookup_type(self._name)
    gdb.error: No struct type named module.
    Error occurred in Python command: No struct type named module.

    Catch the error and return an empty module_list() for a clean command
    output as follows:

    (gdb) lx-lsmod
    Address Module Size Used by
    (gdb)

    Link: http://lkml.kernel.org/r/94d533819437408b85ae5864f939dd7ca6fbfcd6.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • If we attempt to read a value that is not available to GDB, an exception
    is raised. Most of the time, this is a good thing; however on occasion
    we will want to be able to determine if a symbol is available.

    By catching the exception and simply returning None, we can determine
    whether we tried to read an invalid value, without the exception
    taking the execution context away from us.

    Link: http://lkml.kernel.org/r/c72b25c06fc66e1d68371154097e2cbb112555d8.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Simplify the module list functions with the new list_for_each_entry
    abstractions

    Link: http://lkml.kernel.org/r/ad0101c9391088608166fcec26af179868973d86.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Facilitate walking linked lists by providing a generator that returns
    the dereferenced and type-cast objects from a kernel linked list.

    Link: http://lkml.kernel.org/r/2b0998564e6e5abe53585d466f87e491331fd2a4.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Cc: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • Some macros and defines are needed when parsing memory, and unless the
    kernel is compiled with -g3 they are not available in the debug
    symbols.

    We use the pre-processor here to extract constants into a dedicated
    module for the Linux debugger extensions.

    The top-level Kbuild is used to call in and generate the constants
    file, while maintaining dependencies on autogenerated files in
    include/generated.

    Link: http://lkml.kernel.org/r/bc3df9c25f57ea72177c066a51a446fc19e2c27f.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Kieran Bingham
    Signed-off-by: Jan Kiszka
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kieran Bingham
     
  • This takes the MODULE_REF_BASE into account.

    Link: http://lkml.kernel.org/r/d926d2d54caa034adb964b52215090cbdb875249.1462865983.git.jan.kiszka@siemens.com
    Signed-off-by: Jan Kiszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kiszka
     
  • This option was replaced by PAGE_COUNTER which is selected by MEMCG.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Arnd Bergmann
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Use kmemdup when another buffer is immediately copied into the
    allocated region. It replaces an allocation followed by a memcpy with
    a single call to kmemdup.
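
    An illustrative before/after (buffer names hypothetical):

        /* before: allocate, then copy */
        p = kmalloc(len, GFP_KERNEL);
        if (!p)
                return -ENOMEM;
        memcpy(p, src, len);

        /* after: one call expresses the same intent */
        p = kmemdup(src, len, GFP_KERNEL);
        if (!p)
                return -ENOMEM;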

    [akpm@linux-foundation.org: remove unneeded cast to void*]
    Link: http://lkml.kernel.org/r/1463665743-16269-1-git-send-email-falakreyaz@gmail.com
    Signed-off-by: Muhammad Falak R Wani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muhammad Falak R Wani
     
  • The first version of this patch was posted to LKML by Ben Hutchings
    ~6 months ago, but no further action was taken.

    Ben's original message:

    : rtsx_usb_ms creates a task that mostly sleeps, but tasks in
    : uninterruptible sleep still contribute to the load average (for
    : bug-compatibility with Unix). A load average of ~1 on a system that
    : should be idle is somewhat alarming.
    :
    : Change the sleep to be interruptible, but still ignore signals.
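
    A minimal sketch of the idea (the driver's actual wait loop is more
    involved): only uninterruptible sleeps count toward the load average,
    and the kthread has no signals to deliver, so behaviour is otherwise
    unchanged.

        /* before: counts toward the load average */
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule();

        /* after: does not, and the kthread ignores signals anyway */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();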

    References: https://bugs.debian.org/765717
    Link: http://lkml.kernel.org/r/b49f95ae83057efa5d96f532803cba47@natalenko.name
    Signed-off-by: Oleksandr Natalenko
    Cc: Oleg Nesterov
    Cc: Ben Hutchings
    Cc: Lee Jones
    Cc: Wolfram Sang
    Cc: Roger Tseng
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleksandr Natalenko
     
  • Lots of little changes were needed to clean these up: remove the
    four-byte pointer assumption and traverse the pid queue properly.
    Also consolidate the traceback code into a single function instead of
    having three copies of it.

    Link: http://lkml.kernel.org/r/1462926655-9390-1-git-send-email-minyard@acm.org
    Signed-off-by: Corey Minyard
    Acked-by: Baoquan He
    Cc: Vivek Goyal
    Cc: Haren Myneni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • …unprotect)_crashkres()

    Commit 3f625002581b ("kexec: introduce a protection mechanism for the
    crashkernel reserved memory") introduced a mechanism for protecting
    the crash kernel reserved memory similar to the previous
    crash_map/unmap_reserved_pages() implementation; the new one is more
    generic in name and cleaner in code (besides, some arches may not be
    allowed to unmap the pgtable).

    Therefore, this patch consolidates them, using the new
    arch_kexec_protect(unprotect)_crashkres() to replace the former
    crash_map/unmap_reserved_pages(), which by now is used only by s390.

    The consolidation requires the crash memory to be mapped initially;
    this is done in machine_kdump_pm_init(), which runs after
    reserve_crashkernel(). Once the kdump kernel is loaded, the new
    arch_kexec_protect_crashkres() implemented for s390 actually unmaps
    the pgtable as before.

    Signed-off-by: Xunlei Pang
    Signed-off-by: Michael Holzheu
    Acked-by: Michael Holzheu
    Cc: Heiko Carstens
    Cc: "Eric W. Biederman"
    Cc: Minfei Huang
    Cc: Vivek Goyal
    Cc: Dave Young
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • There is a lot of work to be done in the kexec_load function: not only
    allocating structs and loading the initramfs, but also various
    miscellaneous tasks.

    To make it clearer, wrap a new function, do_kexec_load, which
    allocates the structs and loads the initramfs, and do the preparatory
    work in kexec_load.

    Signed-off-by: Minfei Huang
    Cc: Vivek Goyal
    Cc: "Eric W. Biederman"
    Cc: Xunlei Pang
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • For some architectures, kexec must map the reserved pages and then use
    them when we try to start the kdump service.

    kexec may return directly, without unmapping the reserved pages, if it
    fails while starting the service. To fix this, we make a matched pair
    of map/unmap calls in both the normal path and the error path.

    This patch only affects s390. Other architectures don't implement the
    crash_unmap_reserved_pages and crash_map_reserved_pages interface.

    It isn't an urgent patch. The kernel works without any risk even
    though the reserved pages are not unmapped before returning in the
    error path.
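
    A hedged sketch of the pairing (the helper name is illustrative, not
    the actual kexec_load flow):

        crash_map_reserved_pages();
        ret = load_crash_segments(image);  /* hypothetical helper */
        crash_unmap_reserved_pages();      /* runs on success and error */
        return ret;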

    Signed-off-by: Minfei Huang
    Cc: Vivek Goyal
    Cc: "Eric W. Biederman"
    Cc: Xunlei Pang
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • Implement the protection method for the crash kernel memory reservation
    for the 64-bit x86 kdump.
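
    One plausible shape for the x86_64 hooks, assuming the reserved region
    is described by crashk_res and covered by the kernel direct mapping (a
    hedged sketch, not necessarily the exact implementation):

        void arch_kexec_protect_crashkres(void)
        {
                int nr_pages =
                        (crashk_res.end - crashk_res.start + 1) >> PAGE_SHIFT;

                set_memory_ro((unsigned long)__va(crashk_res.start), nr_pages);
        }

        void arch_kexec_unprotect_crashkres(void)
        {
                int nr_pages =
                        (crashk_res.end - crashk_res.start + 1) >> PAGE_SHIFT;

                set_memory_rw((unsigned long)__va(crashk_res.start), nr_pages);
        }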

    Signed-off-by: Xunlei Pang
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Minfei Huang
    Cc: Vivek Goyal
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • If some kernel (or module) code path stamps on the crash reserved
    memory (which is already mapped by the kernel) where the second
    kernel's data has been loaded, the kdump kernel will probably fail to
    boot when a panic happens (or even when it doesn't), leaving the
    culprit at large; this is unacceptable.

    The patch introduces a mechanism for detecting such cases:

    1) After each crash kexec load, it simply marks the reserved memory
    regions read-only, since we no longer access them after that. When
    something stamps on a region, the first kernel will panic and trigger
    kdump. The weak arch_kexec_protect_crashkres() is introduced to do
    the actual protection.

    2) To allow multiple loads, once 1) is done we also need to re-mark
    the reserved memory read-write each time a kdump-related system call
    is made. The weak arch_kexec_unprotect_crashkres() is introduced to
    do the actual unprotection.

    An architecture can provide its specific implementation by overriding
    arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres(),
    as sketched below.
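
    A sketch of the weak-default pattern: the generic kexec code provides
    empty hooks which architectures may override (matching the mechanism
    described above; exact placement in the generic code is assumed):

        void __weak arch_kexec_protect_crashkres(void)
        {}

        void __weak arch_kexec_unprotect_crashkres(void)
        {}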

    Signed-off-by: Xunlei Pang
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Minfei Huang
    Cc: Vivek Goyal
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • remove_arg_zero() does free_arg_page() for no reason. This was needed
    before, and only if CONFIG_MMU=y: see commit 4fc75ff4816c ("exec: fix
    remove_arg_zero"), where install_arg_page() was called for every page
    != NULL in the bprm->page[] array. Today install_arg_page() is gone,
    and free_arg_page() is a nop after commit b6a2fea39318 ("mm: variable
    length argument support").

    CONFIG_MMU=n does free_arg_pages() in free_bprm() and thus doesn't
    need remove_arg_zero()->free_arg_page() either; apart from
    get_arg_page() it never checks whether the page in bprm->page[] was
    allocated or not, so the "extra" non-freed page is fine. OTOH, this
    free_arg_page() adds a minor pessimization: the caller is going to do
    copy_strings_kernel() right after remove_arg_zero(), which will likely
    need to re-allocate the same page again.

    And as hujunjie pointed out, the "offset == PAGE_SIZE" check is wrong,
    because we are going to increment bprm->p once again before returning,
    so CONFIG_MMU=n "leaks" the page anyway if '0' is the final byte in
    this page.

    NOTE: remove_arg_zero() assumes that argv[0] is null-terminated, but
    this is not necessarily true. copy_strings() does "len =
    strnlen_user(...)", then copy_from_user(len), but another thread or a
    debugger can overwrite the trailing '0' in between. AFAICS nothing
    really bad can happen, because we always have the null-terminated
    bprm->filename copied by the first copy_strings_kernel(), but perhaps
    we should change this code to check "bprm->p < bprm->exec" anyway,
    and/or change copy_strings() to ensure that the last byte in the
    string is always zero.

    Link: http://lkml.kernel.org/r/20160517155335.GA31435@redhat.com
    Signed-off-by: Oleg Nesterov
    Reported-by: hujunjie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Linux preallocates the task structs of the idle tasks for all possible
    CPUs. This currently means they all end up on node 0. It also implies
    that the cache lines used by MWAIT, which are around the flags field
    in the task struct, are all located on node 0.

    We see a noticeable performance improvement on Knights Landing CPUs
    when the cache lines used for MWAIT are located in the local nodes of
    the CPUs using them. I would expect this to give a (likely slight)
    improvement on other systems too.

    The patch places the idle task on the node of its CPU by passing the
    right target node to copy_process().
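
    A hedged sketch of the plumbing; copy_process_on_node() stands in for
    the real copy_process() call, which takes several more arguments:

        struct task_struct *fork_idle(int cpu)
        {
                struct task_struct *task;

                /* was effectively node 0; now the CPU's home node */
                task = copy_process_on_node(cpu_to_node(cpu));
                if (!IS_ERR(task))
                        init_idle(task, cpu);
                return task;
        }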

    [akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1]
    Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org
    Signed-off-by: Andi Kleen
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • All users of siginmask() must ensure that sig < SIGRTMIN. sig_fatal()
    doesn't, and this is wrong:

    UBSAN: Undefined behaviour in kernel/signal.c:911:6
    shift exponent 32 is too large for 32-bit type 'long unsigned int'

    The patch doesn't add the necessary check to sig_fatal(); it moves the
    check into siginmask() and updates the other callers.
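
    A sketch of the reworked macro, close to (but not necessarily
    identical to) the final version; the point is that the shift is never
    evaluated for out-of-range signals:

        #define siginmask(sig, mask) \
                ((sig) > 0 && (sig) < SIGRTMIN && (rt_sigmask(sig) & (mask)))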

    Link: http://lkml.kernel.org/r/20160517195052.GA15187@redhat.com
    Reported-by: Meelis Roos
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use pr_<level>() instead of printk(KERN_<LEVEL> ...).
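
    An illustrative conversion (the message text is hypothetical):

        /* before */
        printk(KERN_WARNING "something went wrong\n");

        /* after */
        pr_warn("something went wrong\n");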

    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang