03 Aug, 2016

6 commits

  • I hit the following issue when running trinity on my system. The kernel
    is version 3.4, but mainline has the same issue.

    The root cause is that the segment size is too large, so the kernel
    spends too long trying to allocate a page. Other cases will block until
    the test case quits. Also, OOM conditions will occur.

    Call Trace:
    __alloc_pages_nodemask+0x14c/0x8f0
    alloc_pages_current+0xaf/0x120
    kimage_alloc_pages+0x10/0x60
    kimage_alloc_control_pages+0x5d/0x270
    machine_kexec_prepare+0xe5/0x6c0
    ? kimage_free_page_list+0x52/0x70
    sys_kexec_load+0x141/0x600
    ? vfs_write+0x100/0x180
    system_call_fastpath+0x16/0x1b

    The patch changes sanity_check_segment_list() to verify that the usage by
    all segments does not exceed half of memory.

    [akpm@linux-foundation.org: fix for kexec-return-error-number-directly.patch, update comment]
    Link: http://lkml.kernel.org/r/1469625474-53904-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
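
A minimal userspace sketch of the check described in the commit above, assuming a 4096-byte page and a simplified segment structure. The names `check_segment_usage` and `struct segment` are hypothetical stand-ins, not the kernel's own code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Hypothetical stand-in for the kernel's segment descriptor. */
struct segment {
    unsigned long memsz; /* bytes the segment occupies once loaded */
};

/* Pages needed to hold x bytes, rounded up, without overflowing. */
static unsigned long page_count(unsigned long x)
{
    return x / PAGE_SIZE + (x % PAGE_SIZE != 0);
}

/*
 * Reject the request when a single segment, or all segments together,
 * would need more than half of the pages in the system.
 */
static int check_segment_usage(const struct segment *seg, size_t nr,
                               unsigned long total_ram_pages)
{
    unsigned long limit = total_ram_pages / 2;
    unsigned long total_pages = 0;

    for (size_t i = 0; i < nr; i++) {
        unsigned long pages = page_count(seg[i].memsz);

        /* Each term is bounded by the limit, so the sum cannot wrap. */
        if (pages > limit)
            return -EINVAL;
        total_pages += pages;
        if (total_pages > limit)
            return -EINVAL;
    }
    return 0;
}
```

Failing early with an error keeps an oversized request from reaching the page allocator at all, which is the stall seen in the call trace.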
     
  • Provide a wrapper function to be used by kernel code to check whether a
    crash kernel is loaded. It returns the same value that can be seen in
    /sys/kernel/kexec_crash_loaded by userspace programs.

    I'm exporting the function, because it will be used by Xen, and it is
    possible to compile Xen modules separately to enable the use of PV
    drivers with unmodified bare-metal kernels.

    Link: http://lkml.kernel.org/r/20160713121955.14969.69080.stgit@hananiah.suse.cz
    Signed-off-by: Petr Tesarik
    Cc: Juergen Gross
    Cc: Josh Triplett
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Eric Biederman
    Cc: "H. Peter Anvin"
    Cc: Boris Ostrovsky
    Cc: "Paul E. McKenney"
    Cc: Dave Young
    Cc: David Vrabel
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Tesarik
     
  • kexec physical addresses are the boot-time view of the system. For
    certain ARM systems (such as Keystone 2), the boot view of the system
    does not match the kernel's view of the system: the boot view uses a
    special alias in the lower 4GB of the physical address space.

    To cater for these kinds of setups, we need to translate between the
    boot view physical addresses and the normal kernel view physical
    addresses. This patch extracts the current translation points into
    linux/kexec.h, and allows an architecture to override the functions.

    Due to the translations required, we unfortunately end up with six
    translation functions, which are reduced down to four that the
    architecture can override.

    [akpm@linux-foundation.org: kexec.h needs asm/io.h for phys_to_virt()]
    Link: http://lkml.kernel.org/r/E1b8koP-0004HZ-Vf@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Cc: Keerthy
    Cc: Pratyush Anand
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
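
The translation idea in the commit above can be sketched as a pair of inverse functions. This is only an illustration: the two base addresses are made-up example values, and the function names are chosen for this sketch rather than taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Illustrative values only: where the kernel sees RAM, and its boot alias. */
#define KERNEL_VIEW_BASE 0x800000000ULL /* above 4GB */
#define BOOT_VIEW_BASE   0x080000000ULL /* alias inside the lower 4GB */
#define PAGE_SHIFT 12

/* Kernel-view physical address -> boot-view physical address. */
static phys_addr_t phys_to_boot_phys(phys_addr_t phys)
{
    return phys - KERNEL_VIEW_BASE + BOOT_VIEW_BASE;
}

/* Boot-view physical address -> kernel-view physical address. */
static phys_addr_t boot_phys_to_phys(phys_addr_t boot_phys)
{
    return boot_phys - BOOT_VIEW_BASE + KERNEL_VIEW_BASE;
}

/* Page frame numbers translate the same way, one page at a time. */
static uint64_t pfn_to_boot_pfn(uint64_t pfn)
{
    return phys_to_boot_phys((phys_addr_t)pfn << PAGE_SHIFT) >> PAGE_SHIFT;
}
```

On a system where both views coincide, both functions would reduce to the identity, which is why a generic default plus an architecture override is a natural fit.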
     
  • On PAE systems (e.g., ARM LPAE) the vmcore note may be located above 4GB
    physical on 32-bit architectures, so we need a wider type than "unsigned
    long" here. Arrange for paddr_vmcoreinfo_note() to return a
    phys_addr_t, thereby allowing it to be located above 4GB.

    This makes no difference for kexec-tools, as they already assume a
    64-bit type when reading from this file.

    Link: http://lkml.kernel.org/r/E1b8koK-0004HS-K9@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Reviewed-by: Pratyush Anand
    Acked-by: Baoquan He
    Cc: Keerthy
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
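
To illustrate why the wider return type in the commit above matters, the sketch below models a 32-bit `unsigned long` with `uint32_t` and a PAE-capable `phys_addr_t` with `uint64_t`. Both typedefs and function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ulong32_t;   /* models "unsigned long" on a 32-bit arch */
typedef uint64_t phys_addr_t; /* models phys_addr_t with PAE/LPAE enabled */

/* Returning the address through the narrow type drops the upper bits. */
static ulong32_t note_addr_narrow(phys_addr_t paddr)
{
    return (ulong32_t)paddr;
}

/* Returning phys_addr_t keeps addresses above 4GB intact. */
static phys_addr_t note_addr_wide(phys_addr_t paddr)
{
    return paddr;
}
```

For a note placed at 0x100002000, the narrow variant yields 0x2000, which points at unrelated memory.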
     
  • Ensure that user memory sizes do not wrap around when validating the
    user input. A wrapped size can cause the validation checks that follow
    to work incorrectly.

    [akpm@linux-foundation.org: fix it for kexec-return-error-number-directly.patch]
    Link: http://lkml.kernel.org/r/E1b8koF-0004HM-5x@rmk-PC.armlinux.org.uk
    Signed-off-by: Russell King
    Reviewed-by: Pratyush Anand
    Acked-by: Baoquan He
    Cc: Keerthy
    Cc: Vitaly Andrianov
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
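
A minimal sketch of a wrap-around check of the kind the commit above describes, assuming the segment is given as a start address and a size. The function name and the choice of error code are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Unsigned arithmetic wraps modulo 2^N, so an end address that is
 * smaller than the start address means start + size overflowed.
 */
static int check_segment_wrap(unsigned long mem, unsigned long memsz)
{
    unsigned long mstart = mem;
    unsigned long mend = mstart + memsz;

    if (mstart > mend)
        return -EADDRNOTAVAIL;
    return 0;
}
```

Without this test, a wrapped end address looks small and can slip past later range comparisons such as an upper-bound check.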
     
  • This is a cleanup patch that makes kexec clearer by returning the error
    number directly. The variable result is useless because no other
    function's return value is assigned to it, so remove it.

    Link: http://lkml.kernel.org/r/1464179273-57668-1-git-send-email-mnghuan@gmail.com
    Signed-off-by: Minfei Huang
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Xunlei Pang
    Cc: Atsushi Kumagai
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     

24 May, 2016

2 commits

  • …unprotect)_crashkres()

    Commit 3f625002581b ("kexec: introduce a protection mechanism for the
    crashkernel reserved memory") provides a mechanism for protecting the
    crash kernel reserved memory that is similar to the previous
    crash_map/unmap_reserved_pages() implementation. The new one is more
    generic in name and cleaner in code (besides, some architectures may
    not be allowed to unmap the pgtable).

    Therefore, this patch consolidates them, and uses the new
    arch_kexec_protect(unprotect)_crashkres() to replace the former
    crash_map/unmap_reserved_pages(), which by now is used only by S390.

    The consolidation work needs the crash memory to be mapped initially;
    this is done in machine_kdump_pm_init(), which runs after
    reserve_crashkernel(). Once the kdump kernel is loaded, the new
    arch_kexec_protect_crashkres() implemented for S390 will actually
    unmap the pgtable as before.

    Signed-off-by: Xunlei Pang <xlpang@redhat.com>
    Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Minfei Huang <mhuang@redhat.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Baoquan He <bhe@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Xunlei Pang
     
  • When some kernel (or module) code path stomps on the crash reserved
    memory (already mapped by the kernel) into which the second kernel's
    data has been loaded, the kdump kernel will probably fail to boot when
    a panic happens (or even when none happens), leaving the culprit at
    large. This is unacceptable.

    The patch introduces a mechanism for detecting such cases:

    1) After each crash kexec load, it simply marks the reserved memory
    regions read-only, since we no longer access them after that. When
    someone stomps on the region, the first kernel will panic and trigger
    the kdump. The weak arch_kexec_protect_crashkres() is introduced to do
    the actual protection.

    2) To allow multiple loads, once 1) is done we also need to re-mark
    the reserved memory as read-write each time a system call related to
    kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced
    to remove the protection.

    The architecture can make its specific implementation by overriding
    arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().

    Signed-off-by: Xunlei Pang
    Cc: Eric Biederman
    Cc: Dave Young
    Cc: Minfei Huang
    Cc: Vivek Goyal
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
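
The protect/unprotect cycle from the commit above can be modelled in userspace with `mmap()` and `mprotect()`. This is an analogy only, not the kernel mechanism: the region, the helper names, and the load routine are all hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static void *crash_region;       /* stands in for the reserved memory */
static size_t crash_region_size;

/* Reserve the region; it starts out mapped read-write. */
static int crash_region_init(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    crash_region = p;
    crash_region_size = size;
    return 0;
}

/* Step 1: after a load, make the region read-only. */
static int protect_crash_region(void)
{
    return mprotect(crash_region, crash_region_size, PROT_READ);
}

/* Step 2: before the next load, make it writable again. */
static int unprotect_crash_region(void)
{
    return mprotect(crash_region, crash_region_size, PROT_READ | PROT_WRITE);
}

/* Each load lifts the protection, copies the image, and restores it. */
static int load_crash_image(const void *image, size_t len)
{
    if (len > crash_region_size)
        return -1;
    if (unprotect_crash_region() != 0)
        return -1;
    memcpy(crash_region, image, len);
    return protect_crash_region();
}
```

Between loads, any stray write into the region faults immediately, so the offender is caught at the point of the write instead of at the next crash.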
     

21 May, 2016

1 commit

  • In NMI context, printk() messages are stored into per-CPU buffers to
    avoid a possible deadlock. They are normally flushed to the main ring
    buffer via an IRQ work. But the work is never called when the system
    calls panic() in the very same NMI handler.

    This patch tries to flush NMI buffers before the crash dump is
    generated. In this case it does not risk a double release and bails out
    when the logbuf_lock is already taken. The aim is to get the messages
    into the main ring buffer when possible. This makes them more
    accessible in the vmcore.

    Then the patch tries to flush the buffers a second time once the other
    CPUs are down. At that point it can be more aggressive and reset
    logbuf_lock. The aim is to make the messages available for the
    subsequent kmsg_dump() and console_flush_on_panic() calls.

    The patch causes vprintk_emit() to be called even in NMI context again.
    But it is done via printk_deferred() so that the console handling is
    skipped. Consoles use internal locks and we could not prevent a
    deadlock easily. They are explicitly called later when the crash dump
    is not generated, see console_flush_on_panic().

    Signed-off-by: Petr Mladek
    Cc: Benjamin Herrenschmidt
    Cc: Daniel Thompson
    Cc: David Miller
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jiri Kosina
    Cc: Martin Schwidefsky
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

20 May, 2016

1 commit

  • Many developers already know that the reference count field of struct
    page is _count and that it has an atomic type. They may try to modify
    it directly, which would defeat the purpose of the page reference count
    tracepoint. To prevent direct _count modification, this patch renames
    it to _refcount and adds a warning comment in the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
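
A small userspace model of the accessor pattern that the rename in the commit above is meant to encourage. The struct is reduced to the one field, and C11 atomics stand in for the kernel's atomic_t; treat the helper bodies as a sketch, not as the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>

struct page {
    /* Usage count. Do not touch directly; go through the helpers below. */
    atomic_int _refcount;
};

/* Read the current reference count. */
static inline int page_ref_count(struct page *page)
{
    return atomic_load(&page->_refcount);
}

/* Take a reference. A tracepoint hook would sit here in a real system. */
static inline void page_ref_inc(struct page *page)
{
    atomic_fetch_add(&page->_refcount, 1);
}

/* Drop a reference; returns nonzero when the count reached zero. */
static inline int page_ref_dec_and_test(struct page *page)
{
    return atomic_fetch_sub(&page->_refcount, 1) == 1;
}
```

Funnelling every update through helpers gives one place to attach tracing, which direct field access would bypass.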
     

29 Apr, 2016

2 commits

  • PageAnon() always looks at the head page to check PAGE_MAPPING_ANON, and
    a tail page's page->mapping holds only poisoned data since commit
    1c290f642101 ("mm: sanitize page->mapping for tail pages").

    If makedumpfile checks page->mapping of a compound tail page to
    identify anonymous pages as before, it will fail on newer kernels. So
    it is necessary to export OFFSET(page.compound_head) so that
    makedumpfile can avoid checking compound tail pages.

    The problem is that unnecessary hugepages won't be removed from a dump
    file in kernels 4.5.x and later. This means that extra disk space would
    be consumed. It's a problem, but not critical.

    Signed-off-by: Atsushi Kumagai
    Acked-by: Dave Young
    Cc: "Eric W. Biederman"
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     
  • makedumpfile refers to page.lru.next to get the order of compound pages
    for page filtering.

    However, the order is now stored in page.compound_order, so VMCOREINFO
    should be updated to export the offset of page.compound_order.

    In fact, page.compound_order was already introduced in kernel 4.0, but
    its offset was the same as that of page.lru.next until kernel 4.3, so
    this was not an actual problem.

    The same applies to page.lru.prev and page.compound_dtor, which is
    needed to detect hugetlbfs pages. Further, the content was changed
    from a direct address to an ID that identifies the dtor.

    The problem is that unnecessary hugepages won't be removed from a dump
    file in kernels 4.4.x and later. This means that extra disk space would
    be consumed. It's a problem, but not critical.

    Signed-off-by: Atsushi Kumagai
    Acked-by: Dave Young
    Cc: "Eric W. Biederman"
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
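
The two commits above export structure offsets so that makedumpfile does not have to guess the layout. A sketch of how such `OFFSET(...)` lines could be produced is shown below; the struct layout is hypothetical and heavily simplified, and the buffer handling is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily simplified layout; the real struct page differs. */
struct page {
    unsigned long flags;
    unsigned long compound_head;
    unsigned int compound_dtor;
    unsigned int compound_order;
};

static char vmcoreinfo[256]; /* text buffer handed to the dump tool */
static size_t vmcoreinfo_len;

/* Append one "OFFSET(struct.field)=N" line if it fits in the buffer. */
static void vmcoreinfo_append(const char *name, const char *field,
                              unsigned long off)
{
    size_t room = sizeof(vmcoreinfo) - vmcoreinfo_len;
    int n = snprintf(vmcoreinfo + vmcoreinfo_len, room,
                     "OFFSET(%s.%s)=%lu\n", name, field, off);

    if (n > 0 && (size_t)n < room)
        vmcoreinfo_len += (size_t)n;
}

#define VMCOREINFO_OFFSET(name, field) \
    vmcoreinfo_append(#name, #field, \
                      (unsigned long)offsetof(struct name, field))

/* Export the fields a dump filter needs to recognise compound pages. */
static void export_page_offsets(void)
{
    VMCOREINFO_OFFSET(page, compound_head);
    VMCOREINFO_OFFSET(page, compound_dtor);
    VMCOREINFO_OFFSET(page, compound_order);
}
```

Because the offsets are computed by the compiler that built the running kernel, the dump tool keeps working when fields move between kernel versions.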
     

30 Jan, 2016

1 commit

  • Set proper ioresource flags and types for crash kernel
    reservation areas.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Reviewed-by: Dave Young
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: HATAYAMA Daisuke
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Minfei Huang
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: Vivek Goyal
    Cc: kexec@lists.infradead.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-8-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     

21 Jan, 2016

1 commit


19 Dec, 2015

1 commit

  • Currently, panic() and crash_kexec() can be called at the same time.
    For example (x86 case):

    CPU 0:
    oops_end()
    crash_kexec()
    mutex_trylock() // acquired
    nmi_shootdown_cpus() // stop other CPUs

    CPU 1:
    panic()
    crash_kexec()
    mutex_trylock() // failed to acquire
    smp_send_stop() // stop other CPUs
    infinite loop

    If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
    fails.

    In another case:

    CPU 0:
    oops_end()
    crash_kexec()
    mutex_trylock() // acquired

    io_check_error()
    panic()
    crash_kexec()
    mutex_trylock() // failed to acquire
    infinite loop

    Clearly, this is an undesirable result.

    To fix this problem, this patch changes crash_kexec() to exclude others
    by using the panic_cpu atomic.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Michal Hocko
    Cc: Andrew Morton
    Cc: Baoquan He
    Cc: Dave Young
    Cc: "Eric W. Biederman"
    Cc: HATAYAMA Daisuke
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: kexec@lists.infradead.org
    Cc: linux-doc@vger.kernel.org
    Cc: Martin Schwidefsky
    Cc: Masami Hiramatsu
    Cc: Minfei Huang
    Cc: Peter Zijlstra
    Cc: Seth Jennings
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: Vivek Goyal
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner

    Hidehiro Kawai
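
The exclusion described in the commit above can be sketched with a single atomic compare-and-exchange. C11 atomics stand in for the kernel's atomic helpers, and the function names `crash_path_enter` and `crash_path_leave` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PANIC_CPU_INVALID (-1)

/* Holds the number of the CPU that owns the crash path, or -1 if none. */
static atomic_int panic_cpu = PANIC_CPU_INVALID;

/*
 * Try to become the one CPU allowed to run the crash path.
 * Only the first caller swaps in its CPU number and gets true;
 * every later caller sees a valid owner and gets false.
 */
static bool crash_path_enter(int this_cpu)
{
    int expected = PANIC_CPU_INVALID;

    return atomic_compare_exchange_strong(&panic_cpu, &expected, this_cpu);
}

/* Release ownership, for the case where the crash path returns. */
static void crash_path_leave(void)
{
    atomic_store(&panic_cpu, PANIC_CPU_INVALID);
}
```

Sharing one owner variable between the panic and crash paths means whichever CPU gets there first wins both, so two CPUs can no longer stop each other halfway.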
     

07 Nov, 2015

1 commit

  • The kexec output messages have been missing the "kexec" prefix since
    Dave Young split the kexec code. Now, we use the file name as the output
    message prefix.

    Currently, the format of the output messages is:
    [ 140.290795] SYSC_kexec_load: hello, world
    [ 140.291534] kexec: sanity_check_segment_list: hello, world

    Ideally, the format of the output messages would be:
    [ 30.791503] kexec: SYSC_kexec_load, Hello, world
    [ 79.182752] kexec_core: sanity_check_segment_list, Hello, world

    Remove the custom prefix "kexec" in the output messages.

    Signed-off-by: Minfei Huang
    Acked-by: Dave Young
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
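
A userspace sketch of a file-name-based message prefix like the one the commit above describes. `KBUILD_MODNAME` is defined by hand here because there is no kernel build system, and `log_to_buf` is a hypothetical helper that formats into a buffer instead of the kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Normally supplied per source file by the kernel build system. */
#define KBUILD_MODNAME "kexec_core"

/* Every message from this file gets the module name as its prefix. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Hypothetical helper: format a prefixed message into a buffer. */
#define log_to_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), __VA_ARGS__)
```

With the prefix derived from the file name, each file is labelled consistently without hand-written strings in every message.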
     

21 Oct, 2015

1 commit

  • It is helpful when the crashkernel cmdline parsing routines
    actually say which character is the unrecognized one. Make them
    do so.

    Signed-off-by: Borislav Petkov
    Reviewed-by: Dave Young
    Reviewed-by: Joerg Roedel
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: H. Peter Anvin
    Cc: Jiri Kosina
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Mark Salter
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vivek Goyal
    Cc: WANG Chao
    Cc: jerry_hoemann@hp.com
    Link: http://lkml.kernel.org/r/1445246268-26285-8-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
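
A simplified sketch of a size parser that reports the offending character, in the spirit of the commit above. It accepts only `<number>[K|M|G]` optionally followed by `@`; the function name, the accepted grammar, and the message text are assumptions for illustration, not the kernel's parser.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a size such as "256M" or "64M@16M" (only the size part is read).
 * Returns NULL on success, or a pointer to the first character that
 * could not be understood, after printing it.
 */
static const char *parse_size(const char *s, unsigned long long *size)
{
    char *end;
    unsigned long long val = strtoull(s, &end, 0);

    if (end != s) {
        switch (*end) {
        case 'G': val <<= 10; /* fall through */
        case 'M': val <<= 10; /* fall through */
        case 'K': val <<= 10; end++; break;
        default: break;
        }
        if (*end == '\0' || *end == '@') {
            *size = val;
            return NULL;
        }
    }
    if (*end == '\0')
        fprintf(stderr, "crashkernel: missing size\n");
    else
        fprintf(stderr, "crashkernel: unrecognized char: %c\n", *end);
    return end;
}
```

Returning a pointer to the bad character lets a caller show exactly where in the command line parsing stopped.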
     

11 Sep, 2015

4 commits

  • On x86_64, KERNEL_IMAGE_SIZE has been 512M since v2.6.26, and
    MODULES_VADDR was accordingly changed to 0xffffffffa0000000. However,
    in v3.12 Kees Cook introduced kaslr to randomise the location of the
    kernel, and the kernel text mapping address space was enlarged from
    512M to 1G. That means KERNEL_IMAGE_SIZE is now variable: its value is
    512M when kaslr support is not compiled in and 1G when it is.
    Accordingly, MODULES_VADDR was changed to:

    #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)

    So when kaslr is compiled in and enabled, the kernel text mapping
    address space and the modules vaddr space need to be adjusted.
    Otherwise makedumpfile will break, since the addresses of some symbols
    are not correct.

    Hence KERNEL_IMAGE_SIZE needs to be exported to vmcoreinfo and read by
    makedumpfile to help calculate MODULES_VADDR.

    Signed-off-by: Baoquan He
    Acked-by: Kees Cook
    Acked-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
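
The arithmetic in the commit above can be checked with a short sketch. The base constant is the x86_64 `__START_KERNEL_map` value implied by the addresses in the text; the `NUMBER(KERNEL_IMAGE_SIZE)=` line format and both function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* x86_64 __START_KERNEL_map, consistent with the addresses quoted above. */
#define START_KERNEL_MAP 0xffffffff80000000ULL

/* MODULES_VADDR = __START_KERNEL_map + KERNEL_IMAGE_SIZE */
static uint64_t modules_vaddr(uint64_t kernel_image_size)
{
    return START_KERNEL_MAP + kernel_image_size;
}

/* Read the exported size from one vmcoreinfo text line (assumed format). */
static int parse_kernel_image_size(const char *line, uint64_t *size)
{
    unsigned long long v;

    if (sscanf(line, "NUMBER(KERNEL_IMAGE_SIZE)=%llu", &v) != 1)
        return -1;
    *size = v;
    return 0;
}
```

With 512M the result is 0xffffffffa0000000 and with 1G it is 0xffffffffc0000000, so a tool that hard-codes the first value misplaces the module area on a kaslr kernel.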
     
  • People reported that crash_notes in /proc/vmcore were corrupted and
    that this caused kdump failures. With code debugging and logs we found
    the root cause: the percpu variable crash_notes can be allocated across
    2 vmalloc pages. Currently percpu is based on vmalloc by default, and
    vmalloc can't guarantee that 2 contiguous vmalloc pages are also on 2
    contiguous physical pages. The 1st kernel exports the starting address
    and size of crash_notes through sysfs as below:

    /sys/devices/system/cpu/cpux/crash_notes
    /sys/devices/system/cpu/cpux/crash_notes_size

    The kdump kernel uses them to get the content of crash_notes. However,
    the 2nd part may not be in the neighbouring physical page as expected
    if crash_notes is allocated across 2 vmalloc pages. That's why
    nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be huge in
    update_note_header_size_elf64() and cause note header merging failures
    or some warnings.

    This patch changes the allocation to call __alloc_percpu(), passing in
    an align value obtained by rounding crash_notes_size up to the nearest
    power of two. This makes sure crash_notes is allocated inside one
    physical page, since sizeof(note_buf_t) on all architectures is smaller
    than PAGE_SIZE. It also adds a BUILD_BUG_ON to break the build if the
    size is bigger than PAGE_SIZE, since crash_notes would then definitely
    span 2 pages. That needs to be avoided, and needs to be reported if
    it's unavoidable.

    [akpm@linux-foundation.org: use correct comment layout]
    Signed-off-by: Baoquan He
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Dave Young
    Cc: Lisa Mitchell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
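
The alignment argument in the commit above can be verified with a small sketch: an object whose start is aligned to a power of two at least as large as its size, and no larger than the page size, never straddles a page boundary. The 4096-byte page, the 344-byte note size, and the helper names are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define NOTE_BUF_SIZE 344UL /* hypothetical size of one note buffer */

/* Compile-time guard, playing the role of the BUILD_BUG_ON in the text. */
_Static_assert(NOTE_BUF_SIZE <= PAGE_SIZE, "note buffer must fit in a page");

/* Smallest power of two that is >= n (returns 1 for n == 0). */
static size_t roundup_pow_of_two(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Does [start, start + size) touch more than one page? */
static bool crosses_page(size_t start, size_t size)
{
    return start / PAGE_SIZE != (start + size - 1) / PAGE_SIZE;
}

/* Check every aligned start offset across a few pages. */
static bool aligned_object_crosses_page(size_t size)
{
    size_t align = roundup_pow_of_two(size);

    for (size_t start = 0; start < 4 * PAGE_SIZE; start += align)
        if (crosses_page(start, size))
            return true;
    return false;
}
```

An unaligned start such as offset 4000 does cross a boundary for the same size, which is the failure mode the patch removes.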
     
  • Transforming a PFN (Page Frame Number) to a struct page never fails, so
    we can simplify the code logic by doing the image->control_page
    assignment directly in the loop and removing the unnecessary
    conditional check.

    Signed-off-by: Minfei Huang
    Acked-by: Dave Young
    Acked-by: Vivek Goyal
    Cc: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • There are two kexec load syscalls, kexec_load and kexec_file_load.
    kexec_file_load has already been split out as kernel/kexec_file.c. In
    this patch I split the kexec_load syscall code into kernel/kexec.c.

    Also add a new kconfig option KEXEC_CORE, so we can disable kexec_load
    and use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o, who wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools using the kexec_load syscall can bypass the check.

    Vivek Goyal proposed to create a common kconfig option so the user can
    compile in only one syscall for loading a kexec kernel. KEXEC/KEXEC_FILE
    selects KEXEC_CORE so that old config files still work.

    Because there is general code that needs CONFIG_KEXEC_CORE, I updated
    all the architecture Kconfig files with a new option KEXEC_CORE, and
    let KEXEC select KEXEC_CORE in the arch Kconfig. Also updated the
    general kernel code that relates to the kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young