Eric Lee / smarc-fsl-linux-kernel

03 Aug, 2016

6 commits

1730f1466 kexec: add restriction on kexec_load() segment sizes ... Browse Code »

I hit the following issue when run trinity in my system. The kernel is
3.4 version, but mainline has the same issue.

The root cause is that the segment size is too large so the kerenl
spends too long trying to allocate a page. Other cases will block until
the test case quits. Also, OOM conditions will occur.

Call Trace:
__alloc_pages_nodemask+0x14c/0x8f0
alloc_pages_current+0xaf/0x120
kimage_alloc_pages+0x10/0x60
kimage_alloc_control_pages+0x5d/0x270
machine_kexec_prepare+0xe5/0x6c0
? kimage_free_page_list+0x52/0x70
sys_kexec_load+0x141/0x600
? vfs_write+0x100/0x180
system_call_fastpath+0x16/0x1b

The patch changes sanity_check_segment_list() to verify that the usage by
all segments does not exceed half of memory.

[akpm@linux-foundation.org: fix for kexec-return-error-number-directly.patch, update comment]
Link: http://lkml.kernel.org/r/1469625474-53904-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang
Suggested-by: Eric W. Biederman
Cc: Vivek Goyal
Cc: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

zhong jiang
2016-08-03 07:35:31 +0800
21db79e8b kexec: add a kexec_crash_loaded() function ... Browse Code »

Provide a wrapper function to be used by kernel code to check whether a
crash kernel is loaded. It returns the same value that can be seen in
/sys/kernel/kexec_crash_loaded by userspace programs.

I'm exporting the function, because it will be used by Xen, and it is
possible to compile Xen modules separately to enable the use of PV
drivers with unmodified bare-metal kernels.

Link: http://lkml.kernel.org/r/20160713121955.14969.69080.stgit@hananiah.suse.cz
Signed-off-by: Petr Tesarik
Cc: Juergen Gross
Cc: Josh Triplett
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: Eric Biederman
Cc: "H. Peter Anvin"
Cc: Boris Ostrovsky
Cc: "Paul E. McKenney"
Cc: Dave Young
Cc: David Vrabel
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Petr Tesarik
2016-08-03 07:35:30 +0800
43546d866 kexec: allow architectures to override boot mapping ... Browse Code »

kexec physical addresses are the boot-time view of the system. For
certain ARM systems (such as Keystone 2), the boot view of the system
does not match the kernel's view of the system: the boot view uses a
special alias in the lower 4GB of the physical address space.

To cater for these kinds of setups, we need to translate between the
boot view physical addresses and the normal kernel view physical
addresses. This patch extracts the current transation points into
linux/kexec.h, and allows an architecture to override the functions.

Due to the translations required, we unfortunately end up with six
translation functions, which are reduced down to four that the
architecture can override.

[akpm@linux-foundation.org: kexec.h needs asm/io.h for phys_to_virt()]
Link: http://lkml.kernel.org/r/E1b8koP-0004HZ-Vf@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King
Cc: Keerthy
Cc: Pratyush Anand
Cc: Vitaly Andrianov
Cc: Eric Biederman
Cc: Dave Young
Cc: Baoquan He
Cc: Vivek Goyal
Cc: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Russell King
2016-08-03 07:35:27 +0800
dae28018f kdump: arrange for paddr_vmcoreinfo_note() to return phys_addr_t ... Browse Code »

On PAE systems (eg, ARM LPAE) the vmcore note may be located above 4GB
physical on 32-bit architectures, so we need a wider type than "unsigned
long" here. Arrange for paddr_vmcoreinfo_note() to return a
phys_addr_t, thereby allowing it to be located above 4GB.

This makes no difference for kexec-tools, as they already assume a
64-bit type when reading from this file.

Link: http://lkml.kernel.org/r/E1b8koK-0004HS-K9@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King
Reviewed-by: Pratyush Anand
Acked-by: Baoquan He
Cc: Keerthy
Cc: Vitaly Andrianov
Cc: Eric Biederman
Cc: Dave Young
Cc: Vivek Goyal
Cc: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Russell King
2016-08-03 07:35:27 +0800
465d37770 kexec: ensure user memory sizes do not wrap ... Browse Code »

Ensure that user memory sizes do not wrap around when validating the
user input, which can lead to the following input validation working
incorrectly.

[akpm@linux-foundation.org: fix it for kexec-return-error-number-directly.patch]
Link: http://lkml.kernel.org/r/E1b8koF-0004HM-5x@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King
Reviewed-by: Pratyush Anand
Acked-by: Baoquan He
Cc: Keerthy
Cc: Vitaly Andrianov
Cc: Eric Biederman
Cc: Dave Young
Cc: Vivek Goyal
Cc: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Russell King
2016-08-03 07:35:26 +0800
4caf96152 kexec: return error number directly ... Browse Code »

This is a cleanup patch to make kexec more clear to return error number
directly. The variable result is useless, because there is no other
function's return value assignes to it. So remove it.

Link: http://lkml.kernel.org/r/1464179273-57668-1-git-send-email-mnghuan@gmail.com
Signed-off-by: Minfei Huang
Cc: Dave Young
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Xunlei Pang
Cc: Atsushi Kumagai
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minfei Huang
2016-08-03 07:35:24 +0800

24 May, 2016

2 commits

7a0058ec7 s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(… ... Browse Code »

…unprotect)_crashkres()

Commit 3f625002581b ("kexec: introduce a protection mechanism for the
crashkernel reserved memory") is a similar mechanism for protecting the
crash kernel reserved memory to previous crash_map/unmap_reserved_pages()
implementation, the new one is more generic in name and cleaner in code
(besides, some arch may not be allowed to unmap the pgtable).

Therefore, this patch consolidates them, and uses the new
arch_kexec_protect(unprotect)_crashkres() to replace former
crash_map/unmap_reserved_pages() which by now has been only used by
S390.

The consolidation work needs the crash memory to be mapped initially,
this is done in machine_kdump_pm_init() which is after
reserve_crashkernel(). Once kdump kernel is loaded, the new
arch_kexec_protect_crashkres() implemented for S390 will actually
unmap the pgtable like before.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Minfei Huang <mhuang@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Xunlei Pang
2016-05-24 08:04:14 +0800
9b492cf58 kexec: introduce a protection mechanism for the crashkernel reserved memory ... Browse Code »

For the cases that some kernel (module) path stamps the crash reserved
memory(already mapped by the kernel) where has been loaded the second
kernel data, the kdump kernel will probably fail to boot when panic
happens (or even not happens) leaving the culprit at large, this is
unacceptable.

The patch introduces a mechanism for detecting such cases:

1) After each crash kexec loading, it simply marks the reserved memory
regions readonly since we no longer access it after that. When someone
stamps the region, the first kernel will panic and trigger the kdump.
The weak arch_kexec_protect_crashkres() is introduced to do the actual
protection.

2) To allow multiple loading, once 1) was done we also need to remark
the reserved memory to readwrite each time a system call related to
kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced
to do the actual protection.

The architecture can make its specific implementation by overriding
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().

Signed-off-by: Xunlei Pang
Cc: Eric Biederman
Cc: Dave Young
Cc: Minfei Huang
Cc: Vivek Goyal
Cc: Baoquan He
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xunlei Pang
2016-05-24 08:04:14 +0800

21 May, 2016

1 commit

cf9b1106c printk/nmi: flush NMI messages on the system panic ... Browse Code »

In NMI context, printk() messages are stored into per-CPU buffers to
avoid a possible deadlock. They are normally flushed to the main ring
buffer via an IRQ work. But the work is never called when the system
calls panic() in the very same NMI handler.

This patch tries to flush NMI buffers before the crash dump is
generated. In this case it does not risk a double release and bails out
when the logbuf_lock is already taken. The aim is to get the messages
into the main ring buffer when possible. It makes them better
accessible in the vmcore.

Then the patch tries to flush the buffers second time when other CPUs
are down. It might be more aggressive and reset logbuf_lock. The aim
is to get the messages available for the consequent kmsg_dump() and
console_flush_on_panic() calls.

The patch causes vprintk_emit() to be called even in NMI context again.
But it is done via printk_deferred() so that the console handling is
skipped. Consoles use internal locks and we could not prevent a
deadlock easily. They are explicitly called later when the crash dump
is not generated, see console_flush_on_panic().

Signed-off-by: Petr Mladek
Cc: Benjamin Herrenschmidt
Cc: Daniel Thompson
Cc: David Miller
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jiri Kosina
Cc: Martin Schwidefsky
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Steven Rostedt
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Petr Mladek
2016-05-21 08:58:30 +0800

20 May, 2016

1 commit

0139aa7b7 mm: rename _count, field of the struct page, to _refcount ... Browse Code »

Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.

[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim
Signed-off-by: Stephen Rothwell
Cc: Vlastimil Babka
Cc: Hugh Dickins
Cc: Johannes Berg
Cc: "David S. Miller"
Cc: Sunil Goutham
Cc: Chris Metcalf
Cc: Manish Chopra
Cc: Yuval Mintz
Cc: Tariq Toukan
Cc: Saeed Mahameed
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joonsoo Kim
2016-05-20 10:12:14 +0800

29 Apr, 2016

2 commits

d7f53518f kexec: export OFFSET(page.compound_head) to find out compound tail page ... Browse Code »

PageAnon() always look at head page to check PAGE_MAPPING_ANON and tail
page's page->mapping has just a poisoned data since commit 1c290f642101
("mm: sanitize page->mapping for tail pages").

If makedumpfile checks page->mapping of a compound tail page to
distinguish anonymous page as usual, it must fail in newer kernel. So
it's necessary to export OFFSET(page.compound_head) to avoid checking
compound tail pages.

The problem is that unnecessary hugepages won't be removed from a dump
file in kernels 4.5.x and later. This means that extra disk space would
be consumed. It's a problem, but not critical.

Signed-off-by: Atsushi Kumagai
Acked-by: Dave Young
Cc: "Eric W. Biederman"
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Atsushi Kumagai
2016-04-29 10:34:04 +0800
8639a847b kexec: update VMCOREINFO for compound_order/dtor ... Browse Code »

makedumpfile refers page.lru.next to get the order of compound pages for
page filtering.

However, now the order is stored in page.compound_order, hence
VMCOREINFO should be updated to export the offset of
page.compound_order.

The fact is, page.compound_order was introduced already in kernel 4.0,
but the offset of it was the same as page.lru.next until kernel 4.3, so
this was not actual problem.

The above can be said also for page.lru.prev and page.compound_dtor,
it's necessary to detect hugetlbfs pages. Further, the content was
changed from direct address to the ID which means dtor.

The problem is that unnecessary hugepages won't be removed from a dump
file in kernels 4.4.x and later. This means that extra disk space would
be consumed. It's a problem, but not critical.

Signed-off-by: Atsushi Kumagai
Acked-by: Dave Young
Cc: "Eric W. Biederman"
Cc: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Atsushi Kumagai
2016-04-29 10:34:04 +0800

30 Jan, 2016

1 commit

1a085d072 kexec: Set IORESOURCE_SYSTEM_RAM for System RAM ... Browse Code »

Set proper ioresource flags and types for crash kernel
reservation areas.

Signed-off-by: Toshi Kani
Signed-off-by: Borislav Petkov
Reviewed-by: Dave Young
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: HATAYAMA Daisuke
Cc: Linus Torvalds
Cc: Luis R. Rodriguez
Cc: Minfei Huang
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Toshi Kani
Cc: Vivek Goyal
Cc: kexec@lists.infradead.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm
Link: http://lkml.kernel.org/r/1453841853-11383-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar

Toshi Kani
2016-01-30 16:49:57 +0800

21 Jan, 2016

1 commit

2b24692b9 kernel/kexec_core.c: use list_for_each_entry_safe in kimage_free_page_list ... Browse Code »

Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang
Cc: Dave Young
Cc: Vivek Goyal
Acked-by: Baoquan He
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Geliang Tang
2016-01-21 09:09:18 +0800

19 Dec, 2015

1 commit

7bbee5ca3 kexec: Fix race between panic() and crash_kexec() ... Browse Code »

Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):

CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs

CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop

If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.

In another case:

CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired

io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop

Clearly, this is an undesirable result.

To fix this problem, this patch changes crash_kexec() to exclude others
by using the panic_cpu atomic.

Signed-off-by: Hidehiro Kawai
Acked-by: Michal Hocko
Cc: Andrew Morton
Cc: Baoquan He
Cc: Dave Young
Cc: "Eric W. Biederman"
Cc: HATAYAMA Daisuke
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jonathan Corbet
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: Martin Schwidefsky
Cc: Masami Hiramatsu
Cc: Minfei Huang
Cc: Peter Zijlstra
Cc: Seth Jennings
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Vitaly Kuznetsov
Cc: Vivek Goyal
Cc: x86-ml
Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
Signed-off-by: Borislav Petkov
Signed-off-by: Thomas Gleixner

Hidehiro Kawai
2015-12-19 18:07:01 +0800

07 Nov, 2015

1 commit

de90a6bca kexec: use file name as the output message prefix ... Browse Code »

kexec output message misses the prefix "kexec", when Dave Young split the
kexec code. Now, we use file name as the output message prefix.

Currently, the format of output message:
[ 140.290795] SYSC_kexec_load: hello, world
[ 140.291534] kexec: sanity_check_segment_list: hello, world

Ideally, the format of output message:
[ 30.791503] kexec: SYSC_kexec_load, Hello, world
[ 79.182752] kexec_core: sanity_check_segment_list, Hello, world

Remove the custom prefix "kexec" in output message.

Signed-off-by: Minfei Huang
Acked-by: Dave Young
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minfei Huang
2015-11-07 09:50:42 +0800

21 Oct, 2015

1 commit

53b90c0c5 kexec/crash: Say which char is the unrecognized ... Browse Code »

It is helpful when the crashkernel cmdline parsing routines
actually say which character is the unrecognized one. Make them
do so.

Signed-off-by: Borislav Petkov
Reviewed-by: Dave Young
Reviewed-by: Joerg Roedel
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: H. Peter Anvin
Cc: Jiri Kosina
Cc: Juergen Gross
Cc: Linus Torvalds
Cc: Mark Salter
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vivek Goyal
Cc: WANG Chao
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar

Borislav Petkov
2015-10-21 17:10:57 +0800

11 Sep, 2015

4 commits

1303a27c9 kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo ... Browse Code »

In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
accordingly the MODULES_VADDR is changed to 0xffffffffa0000000. However,
in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
And the kernel text mapping addr space is enlarged from 512M to 1G. That
means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
support is not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:

#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)

So when kaslr is compiled in and enabled, the kernel text mapping addr
space and modules vaddr space need be adjusted. Otherwise makedumpfile
will collapse since the addr for some symbols is not correct.

Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
makedumpfile to help calculate MODULES_VADDR.

Signed-off-by: Baoquan He
Acked-by: Kees Cook
Acked-by: Vivek Goyal
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Baoquan He
2015-09-11 04:29:01 +0800
bbb78b8f3 kexec: align crash_notes allocation to make it be inside one physical page ... Browse Code »

People reported that crash_notes in /proc/vmcore were corrupted and this
cause crash kdump failure. With code debugging and log we got the root
cause. This is because percpu variable crash_notes are allocated in 2
vmalloc pages. Currently percpu is based on vmalloc by default. Vmalloc
can't guarantee 2 continuous vmalloc pages are also on 2 continuous
physical pages. So when 1st kernel exports the starting address and size
of crash_notes through sysfs like below:

/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size

kdump kernel use them to get the content of crash_notes. However the 2nd
part may not be in the next neighbouring physical page as we expected if
crash_notes are allocated accross 2 vmalloc pages. That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failure or
some warnings.

In this patch change to call __alloc_percpu() to passed in the align value
by rounding crash_notes_size up to the nearest power of two. This makes
sure the crash_notes is allocated inside one physical page since
sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE. Meanwhile add
a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
crash_notes definitely will be in 2 pages. That need be avoided, and need
be reported if it's unavoidable.

[akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Dave Young
Cc: Lisa Mitchell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Baoquan He
2015-09-11 04:29:01 +0800
04e9949b2 kexec: remove unnecessary test in kimage_alloc_crash_control_pages() ... Browse Code »

Transforming PFN(Page Frame Number) to struct page is never failure, so we
can simplify the code logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional judgement.

Signed-off-by: Minfei Huang
Acked-by: Dave Young
Acked-by: Vivek Goyal
Cc: Simon Horman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minfei Huang
2015-09-11 04:29:01 +0800
2965faa5e kexec: split kexec_load syscall from kexec core code ... Browse Code »

There are two kexec load syscalls, kexec_load another and kexec_file_load.
kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2015-09-11 04:29:01 +0800