01 Dec, 2018

1 commit

  • commit 1843abd03250115af6cec0892683e70cf2297c25 upstream.

    Userspace could have munmapped the area before doing unmapping from
    the gmap. This would leave us with a valid vmaddr, but an invalid vma
    from which we would try to zap memory.

    Let's check before using the vma.
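
    A minimal sketch of the check described above (illustrative only, not
    the literal upstream diff; names follow common mm conventions):

        vma = find_vma(gmap->mm, vmaddr);
        if (!vma || vmaddr < vma->vm_start)
                return;         /* area was munmapped, nothing left to zap */
        zap_page_range(vma, vmaddr, size);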

    Fixes: 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reported-by: Dan Carpenter
    Message-Id:
    Signed-off-by: Janosch Frank
    Signed-off-by: Greg Kroah-Hartman

    Janosch Frank
     

04 Oct, 2018

2 commits

  • [ Upstream commit 6b2ddf33baec23dace85bd647e3fc4ac070963e8 ]

    arch/s390/mm/extmem.c: In function '__segment_load':
    arch/s390/mm/extmem.c:436:2: warning: 'strncat' specified bound 7 equals
    source length [-Wstringop-overflow=]
    strncat(seg->res_name, " (DCSS)", 7);

    What gcc complains about here is the misuse of the strncat function,
    which in this case does not limit the number of bytes taken from "src",
    so in the end it is the same as strcat(seg->res_name, " (DCSS)");

    Keeping in mind that res_name is 15 bytes, strncat in this case would
    overflow the buffer and write a 0 into the alignment byte between the
    fields in the struct. To avoid that, increase the size of res_name to 16
    and use strlcat instead.
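
    A condensed sketch of the problem and the fix (the struct layout is
    illustrative, not the real definition in extmem.c):

        struct segment {
                char res_name[16];      /* was 15; grown into the padding byte */
        };

        /* before: the bound 7 equals strlen(" (DCSS)"), so nothing is limited */
        strncat(seg->res_name, " (DCSS)", 7);

        /* after: strlcat() limits the total size of the destination buffer */
        strlcat(seg->res_name, " (DCSS)", sizeof(seg->res_name));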

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit 5bedf8aa03c28cb8dc98bdd32a41b66d8f7d3eaa ]

    Since proc_dointvec does not perform value range control,
    proc_dointvec_minmax should be used to limit the value range, which is
    clearly intended here, given the internal representation of the value:

    unsigned int alloc_pgste:1;

    In fact it currently works, since commit 23fefe119ceb5 ("s390/kvm:
    avoid global config of vm.alloc_pgste=1") turned this into

    mm->context.alloc_pgste = page_table_allocate_pgste || ...

    Before that it was

    mm->context.alloc_pgste = page_table_allocate_pgste;

    which was broken. That was introduced with commit 0b46e0a3ec0d7 ("s390/kvm:
    remove delayed reallocation of page tables for KVM").
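
    A hedged sketch of what a range-limited sysctl entry looks like (the
    bounds and the table placement here are illustrative):

        static int zero;
        static int one = 1;

        static struct ctl_table page_table_sysctl[] = {
                {
                        .procname       = "allocate_pgste",
                        .data           = &page_table_allocate_pgste,
                        .maxlen         = sizeof(int),
                        .mode           = 0644,
                        .proc_handler   = proc_dointvec_minmax,
                        .extra1         = &zero,        /* lower bound */
                        .extra2         = &one,         /* upper bound */
                },
                { }
        };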

    Fixes: 0b46e0a3ec0d7 ("s390/kvm: remove delayed reallocation of page tables for KVM")
    Acked-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     

05 Sep, 2018

2 commits

  • commit 37a366face294facb9c9d9fdd9f5b64a27456cbd upstream.

    Commit c9b5ad546e7d "s390/mm: tag normal pages vs pages used in page tables"
    accidentally changed the logic in arch_set_page_states(), which is used by
    the suspend/resume code. set_page_stable(page, order) was changed to
    set_page_stable_dat(page, 0). After this, only the first page of higher order
    pages will be set to stable, and a write to one of the unstable pages will
    result in an addressing exception.

    Fix this by using "order" again, instead of "0".
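
    In code terms the fix is essentially (sketch):

        /* before: only the first page of a higher-order block became stable */
        set_page_stable_dat(page, 0);

        /* after: mark the whole 2^order block stable again */
        set_page_stable_dat(page, order);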

    Fixes: c9b5ad546e7d ("s390/mm: tag normal pages vs pages used in page tables")
    Cc: stable@vger.kernel.org # 4.14+
    Reviewed-by: Heiko Carstens
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit 306d6c49ac9ded11114cb53b0925da52f2c2ada1 upstream.

    When the oom killer kills a userspace process in the page fault handler
    while in guest context, the fault handler fails to release the mm_sem
    if the FAULT_FLAG_RETRY_NOWAIT option is set. This leads to a deadlock
    when tearing down the mm when the process terminates. This bug can only
    happen when pfault is enabled, so only KVM clients are affected.

    The problem arises in the rare cases in which handle_mm_fault does not
    release the mm_sem. This patch fixes the issue by manually releasing
    the mm_sem when needed.
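
    A rough sketch of the pattern described above (hedged, not the literal
    patch; the real fault handler has more cases):

        fault = handle_mm_fault(vma, address, flags);
        if ((fault & VM_FAULT_RETRY) && (flags & FAULT_FLAG_RETRY_NOWAIT)) {
                /* handle_mm_fault() did not drop mmap_sem in this case,
                 * so release it here before leaving the fault handler */
                up_read(&mm->mmap_sem);
        }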

    Fixes: 24eb3a824c4f3 ("KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault")
    Cc: # 3.15+
    Signed-off-by: Claudio Imbrenda
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     

14 Dec, 2017

1 commit

  • commit 8d306f53b63099fec2d56300149e400d181ba4f5 upstream.

    Martin Cermak reported that setting a uprobe doesn't work. Reason for
    this is that the common uprobes code tries to get an unmapped area at
    the last possible page within an address space.

    This broke with commit 1aea9b3f9210 ("s390/mm: implement 5 level page
    tables"), which introduced an off-by-one bug that prevents mapping
    anything at the last possible page within an address space.

    The check with the off-by-one bug however can be removed since with
    commit 8ab867cb0806 ("s390/mm: fix BUG_ON in crst_table_upgrade") the
    necessary check is done at both call sites.

    Reported-by: Martin Cermak
    Bisected-by: Thomas Richter
    Fixes: 1aea9b3f9210 ("s390/mm: implement 5 level page tables")
    Reviewed-by: Hendrik Brueckner
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
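
    For a C source file the identifier is a single comment line at the very
    top of the file, e.g.:

        // SPDX-License-Identifier: GPL-2.0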

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

19 Sep, 2017

1 commit

  • The check for the _SEGMENT_ENTRY_PROTECT bit in gup_huge_pmd() is the
    wrong way around. It must not be set for write==1, and not be checked for
    write==0. Fix this similar to how it was fixed for ptes long time ago in
    commit 25591b070336 ("[S390] fix get_user_pages_fast").

    One impact of this bug would be unnecessarily using the gup slow path for
    write==0 on r/w mappings. A potentially more severe impact would be that
    gup_huge_pmd() will succeed for write==1 on r/o mappings.
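
    Expressed as code, the corrected logic is roughly (hedged sketch; the
    real gup_huge_pmd() folds this into its mask comparison):

        /* fail the fast path only if a write is requested on a protected
         * (read-only) segment; for write==0 the bit does not matter */
        if (write && (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))
                return 0;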

    Cc:
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky

    Gerald Schaefer
     

12 Sep, 2017

1 commit

  • Pull more s390 updates from Martin Schwidefsky:
    "The second patch set for the 4.14 merge window:

    - Convert the dasd device driver to the blk-mq interface.

    - Provide three zcrypt interfaces for vfio_ap. These will be required
    for KVM guest access to the crypto cards attached via the AP bus.

    - A couple of memory management bug fixes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/dasd: blk-mq conversion
    s390/mm: use a single lock for the fields in mm_context_t
    s390/mm: fix race on mm->context.flush_mm
    s390/mm: fix local TLB flushing vs. detach of an mm address space
    s390/zcrypt: externalize AP queue interrupt control
    s390/zcrypt: externalize AP config info query
    s390/zcrypt: externalize test AP queue
    s390/mm: use VM_BUG_ON in crst_table_[upgrade|downgrade]

    Linus Torvalds
     

09 Sep, 2017

1 commit

  • Pull KVM updates from Radim Krčmář:
    "First batch of KVM changes for 4.14

    Common:
    - improve heuristic for boosting preempted spinlocks by ignoring
    VCPUs in user mode

    ARM:
    - fix for decoding external abort types from guests

    - added support for migrating the active priority of interrupts when
    running a GICv2 guest on a GICv3 host

    - minor cleanup

    PPC:
    - expose storage keys to userspace

    - merge kvm-ppc-fixes with a fix that missed 4.13 because of
    vacations

    - fixes

    s390:
    - merge of kvm/master to avoid conflicts with additional sthyi fixes

    - wire up the no-dat enhancements in KVM

    - multiple epoch facility (z14 feature)

    - Configuration z/Architecture Mode

    - more sthyi fixes

    - gdb server range checking fix

    - small code cleanups

    x86:
    - emulate Hyper-V TSC frequency MSRs

    - add nested INVPCID

    - emulate EPTP switching VMFUNC

    - support Virtual GIF

    - support 5 level page tables

    - speedup nested VM exits by packing byte operations

    - speedup MMIO by using hardware provided physical address

    - a lot of fixes and cleanups, especially nested"

    * tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
    KVM: arm/arm64: Support uaccess of GICC_APRn
    KVM: arm/arm64: Extract GICv3 max APRn index calculation
    KVM: arm/arm64: vITS: Drop its_ite->lpi field
    KVM: arm/arm64: vgic: constify seq_operations and file_operations
    KVM: arm/arm64: Fix guest external abort matching
    KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
    KVM: s390: vsie: cleanup mcck reinjection
    KVM: s390: use WARN_ON_ONCE only for checking
    KVM: s390: guestdbg: fix range check
    KVM: PPC: Book3S HV: Report storage key support to userspace
    KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
    KVM: PPC: Book3S HV: Fix invalid use of register expression
    KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
    KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
    KVM: PPC: e500mc: Fix a NULL dereference
    KVM: PPC: e500: Fix some NULL dereferences on error
    KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
    KVM: s390: we are always in czam mode
    KVM: s390: expose no-DAT to guest and migration support
    KVM: s390: sthyi: remove invalid guest write access
    ...

    Linus Torvalds
     

06 Sep, 2017

3 commits

  • The three locks 'lock', 'pgtable_lock' and 'gmap_lock' in the
    mm_context_t can be reduced to a single lock.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The BUG_ON in crst_table_[upgrade|downgrade] is a debugging aid,
    replace it with VM_BUG_ON.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Pull s390 updates from Martin Schwidefsky:
    "The first part of the s390 updates for 4.14:

    - Add machine type 0x3906 for IBM z14

    - Add IBM z14 TLB flushing improvements for KVM guests

    - Exploit the TOD clock epoch extension to provide a continuous TOD
    clock after 2042/09/17

    - Add NIAI spinlock hints for IBM z14

    - Rework the vmcp driver and use CMA for the response buffer of z/VM
    CP commands

    - Drop some s390 specific asm headers and use the generic version

    - Add block discard for DASD-FBA devices under z/VM

    - Add average request times to DASD statistics

    - A few of those constify patches which seem to be in vogue right now

    - Cleanup and bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (50 commits)
    s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs
    s390/dasd: Add discard support for FBA devices
    s390/zcrypt: make CPRBX const
    s390/uaccess: avoid mvcos jump label
    s390/mm: use generic mm_hooks
    s390/facilities: fix typo
    s390/vmcp: simplify vmcp_response_free()
    s390/topology: Remove the unused parent_node() macro
    s390/dasd: Change unsigned long long to unsigned long
    s390/smp: convert cpuhp_setup_state() return code to zero on success
    s390: fix 'novx' early parameter handling
    s390/dasd: add average request times to dasd statistics
    s390/scm: use common completion path
    s390/pci: log changes to uid checking
    s390/vmcp: simplify vmcp_ioctl()
    s390/vmcp: return -ENOTTY for unknown ioctl commands
    s390/vmcp: split vmcp header file and move to uapi
    s390/vmcp: make use of contiguous memory allocator
    s390/cpcmd,vmcp: avoid GFP_DMA allocations
    s390/vmcp: fix uaccess check and avoid undefined behavior
    ...

    Linus Torvalds
     

31 Aug, 2017

1 commit

  • A 31-bit compat process can force a BUG_ON in crst_table_upgrade
    with specific, invalid mmap calls, e.g.

    mmap((void*) 0x7fff8000, 0x10000, 3, 32, -1, 0)

    The arch_get_unmapped_area[_topdown] functions miss an if condition
    in the decision to do a page table upgrade.
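
    The missing condition is, in essence, an upper bound on the requested
    range before deciding to upgrade (hedged sketch, not the literal diff):

        if (addr + len > mm->context.asce_limit &&
            addr + len <= TASK_SIZE)
                rc = crst_table_upgrade(mm, addr + len);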

    Fixes: 9b11c7912d00 ("s390/mm: simplify arch_get_unmapped_area[_topdown]")
    Cc: # v4.12+
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

29 Aug, 2017

2 commits

  • Right now there is a potential hang situation for postcopy migrations,
    if the guest is enabling storage keys on the target system during the
    postcopy process.

    For storage key virtualization, we have to forbid the empty zero page as
    the storage key is a property of the physical page frame. As we enable
    storage key handling lazily we then drop all mappings for empty zero
    pages for lazy refaulting later on.

    This does not work with the postcopy migration, which relies on the
    empty zero page never triggering a fault again in the future. The reason
    is that, for a page known to be a zero page, postcopy migration will
    simply read the page on the target system in order to fault in an empty
    zero page. At the same time postcopy remembers that this page was already
    transferred - so any future userfault on that page will NOT be
    retransmitted, to avoid races.

    If now the guest enters the storage key mode while in postcopy, we will
    break this assumption of postcopy.

    The solution is to disable the empty zero page for KVM guests early on
    and not during storage key enablement. With this change, the postcopy
    migration process is guaranteed to start after no zero pages are left.

    As guest pages are very likely not empty zero pages anyway the memory
    overhead is also pretty small.

    While at it this also adds proper page table locking to the zero page
    removal.
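
    On s390 this boils down to keying the zero-page prohibition off the
    presence of PGSTEs instead of storage-key enablement (hedged sketch;
    the exact macro wiring may differ):

        /* forbid the shared zero page for any mm that could run KVM guests */
        #define mm_forbids_zeropage(mm) ((mm)->context.has_pgste)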

    Signed-off-by: Christian Borntraeger
    Acked-by: Janosch Frank
    Cc: stable@vger.kernel.org
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • The STFLE bit 147 indicates whether the ESSA no-DAT operation code is
    valid, the bit is not normally provided to the host; the host is
    instead provided with an SCLP bit that indicates whether guests can
    support the feature.

    This patch:
    * enables the STFLE bit in the guest if the corresponding SCLP bit is
    present in the host.
    * adds support for migrating the no-DAT bit in the PGSTEs
    * fixes the software interpretation of the ESSA instruction that is
    used when migrating, both for the new operation code and for the old
    "set stable", as per specifications.

    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    Acked-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger

    Claudio Imbrenda
     

09 Aug, 2017

1 commit

  • Memory blocks that contain areas for the contiguous memory allocator
    (cma) should not be allowed to go offline. Otherwise this would render
    cma completely useless.
    This might make sense on other architectures where memory might be
    taken offline due to hardware errors, but not on architectures which
    support memory hotplug for load balancing.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     

13 Jul, 2017

1 commit

  • When we enable storage keys for a guest lazily, we reset the ACC and F
    values. That is correct assuming that these are 0 on a clear reset and
    the guest obviously has not used any key setting instruction.

    We also zero out the change and reference bit. This is not correct as
    the architecture prefers over-indication instead of under-indication
    for the keyless->keyed transition.

    This patch fixes the behaviour and always sets guest change and guest
    reference for all guest storage keys on the keyless -> keyed switch.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Claudio Imbrenda
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     

07 Jul, 2017

4 commits

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.
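
    The interface change amounts to an extra size argument (sketch of the
    updated prototype):

        pte_t *huge_pte_offset(struct mm_struct *mm,
                               unsigned long addr, unsigned long sz);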

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
    arch_add_memory gets a for_device argument which then controls whether
    we want to create memblocks for the created memory sections. Simplify
    the logic by stating directly whether we want memblocks, rather than
    going through a pointless negation. This also makes the API easier to
    understand, because it is clear what we want, unlike for_device, which
    can mean anything.

    This shouldn't introduce any functional change.
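
    After the change the hook's signature carries the intent directly
    (hedged sketch of the shape at that time):

        /* regular hotplug passes true; ZONE_DEVICE (for_device) passes false */
        int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);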

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associate with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined, and a udev handler might want to online it based on
    some policy (e.g. association with a node), but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    the latter one is not an inherent requirement and can be changed in the
    future. It preserves the current behavior and made the code slightly
    simpler. This is subject to change in future.

    This means that the same physical online steps as above will lead to the
    following state:

    Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then updates the respective
    zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
    pfn range with the zone/node. __add_pages is updated to not require the
    zone and only initializes sections in the range. This allowed simplifying
    the arch_add_memory code (s390 could get rid of quite some code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it hooks into memory hotplug only half way.
    It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior when
    offlining a memory block adjacent to another zone (Normal vs. Movable)
    used to allow to change its movable type. This will be handled later.

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Device memory hotplug hooks into regular memory hotplug only half way.
    It needs memory sections to track struct pages but there is no
    need/desire to associate those sections with memory blocks and export
    them to the userspace via sysfs because they cannot be onlined anyway.

    This is currently expressed by for_device argument to arch_add_memory
    which then makes sure to associate the given memory range with
    ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
    to distinguish special memory hotplug from the regular one. While this
    works now, later patches in this series want to move __add_zone outside
    of arch_add_memory path so we have to come up with something else.

    Add want_memblock down the __add_pages path and use it to control
    whether the section->memblock association should be done.
    arch_add_memory then just trivially want memblock for everything but
    for_device hotplug.

    remove_memory_section doesn't need is_zone_device_section either. We
    can simply skip all the memblock specific cleanup if there is no
    memblock for the given section.

    This shouldn't introduce any functional change.
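
    Inside __add_section() this boils down to skipping the memory block
    registration when it was not requested (hedged sketch; helper names as
    of that kernel era):

        if (!want_memblock)
                return 0;
        return register_new_memory(nid, __pfn_to_section(phys_start_pfn));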

    Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Jul, 2017

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The bulk of the s390 patches for 4.13. Some new things but mostly bug
    fixes and cleanups. Noteworthy changes:

    - The SCM block driver is converted to blk-mq

    - Switch s390 to 5 level page tables. The virtual address space for a
    user space process can now have up to 16EB-4KB.

    - Introduce a ELF phdr flag for qemu to avoid the global
    vm.alloc_pgste which forces all processes to large page tables

    - A couple of PCI improvements to improve error recovery

    - Included is the merge of the base support for proper machine checks
    for KVM"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits)
    s390/dasd: Fix faulty ENODEV for RO sysfs attribute
    s390/pci: recognize name clashes with uids
    s390/pci: provide more debug information
    s390/pci: fix handling of PEC 306
    s390/pci: improve pci hotplug
    s390/pci: introduce clp_get_state
    s390/pci: improve error handling during fmb (de)registration
    s390/pci: improve unreg_ioat error handling
    s390/pci: improve error handling during interrupt deregistration
    s390/pci: don't cleanup in arch_setup_msi_irqs
    KVM: s390: Backup the guest's machine check info
    s390/nmi: s390: New low level handling for machine check happening in guest
    s390/fpu: export save_fpu_regs for all configs
    s390/kvm: avoid global config of vm.alloc_pgste=1
    s390: rename struct psw_bits members
    s390: rename psw_bits enums
    s390/mm: use correct address space when enabling DAT
    s390/cio: introduce io_subchannel_type
    s390/ipl: revert Load Normal semantics for LPAR CCW-type re-IPL
    s390/dumpstack: remove raw stack dump
    ...

    Linus Torvalds
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses alloca() as large as 64kB in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
    which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
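
    The vm_start_gap() helper mentioned above is essentially the following
    (sketch; stack_guard_gap holds the gap size in bytes, derived from the
    command line option):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow check */
                                vm_start = 0;
                }
                return vm_start;
        }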

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Jun, 2017

6 commits

    Rename a couple of the struct psw_bits members so it is more obvious
    what they are for. Initially I thought using the single character
    names from the PoP would be sufficient and obvious, but admittedly
    that is not true.

    The current implementation is not easy to use, if one has to look into
    the source file to figure out which member represents the 'per' bit
    (which is the 'r' member).

    Therefore rename the members to sane names that are identical to the
    uapi psw mask defines:

    r -> per
    i -> io
    e -> ext
    t -> dat
    m -> mcheck
    w -> wait
    p -> pstate
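
    For example, a PER check after the rename reads (illustrative; the
    handler name is hypothetical):

        /* old: psw_bits(regs->psw).r */
        if (psw_bits(regs->psw).per)
                handle_per_event(regs);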

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • The address space enums that must be used when modifying the address
    space part of a psw with the psw_bits() macro can easily be confused
    with the psw defines that are used to mask and compare directly the
    mask part of a psw.
    We have e.g. PSW_AS_PRIMARY vs PSW_ASC_PRIMARY.

    To avoid confusion rename the PSW_AS_* enums to PSW_BITS_AS_*.

    In addition also rename the PSW_AMODE_* enums, so they also follow the
    same naming scheme: PSW_BITS_AMODE_*.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
    Right now the kernel uses the primary address space until the switch to
    the correct home address space is finally done when the idle PSW is
    loaded within psw_idle().

    Correct this and simply use the home address space when DAT is enabled
    for the first time.

    This doesn't really fix a bug, but fixes odd behavior.

    Reviewed-by: Christian Borntraeger
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • When masking an ASCE to get its origin use the corresponding define
    instead of the unrelated PAGE_MASK.
    This doesn't fix a bug since both masks are identical.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
    Add __rcu annotations so sparse correctly warns only if "slot" gets
    dereferenced without using rcu_dereference(). Right now we get warnings
    because of the missing annotation:

    arch/s390/mm/gmap.c:135:17: warning: incorrect type in assignment (different address spaces)
    arch/s390/mm/gmap.c:135:17: expected void **slot
    arch/s390/mm/gmap.c:135:17: got void [noderef] **

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Add the logic to upgrade the page table for a 64-bit process to
    five levels. This increases the TASK_SIZE from 8PB to 16EB-4K.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

09 May, 2017

1 commit

  • set_memory_* functions have moved to set_memory.h. Switch to this
    explicitly
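
    In practice this means an explicit include where the functions are used
    (sketch; on s390 the header is asm/set_memory.h):

        #include <asm/set_memory.h>

        /* e.g. write-protect a range of kernel pages */
        set_memory_ro((unsigned long)start, nr_pages);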

    Link: http://lkml.kernel.org/r/1488920133-27229-5-git-send-email-labbott@redhat.com
    Signed-off-by: Laura Abbott
    Acked-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

26 Apr, 2017

1 commit

  • The kernel page table splitting code will split page tables even for
    features the CPU does not support. E.g. a CPU may not support the NX
    feature.
    In order to avoid this, remove those bits from the flags parameter
    that correlate with unsupported CPU features within __set_memory(). In
    addition add an early exit if the flags parameter does not have any
    bits set afterwards.
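
    The described filtering looks roughly like this (hedged sketch; macro
    and flag names as used on s390 at the time):

        static int __set_memory(unsigned long addr, int numpages, unsigned long flags)
        {
                if (!MACHINE_HAS_NX)
                        flags &= ~(SET_MEMORY_NX | SET_MEMORY_X);
                if (!flags)
                        return 0;
                /* ... split and change the page tables as before ... */
                return 0;
        }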

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens