01 Dec, 2018

1 commit

  • commit 1843abd03250115af6cec0892683e70cf2297c25 upstream.

    Userspace could have munmapped the area before doing unmapping from
    the gmap. This would leave us with a valid vmaddr, but an invalid vma
    from which we would try to zap memory.

    Let's check before using the vma.
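
    A minimal sketch of the check described above (illustrative only, not
    the literal upstream diff; names follow common mm conventions):

        vma = find_vma(gmap->mm, vmaddr);
        if (!vma || vmaddr < vma->vm_start)
                return;         /* area was munmapped, nothing left to zap */
        zap_page_range(vma, vmaddr, size);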

    Fixes: 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reported-by: Dan Carpenter
    Message-Id:
    Signed-off-by: Janosch Frank
    Signed-off-by: Greg Kroah-Hartman

    Janosch Frank
     

04 Oct, 2018

2 commits

  • [ Upstream commit 6b2ddf33baec23dace85bd647e3fc4ac070963e8 ]

    arch/s390/mm/extmem.c: In function '__segment_load':
    arch/s390/mm/extmem.c:436:2: warning: 'strncat' specified bound 7 equals
    source length [-Wstringop-overflow=]
    strncat(seg->res_name, " (DCSS)", 7);

    What gcc complains about here is the misuse of the strncat function,
    which in this case does not limit the number of bytes taken from "src",
    so in the end it is the same as strcat(seg->res_name, " (DCSS)");

    Keeping in mind that res_name is 15 bytes, strncat in this case would
    overflow the buffer and write a 0 into the alignment byte between the
    fields in the struct. To avoid that, increase the size of res_name to 16
    and use strlcat instead.
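
    A condensed sketch of the problem and the fix (the struct layout is
    illustrative, not the real definition in extmem.c):

        struct segment {
                char res_name[16];      /* was 15; grown into the padding byte */
        };

        /* before: the bound 7 equals strlen(" (DCSS)"), so nothing is limited */
        strncat(seg->res_name, " (DCSS)", 7);

        /* after: strlcat() limits the total size of the destination buffer */
        strlcat(seg->res_name, " (DCSS)", sizeof(seg->res_name));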

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit 5bedf8aa03c28cb8dc98bdd32a41b66d8f7d3eaa ]

    Since proc_dointvec does not perform value range control,
    proc_dointvec_minmax should be used to limit the value range, which is
    clearly intended here, given the internal representation of the value:

    unsigned int alloc_pgste:1;

    In fact it currently works, since commit 23fefe119ceb5 ("s390/kvm:
    avoid global config of vm.alloc_pgste=1") turned this into

    mm->context.alloc_pgste = page_table_allocate_pgste || ...

    Before that it was

    mm->context.alloc_pgste = page_table_allocate_pgste;

    which was broken. That was introduced with commit 0b46e0a3ec0d7 ("s390/kvm:
    remove delayed reallocation of page tables for KVM").
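
    A hedged sketch of what a range-limited sysctl entry looks like (the
    bounds and the table placement here are illustrative):

        static int zero;
        static int one = 1;

        static struct ctl_table page_table_sysctl[] = {
                {
                        .procname       = "allocate_pgste",
                        .data           = &page_table_allocate_pgste,
                        .maxlen         = sizeof(int),
                        .mode           = 0644,
                        .proc_handler   = proc_dointvec_minmax,
                        .extra1         = &zero,        /* lower bound */
                        .extra2         = &one,         /* upper bound */
                },
                { }
        };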

    Fixes: 0b46e0a3ec0d7 ("s390/kvm: remove delayed reallocation of page tables for KVM")
    Acked-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     

05 Sep, 2018

2 commits

  • commit 37a366face294facb9c9d9fdd9f5b64a27456cbd upstream.

    Commit c9b5ad546e7d "s390/mm: tag normal pages vs pages used in page tables"
    accidentally changed the logic in arch_set_page_states(), which is used by
    the suspend/resume code. set_page_stable(page, order) was changed to
    set_page_stable_dat(page, 0). After this, only the first page of higher order
    pages will be set to stable, and a write to one of the unstable pages will
    result in an addressing exception.

    Fix this by using "order" again, instead of "0".
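
    In code terms the fix is essentially (sketch):

        /* before: only the first page of a higher-order block became stable */
        set_page_stable_dat(page, 0);

        /* after: mark the whole 2^order block stable again */
        set_page_stable_dat(page, order);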

    Fixes: c9b5ad546e7d ("s390/mm: tag normal pages vs pages used in page tables")
    Cc: stable@vger.kernel.org # 4.14+
    Reviewed-by: Heiko Carstens
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit 306d6c49ac9ded11114cb53b0925da52f2c2ada1 upstream.

    When the oom killer kills a userspace process in the page fault handler
    while in guest context, the fault handler fails to release the mm_sem
    if the FAULT_FLAG_RETRY_NOWAIT option is set. This leads to a deadlock
    when tearing down the mm when the process terminates. This bug can only
    happen when pfault is enabled, so only KVM clients are affected.

    The problem arises in the rare cases in which handle_mm_fault does not
    release the mm_sem. This patch fixes the issue by manually releasing
    the mm_sem when needed.
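
    A rough sketch of the pattern described above (hedged, not the literal
    patch; the real fault handler has more cases):

        fault = handle_mm_fault(vma, address, flags);
        if ((fault & VM_FAULT_RETRY) && (flags & FAULT_FLAG_RETRY_NOWAIT)) {
                /* handle_mm_fault() did not drop mmap_sem in this case,
                 * so release it here before leaving the fault handler */
                up_read(&mm->mmap_sem);
        }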

    Fixes: 24eb3a824c4f3 ("KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault")
    Cc: # 3.15+
    Signed-off-by: Claudio Imbrenda
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     

14 Dec, 2017

1 commit

  • commit 8d306f53b63099fec2d56300149e400d181ba4f5 upstream.

    Martin Cermak reported that setting a uprobe doesn't work. Reason for
    this is that the common uprobes code tries to get an unmapped area at
    the last possible page within an address space.

    This broke with commit 1aea9b3f9210 ("s390/mm: implement 5 level page
    tables"), which introduced an off-by-one bug that prevents mapping
    anything at the last possible page within an address space.

    The check with the off-by-one bug however can be removed since with
    commit 8ab867cb0806 ("s390/mm: fix BUG_ON in crst_table_upgrade") the
    necessary check is done at both call sites.

    Reported-by: Martin Cermak
    Bisected-by: Thomas Richter
    Fixes: 1aea9b3f9210 ("s390/mm: implement 5 level page tables")
    Reviewed-by: Hendrik Brueckner
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Heiko Carstens
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
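
    For a C source file the identifier is a single comment line at the very
    top of the file, e.g.:

        // SPDX-License-Identifier: GPL-2.0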

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

19 Sep, 2017

1 commit

  • The check for the _SEGMENT_ENTRY_PROTECT bit in gup_huge_pmd() is the
    wrong way around. It must not be set for write==1, and not be checked for
    write==0. Fix this similar to how it was fixed for ptes long time ago in
    commit 25591b070336 ("[S390] fix get_user_pages_fast").

    One impact of this bug would be unnecessarily using the gup slow path for
    write==0 on r/w mappings. A potentially more severe impact would be that
    gup_huge_pmd() will succeed for write==1 on r/o mappings.
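
    Expressed as code, the corrected logic is roughly (hedged sketch; the
    real gup_huge_pmd() folds this into its mask comparison):

        /* fail the fast path only if a write is requested on a protected
         * (read-only) segment; for write==0 the bit does not matter */
        if (write && (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))
                return 0;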

    Cc:
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky

    Gerald Schaefer
     

12 Sep, 2017

1 commit

  • Pull more s390 updates from Martin Schwidefsky:
    "The second patch set for the 4.14 merge window:

    - Convert the dasd device driver to the blk-mq interface.

    - Provide three zcrypt interfaces for vfio_ap. These will be required
    for KVM guest access to the crypto cards attached via the AP bus.

    - A couple of memory management bug fixes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/dasd: blk-mq conversion
    s390/mm: use a single lock for the fields in mm_context_t
    s390/mm: fix race on mm->context.flush_mm
    s390/mm: fix local TLB flushing vs. detach of an mm address space
    s390/zcrypt: externalize AP queue interrupt control
    s390/zcrypt: externalize AP config info query
    s390/zcrypt: externalize test AP queue
    s390/mm: use VM_BUG_ON in crst_table_[upgrade|downgrade]

    Linus Torvalds
     

09 Sep, 2017

1 commit

  • Pull KVM updates from Radim Krčmář:
    "First batch of KVM changes for 4.14

    Common:
    - improve heuristic for boosting preempted spinlocks by ignoring
    VCPUs in user mode

    ARM:
    - fix for decoding external abort types from guests

    - added support for migrating the active priority of interrupts when
    running a GICv2 guest on a GICv3 host

    - minor cleanup

    PPC:
    - expose storage keys to userspace

    - merge kvm-ppc-fixes with a fix that missed 4.13 because of
    vacations

    - fixes

    s390:
    - merge of kvm/master to avoid conflicts with additional sthyi fixes

    - wire up the no-dat enhancements in KVM

    - multiple epoch facility (z14 feature)

    - Configuration z/Architecture Mode

    - more sthyi fixes

    - gdb server range checking fix

    - small code cleanups

    x86:
    - emulate Hyper-V TSC frequency MSRs

    - add nested INVPCID

    - emulate EPTP switching VMFUNC

    - support Virtual GIF

    - support 5 level page tables

    - speedup nested VM exits by packing byte operations

    - speedup MMIO by using hardware provided physical address

    - a lot of fixes and cleanups, especially nested"

    * tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
    KVM: arm/arm64: Support uaccess of GICC_APRn
    KVM: arm/arm64: Extract GICv3 max APRn index calculation
    KVM: arm/arm64: vITS: Drop its_ite->lpi field
    KVM: arm/arm64: vgic: constify seq_operations and file_operations
    KVM: arm/arm64: Fix guest external abort matching
    KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
    KVM: s390: vsie: cleanup mcck reinjection
    KVM: s390: use WARN_ON_ONCE only for checking
    KVM: s390: guestdbg: fix range check
    KVM: PPC: Book3S HV: Report storage key support to userspace
    KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
    KVM: PPC: Book3S HV: Fix invalid use of register expression
    KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
    KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
    KVM: PPC: e500mc: Fix a NULL dereference
    KVM: PPC: e500: Fix some NULL dereferences on error
    KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
    KVM: s390: we are always in czam mode
    KVM: s390: expose no-DAT to guest and migration support
    KVM: s390: sthyi: remove invalid guest write access
    ...

    Linus Torvalds
     

06 Sep, 2017

3 commits

  • The three locks 'lock', 'pgtable_lock' and 'gmap_lock' in the
    mm_context_t can be reduced to a single lock.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The BUG_ON in crst_table_[upgrade|downgrade] is a debugging aid,
    replace it with VM_BUG_ON.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Pull s390 updates from Martin Schwidefsky:
    "The first part of the s390 updates for 4.14:

    - Add machine type 0x3906 for IBM z14

    - Add IBM z14 TLB flushing improvements for KVM guests

    - Exploit the TOD clock epoch extension to provide a continuous TOD
    clock after 2042/09/17

    - Add NIAI spinlock hints for IBM z14

    - Rework the vmcp driver and use CMA for the response buffer of z/VM
    CP commands

    - Drop some s390 specific asm headers and use the generic version

    - Add block discard for DASD-FBA devices under z/VM

    - Add average request times to DASD statistics

    - A few of those constify patches which seem to be in vogue right now

    - Cleanup and bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (50 commits)
    s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs
    s390/dasd: Add discard support for FBA devices
    s390/zcrypt: make CPRBX const
    s390/uaccess: avoid mvcos jump label
    s390/mm: use generic mm_hooks
    s390/facilities: fix typo
    s390/vmcp: simplify vmcp_response_free()
    s390/topology: Remove the unused parent_node() macro
    s390/dasd: Change unsigned long long to unsigned long
    s390/smp: convert cpuhp_setup_state() return code to zero on success
    s390: fix 'novx' early parameter handling
    s390/dasd: add average request times to dasd statistics
    s390/scm: use common completion path
    s390/pci: log changes to uid checking
    s390/vmcp: simplify vmcp_ioctl()
    s390/vmcp: return -ENOTTY for unknown ioctl commands
    s390/vmcp: split vmcp header file and move to uapi
    s390/vmcp: make use of contiguous memory allocator
    s390/cpcmd,vmcp: avoid GFP_DMA allocations
    s390/vmcp: fix uaccess check and avoid undefined behavior
    ...

    Linus Torvalds
     

31 Aug, 2017

1 commit

  • A 31-bit compat process can force a BUG_ON in crst_table_upgrade
    with specific, invalid mmap calls, e.g.

    mmap((void*) 0x7fff8000, 0x10000, 3, 32, -1, 0)

    The arch_get_unmapped_area[_topdown] functions miss an if condition
    in the decision to do a page table upgrade.
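
    The missing condition is, in essence, an upper bound on the requested
    range before deciding to upgrade (hedged sketch, not the literal diff):

        if (addr + len > mm->context.asce_limit &&
            addr + len <= TASK_SIZE)
                rc = crst_table_upgrade(mm, addr + len);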

    Fixes: 9b11c7912d00 ("s390/mm: simplify arch_get_unmapped_area[_topdown]")
    Cc: # v4.12+
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

29 Aug, 2017

2 commits

  • Right now there is a potential hang situation for postcopy migrations,
    if the guest is enabling storage keys on the target system during the
    postcopy process.

    For storage key virtualization, we have to forbid the empty zero page as
    the storage key is a property of the physical page frame. As we enable
    storage key handling lazily we then drop all mappings for empty zero
    pages for lazy refaulting later on.

    This does not work with the postcopy migration, which relies on the
    empty zero page never triggering a fault again in the future. The reason
    is that, for a page known to be a zero page, postcopy migration will
    simply read the page on the target system in order to fault in an empty
    zero page. At the same time postcopy remembers that this page was already
    transferred - so any future userfault on that page will NOT be
    retransmitted, to avoid races.

    If now the guest enters the storage key mode while in postcopy, we will
    break this assumption of postcopy.

    The solution is to disable the empty zero page for KVM guests early on
    and not during storage key enablement. With this change, the postcopy
    migration process is guaranteed to start after no zero pages are left.

    As guest pages are very likely not empty zero pages anyway the memory
    overhead is also pretty small.

    While at it this also adds proper page table locking to the zero page
    removal.
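
    On s390 this boils down to keying the zero-page prohibition off the
    presence of PGSTEs instead of storage-key enablement (hedged sketch;
    the exact macro wiring may differ):

        /* forbid the shared zero page for any mm that could run KVM guests */
        #define mm_forbids_zeropage(mm) ((mm)->context.has_pgste)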

    Signed-off-by: Christian Borntraeger
    Acked-by: Janosch Frank
    Cc: stable@vger.kernel.org
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • The STFLE bit 147 indicates whether the ESSA no-DAT operation code is
    valid, the bit is not normally provided to the host; the host is
    instead provided with an SCLP bit that indicates whether guests can
    support the feature.

    This patch:
    * enables the STFLE bit in the guest if the corresponding SCLP bit is
    present in the host.
    * adds support for migrating the no-DAT bit in the PGSTEs
    * fixes the software interpretation of the ESSA instruction that is
    used when migrating, both for the new operation code and for the old
    "set stable", as per specifications.

    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    Acked-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger

    Claudio Imbrenda
     

09 Aug, 2017

1 commit

  • Memory blocks that contain areas for the contiguous memory allocator
    (cma) should not be allowed to go offline. Otherwise this would render
    cma completely useless.
    This might make sense on other architectures where memory might be
    taken offline due to hardware errors, but not on architectures which
    support memory hotplug for load balancing.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     

13 Jul, 2017

1 commit

  • When we enable storage keys for a guest lazily, we reset the ACC and F
    values. That is correct assuming that these are 0 on a clear reset and
    the guest obviously has not used any key setting instruction.

    We also zero out the change and reference bit. This is not correct as
    the architecture prefers over-indication instead of under-indication
    for the keyless->keyed transition.

    This patch fixes the behaviour and always sets guest change and guest
    reference for all guest storage keys on the keyless -> keyed switch.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Claudio Imbrenda
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     

07 Jul, 2017

4 commits

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.
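
    The interface change amounts to an extra size argument (sketch of the
    updated prototype):

        pte_t *huge_pte_offset(struct mm_struct *mm,
                               unsigned long addr, unsigned long sz);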

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
    arch_add_memory gets a for_device argument which then controls whether
    we want to create memblocks for the created memory sections. Simplify
    the logic by stating directly whether we want memblocks, rather than
    going through a pointless negation. This also makes the API easier to
    understand, because it is clear what we want, unlike for_device, which
    can mean anything.

    This shouldn't introduce any functional change.
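
    After the change the hook's signature carries the intent directly
    (hedged sketch of the shape at that time):

        /* regular hotplug passes true; ZONE_DEVICE (for_device) passes false */
        int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);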

    Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current memory hotplug implementation relies on having all the
    struct pages associate with a zone/node during the physical hotplug
    phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the
    vast majority of cases this means that they are added to ZONE_NORMAL.
    This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory
    hotadd without sparsemem") and it wasn't a big deal back then because
    movable onlining didn't exist yet.

    Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
    onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable
    memory and portion memory") and then things got more complicated.
    Rather than reconsidering the zone association which was no longer
    needed (because the memory hotplug already depended on SPARSEMEM) a
    convoluted semantic of zone shifting has been developed. Only the
    currently last memblock or the one adjacent to the zone_movable can be
    onlined movable. This essentially means that the online type changes as
    the new memblocks are added.

    Let's simulate memory hot online manually
    $ echo 0x100000000 > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory32/valid_zones
    Normal Movable

    $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    $ echo online_movable > /sys/devices/system/memory/memory34/state
    $ grep . /sys/devices/system/memory/memory3?/valid_zones
    /sys/devices/system/memory/memory32/valid_zones:Normal
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable Normal

    This is an awkward semantic because a udev event is sent as soon as the
    block is onlined, and a udev handler might want to online it based on
    some policy (e.g. association with a node), but it will inherently race
    with new blocks showing up.

    This patch changes the physical online phase to not associate pages with
    any zone at all. All the pages are just marked reserved and wait for
    the onlining phase to be associated with the zone as per the online
    request. There are only two requirements

    - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap

    - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses

    the latter one is not an inherent requirement and can be changed in the
    future. It preserves the current behavior and made the code slightly
    simpler. This is subject to change in future.

    This means that the same physical online steps as above will lead to the
    following state:

    Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Normal Movable

    /sys/devices/system/memory/memory32/valid_zones:Normal Movable
    /sys/devices/system/memory/memory33/valid_zones:Normal Movable
    /sys/devices/system/memory/memory34/valid_zones:Movable

    Implementation:
    The current move_pfn_range is reimplemented to check the above
    requirements (allow_online_pfn_range) and then updates the respective
    zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
    pfn range with the zone/node. __add_pages is updated to not require the
    zone and only initializes sections in the range. This allowed simplifying
    the arch_add_memory code (s390 could get rid of quite some code).

    devm_memremap_pages is the only user of arch_add_memory which relies on
    the zone association, because it hooks into memory hotplug only half way.
    It uses it to associate the new memory with ZONE_DEVICE but
    doesn't allow it to be {on,off}lined via sysfs. This means that this
    particular code path has to call move_pfn_range_to_zone explicitly.

    The original zone shifting code is kept in place and will be removed in
    the follow up patch for an easier review.

    Please note that this patch also changes the original behavior when
    offlining a memory block adjacent to another zone (Normal vs. Movable)
    used to allow to change its movable type. This will be handled later.

    [richard.weiyang@gmail.com: simplify zone_intersects()]
    Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
    [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
    Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
    [akpm@linux-foundation.org: remove unused local `i']
    Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Wei Yang
    Tested-by: Dan Williams
    Tested-by: Reza Arbab
    Acked-by: Heiko Carstens # For s390 bits
    Acked-by: Vlastimil Babka
    Cc: Martin Schwidefsky
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Device memory hotplug hooks into regular memory hotplug only half way.
    It needs memory sections to track struct pages but there is no
    need/desire to associate those sections with memory blocks and export
    them to the userspace via sysfs because they cannot be onlined anyway.

    This is currently expressed by for_device argument to arch_add_memory
    which then makes sure to associate the given memory range with
    ZONE_DEVICE. register_new_memory then relies on is_zone_device_section
    to distinguish special memory hotplug from the regular one. While this
    works now, later patches in this series want to move __add_zone outside
    of arch_add_memory path so we have to come up with something else.

    Add want_memblock down the __add_pages path and use it to control
    whether the section->memblock association should be done.
    arch_add_memory then just trivially want memblock for everything but
    for_device hotplug.

    remove_memory_section doesn't need is_zone_device_section either. We
    can simply skip all the memblock specific cleanup if there is no
    memblock for the given section.

    This shouldn't introduce any functional change.
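
    Inside __add_section() this boils down to skipping the memory block
    registration when it was not requested (hedged sketch; helper names as
    of that kernel era):

        if (!want_memblock)
                return 0;
        return register_new_memory(nid, __pfn_to_section(phys_start_pfn));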

    Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Tested-by: Dan Williams
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Joonsoo Kim
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Jul, 2017

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The bulk of the s390 patches for 4.13. Some new things but mostly bug
    fixes and cleanups. Noteworthy changes:

    - The SCM block driver is converted to blk-mq

    - Switch s390 to 5 level page tables. The virtual address space for a
    user space process can now have up to 16EB-4KB.

    - Introduce a ELF phdr flag for qemu to avoid the global
    vm.alloc_pgste which forces all processes to large page tables

    - A couple of PCI improvements to improve error recovery

    - Included is the merge of the base support for proper machine checks
    for KVM"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits)
    s390/dasd: Fix faulty ENODEV for RO sysfs attribute
    s390/pci: recognize name clashes with uids
    s390/pci: provide more debug information
    s390/pci: fix handling of PEC 306
    s390/pci: improve pci hotplug
    s390/pci: introduce clp_get_state
    s390/pci: improve error handling during fmb (de)registration
    s390/pci: improve unreg_ioat error handling
    s390/pci: improve error handling during interrupt deregistration
    s390/pci: don't cleanup in arch_setup_msi_irqs
    KVM: s390: Backup the guest's machine check info
    s390/nmi: s390: New low level handling for machine check happening in guest
    s390/fpu: export save_fpu_regs for all configs
    s390/kvm: avoid global config of vm.alloc_pgste=1
    s390: rename struct psw_bits members
    s390: rename psw_bits enums
    s390/mm: use correct address space when enabling DAT
    s390/cio: introduce io_subchannel_type
    s390/ipl: revert Load Normal semantics for LPAR CCW-type re-IPL
    s390/dumpstack: remove raw stack dump
    ...

    Linus Torvalds
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses alloca() as large as 64kB in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
    which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    no limit for the stack size limit because those applications can be
    tricked to consume a large portion of the stack and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
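
    The vm_start_gap() helper mentioned above is essentially the following
    (sketch; stack_guard_gap holds the gap size in bytes, derived from the
    command line option):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow check */
                                vm_start = 0;
                }
                return vm_start;
        }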

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Jun, 2017

6 commits

    Rename a couple of the struct psw_bits members so it is more obvious
    what they are for. Initially I thought using the single character
    names from the PoP would be sufficient and obvious, but admittedly
    that is not true.

    The current implementation is not easy to use, if one has to look into
    the source file to figure out which member represents the 'per' bit
    (which is the 'r' member).

    Therefore rename the members to sane names that are identical to the
    uapi psw mask defines:

    r -> per
    i -> io
    e -> ext
    t -> dat
    m -> mcheck
    w -> wait
    p -> pstate
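
    For example, a PER check after the rename reads (illustrative; the
    handler name is hypothetical):

        /* old: psw_bits(regs->psw).r */
        if (psw_bits(regs->psw).per)
                handle_per_event(regs);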

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • The address space enums that must be used when modifying the address
    space part of a psw with the psw_bits() macro can easily be confused
    with the psw defines that are used to mask and compare directly the
    mask part of a psw.
    We have e.g. PSW_AS_PRIMARY vs PSW_ASC_PRIMARY.

    To avoid confusion rename the PSW_AS_* enums to PSW_BITS_AS_*.

    In addition also rename the PSW_AMODE_* enums, so they also follow the
    same naming scheme: PSW_BITS_AMODE_*.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
    Right now the kernel uses the primary address space until the switch to
    the correct home address space is finally done when the idle PSW is
    loaded within psw_idle().

    Correct this and simply use the home address space when DAT is enabled
    for the first time.

    This doesn't really fix a bug, but fixes odd behavior.

    Reviewed-by: Christian Borntraeger
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • When masking an ASCE to get its origin use the corresponding define
    instead of the unrelated PAGE_MASK.
    This doesn't fix a bug since both masks are identical.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
    Add __rcu annotations so sparse correctly warns only if "slot" gets
    dereferenced without using rcu_dereference(). Right now we get warnings
    because of the missing annotation:

    arch/s390/mm/gmap.c:135:17: warning: incorrect type in assignment (different address spaces)
    arch/s390/mm/gmap.c:135:17: expected void **slot
    arch/s390/mm/gmap.c:135:17: got void [noderef] **

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Add the logic to upgrade the page table for a 64-bit process to
    five levels. This increases the TASK_SIZE from 8PB to 16EB-4K.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

09 May, 2017

1 commit

  • set_memory_* functions have moved to set_memory.h. Switch to this
    explicitly
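
    In practice this means an explicit include where the functions are used
    (sketch; on s390 the header is asm/set_memory.h):

        #include <asm/set_memory.h>

        /* e.g. write-protect a range of kernel pages */
        set_memory_ro((unsigned long)start, nr_pages);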

    Link: http://lkml.kernel.org/r/1488920133-27229-5-git-send-email-labbott@redhat.com
    Signed-off-by: Laura Abbott
    Acked-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

26 Apr, 2017

1 commit

  • The kernel page table splitting code will split page tables even for
    features the CPU does not support. E.g. a CPU may not support the NX
    feature.
    In order to avoid this, remove those bits from the flags parameter
    that correlate with unsupported CPU features within __set_memory(). In
    addition add an early exit if the flags parameter does not have any
    bits set afterwards.
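
    The described filtering looks roughly like this (hedged sketch; macro
    and flag names as used on s390 at the time):

        static int __set_memory(unsigned long addr, int numpages, unsigned long flags)
        {
                if (!MACHINE_HAS_NX)
                        flags &= ~(SET_MEMORY_NX | SET_MEMORY_X);
                if (!flags)
                        return 0;
                /* ... split and change the page tables as before ... */
                return 0;
        }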

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens