13 Jun, 2016

33 commits

  • Use dynamically allocated irq descriptors on s390 which allows
    us to get rid of the s390 specific config option PCI_NR_MSI and
    exploit more MSI interrupts. Also the size of the kernel image
    is reduced by 131K (using performance_defconfig).

    Signed-off-by: Sebastian Ott
    Signed-off-by: Martin Schwidefsky

    Sebastian Ott
     
  • When s390 traces with hex_ascii or sprintf view are
    extracted and sorted, use the sort option -s (stable)
    to avoid multiple lines with the same time stamp being
    sorted using the rest of the line as secondary key.

    Signed-off-by: Thomas Richter
    Signed-off-by: Martin Schwidefsky

    Thomas Richter
     
  • Small cleanup patch to use the shorter __section macro everywhere.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • On s390 __ro_after_init is currently mapped to __read_mostly which
    means that data marked as __ro_after_init will not be protected.

    Reason for this is that the common code __ro_after_init implementation
    is x86 centric: the ro_after_init data section was added to rodata,
    since x86 enables write protection to kernel text and rodata very
    late. On s390 we have write protection for these sections enabled with
    the initial page tables. So adding the ro_after_init data section to
    rodata does not work on s390.

    In order to make __ro_after_init work properly on s390 move the
    ro_after_init data right behind rodata. Unlike the rodata section it
    will be marked read-only later after all init calls happened.

    This s390 specific implementation adds new __start_ro_after_init and
    __end_ro_after_init labels. Everything in between will be marked
    read-only after the init calls happened. In addition to the
    __ro_after_init data move also the exception table there, since from a
    practical point of view it fits the __ro_after_init requirements.

    Signed-off-by: Heiko Carstens
    Reviewed-by: Kees Cook
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • commit c74ba8b3480d ("arch: Introduce post-init read-only memory")
    introduced the __ro_after_init attribute which allows adding variables
    to the ro_after_init data section.

    This new section was added to rodata, even though it contains writable
    data. This in turn causes problems on architectures which mark the
    page table entries read-only that point to rodata very early.

    This patch allows architectures to implement their own handling of the
    .data..ro_after_init section.
    Usually that would be:
    - mark the rodata section read-only very early
    - mark the ro_after_init section read-only within mark_rodata_ro

    Signed-off-by: Heiko Carstens
    Reviewed-by: Kees Cook
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • ptep_flush_lazy and pmdp_flush_lazy use mm->context.attach_count to
    decide between a lazy TLB flush vs an immediate TLB flush. The field
    contains two 16-bit counters, the number of CPUs that have the mm
    attached and can create TLB entries for it and the number of CPUs in
    the middle of a page table update.

    The __tlb_flush_asce, ptep_flush_direct and pmdp_flush_direct functions
    use the attach counter and a mask check with mm_cpumask(mm) to decide
    between a local flush of the current CPU and a global flush.

    For all these functions the decision between lazy vs immediate and
    local vs global TLB flush can be based on CPU masks. There are two
    masks: the mm->context.cpu_attach_mask with the CPUs that are actively
    using the mm, and the mm_cpumask(mm) with the CPUs that have used the
    mm since the last full flush. The decision between lazy vs immediate
    flush is based on the mm->context.cpu_attach_mask, to decide between
    local vs global flush the mm_cpumask(mm) is used.

    With this patch all checks will use the CPU masks, the old counter
    mm->context.attach_count with its two 16-bit values is turned into a
    single counter mm->context.flush_count that keeps track of the number
    of CPUs with incomplete page table updates. The sole user of this
    counter is finish_arch_post_lock_switch() which waits for the end of
    all page table updates.
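
    The mask based decisions described above can be sketched in plain C.
    This is only an illustration under stated assumptions: struct mm_masks,
    cpu_bit(), can_flush_lazy() and can_flush_local() are hypothetical
    names, each mask is modelled as a 64-bit word with one bit per CPU, and
    any hardware capability check the kernel may apply before a local flush
    is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-mm masks; one bit per CPU. */
struct mm_masks {
    uint64_t cpu_attach_mask; /* CPUs that are actively using the mm */
    uint64_t mm_cpumask;      /* CPUs that used the mm since the last full flush */
};

/* Mask with only the bit of the given CPU set. */
static uint64_t cpu_bit(unsigned int cpu)
{
    assert(cpu < 64);
    return UINT64_C(1) << cpu;
}

/* Lazy vs immediate: lazy is possible if only this CPU is attached. */
static bool can_flush_lazy(const struct mm_masks *mm, unsigned int this_cpu)
{
    return mm->cpu_attach_mask == cpu_bit(this_cpu);
}

/* Local vs global: local suffices if no other CPU has used the mm. */
static bool can_flush_local(const struct mm_masks *mm, unsigned int this_cpu)
{
    return mm->mm_cpumask == cpu_bit(this_cpu);
}
```

    With CPU 2 as the only attached CPU but CPU 5 still set in the second
    mask, can_flush_lazy() is true for CPU 2 while can_flush_local() is
    false, so a flush issued later would have to be global.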

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The bitmap_equal function has optimized code for small bitmaps with less
    than BITS_PER_LONG bits. For larger bitmaps the out-of-line function
    __bitmap_equal is called.

    For a constant number of bits divisible by BITS_PER_LONG the memcmp
    function can be used. For s390 gcc knows how to optimize this function,
    memcmp calls with up to 256 bytes / 2048 bits are translated into a
    single instruction.
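
    A userspace sketch of the idea, not the kernel code itself:
    bitmap_equal_sketch() is a hypothetical name, and BITS_PER_LONG is
    redefined locally so the example compiles on its own. When the bit count
    is a multiple of BITS_PER_LONG the comparison is a plain memcmp,
    otherwise the last word is masked.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Compare the first nbits bits of two bitmaps stored as unsigned long words. */
static bool bitmap_equal_sketch(const unsigned long *a, const unsigned long *b,
                                unsigned int nbits)
{
    /* Whole words only: every byte is significant, so memcmp is enough. */
    if (nbits % BITS_PER_LONG == 0)
        return memcmp(a, b, nbits / CHAR_BIT) == 0;

    /* Otherwise compare the full words, then the masked partial word. */
    size_t full = nbits / BITS_PER_LONG;
    for (size_t i = 0; i < full; i++)
        if (a[i] != b[i])
            return false;

    unsigned long mask = (1UL << (nbits % BITS_PER_LONG)) - 1;
    return ((a[full] ^ b[full]) & mask) == 0;
}
```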

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The vunmap_pte_range() function calls ptep_get_and_clear() without any
    locking. ptep_get_and_clear() uses ptep_xchg_lazy()/ptep_flush_direct()
    for the page table update. ptep_flush_direct requires that preemption
    is disabled, but without any locking this is not the case. If the kernel
    preempts the task while the attach_counter is increased an endless loop
    in finish_arch_post_lock_switch() will occur the next time the task is
    scheduled.

    Add explicit preempt_disable()/preempt_enable() calls to the relevant
    functions in arch/s390/mm/pgtable.c.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The External-Time-Reference (ETR) clock synchronization interface has
    been superseded by Server-Time-Protocol (STP). Remove the outdated
    ETR interface.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The PTFF instruction can be used to retrieve information about UTC
    including the current number of leap seconds. Use this value to
    convert the coordinated server time value of the TOD clock to a
    proper UTC timestamp to initialize the system time. Without this
    correction the system time will be off by the number of leap seconds
    until it has been corrected via NTP.
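
    The arithmetic behind the correction can be sketched as follows. This
    is an illustration, not the patch: tod_to_utc_seconds() is a
    hypothetical helper, the leap second count is passed in rather than
    queried with PTFF, and the two constants are assumptions based on the
    TOD clock format (bit 51 ticks once per microsecond, epoch in 1900).

```c
#include <stdint.h>

/* Assumed TOD clock value at 1970-01-01 00:00:00. */
#define TOD_UNIX_EPOCH    UINT64_C(0x7d91048bca000000)
/* 4096 TOD units per microsecond, so 4096000000 units per second. */
#define TOD_UNITS_PER_SEC UINT64_C(4096000000)

/* Convert a TOD clock value to seconds since the Unix epoch in UTC. */
static uint64_t tod_to_utc_seconds(uint64_t tod, unsigned int leap_seconds)
{
    /* Remove the leap seconds that the TOD value includes. */
    tod -= (uint64_t)leap_seconds * TOD_UNITS_PER_SEC;
    return (tod - TOD_UNIX_EPOCH) / TOD_UNITS_PER_SEC;
}
```

    Skipping the subtraction leaves the result ahead by exactly
    leap_seconds, which is the offset the commit message describes.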

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • It is possible to specify a user offset for the TOD clock, e.g. +2 hours.
    The TOD clock will carry this offset even if the clock is synchronized
    with STP. This makes the time stamps acquired with get_sync_clock()
    useless as another LPAR might use a different TOD offset.

    Use the PTFF instruction to get the TOD epoch difference and subtract
    it from the TOD clock value to get a physical timestamp. As the epoch
    difference contains the sync check delta as well, the LPAR offset value
    to the physical clock needs to be refreshed after each clock
    synchronization.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The PTFF instruction is not a function of ETR, rename and move the
    PTFF definitions from etr.h to timex.h.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The sync clock operation of the channel subsystem call for STP delivers
    the TOD clock difference as a result. Use this TOD clock difference
    instead of the difference between the TOD timestamps before and after
    the sync clock operation.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Reducing the size of reserved memory for the crash kernel will result
    in an immediate crash on s390. Reason for that is that we do not
    create struct pages for memory that is reserved. If that memory is
    freed any access to struct pages which correspond to this memory will
    result in invalid memory accesses and a kernel panic.

    Fix this by properly creating struct pages when the system gets
    initialized. Change the code also to make use of set_memory_ro() and
    set_memory_rw() so page tables will be split if required.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Implement an s390 version of the weak crash_free_reserved_phys_range
    function. This allows us to update the size of the reserved crash
    kernel memory if it will be resized.

    This was previously done with a call to crash_unmap_reserved_pages
    from crash_shrink_memory which was removed with ("s390/kexec:
    consolidate crash_map/unmap_reserved_pages() and
    arch_kexec_protect(unprotect)_crashkres()")

    Fixes: 7a0058ec7860 ("s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()")
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • The segment/region table that is part of the kernel image must be
    properly aligned to 16k in order to make the crdte inline assembly
    work.
    Otherwise it will calculate a wrong segment/region table start address
    and access incorrect memory locations if the swapper_pg_dir is not
    aligned to 16k.

    Therefore define BSS_FIRST_SECTIONS in order to put the swapper_pg_dir
    at the beginning of the bss section and also align the bss section to
    16k just like other architectures did.
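
    Why the alignment matters can be shown with a small userspace sketch.
    It assumes, as an illustration only, that the table start is derived
    from a pointer to one entry by masking off the low bits;
    table_origin() and aligned_table are hypothetical names, with
    aligned_table standing in for swapper_pg_dir.

```c
#include <stdalign.h>
#include <stdint.h>

#define TABLE_ENTRIES 2048                              /* 8-byte entries */
#define TABLE_SIZE    (TABLE_ENTRIES * sizeof(uint64_t)) /* 16k */

/* Derive the table start from any entry pointer by masking to 16k. */
static uintptr_t table_origin(const uint64_t *entry)
{
    return (uintptr_t)entry & ~(uintptr_t)(TABLE_SIZE - 1);
}

/* Stand-in for swapper_pg_dir, aligned to its own size. */
static alignas(16384) uint64_t aligned_table[TABLE_ENTRIES];
```

    For aligned_table every entry pointer maps back to the table start. For
    a table that begins at a 4k but not 16k boundary the masked value points
    in front of the table, which is the wrong start address described above.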

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Let's provide the basic machine information for dump_stack on
    s390. This enables the "Hardware name:" line and results in
    output like

    [...]
    Oops: 0004 ilc:2 [#1] SMP
    Modules linked in:
    CPU: 1 PID: 74 Comm: sh Not tainted 4.5.0+ #205
    Hardware name: IBM 2964 NC9 704 (KVM)
    [...]

    Signed-off-by: Christian Borntraeger
    Acked-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • Signed-off-by: Daniel van Gerpen
    Acked-by: Peter Oberparleiter
    Signed-off-by: Martin Schwidefsky

    Daniel van Gerpen
     
  • Show the dynamic and static cpu mhz of each cpu. Since these values
    are per cpu this requires a fundamental extension of the format of
    /proc/cpuinfo.

    Historically we had only a single line per cpu and a summary at the
    top of the file. This format is hardly extendible if we want to add
    more per cpu information.

    Therefore this patch adds per cpu blocks at the end of /proc/cpuinfo:

    cpu : 0
    cpu Mhz dynamic : 5504
    cpu Mhz static : 5504

    cpu : 1
    cpu Mhz dynamic : 5504
    cpu Mhz static : 5504

    cpu : 2
    cpu Mhz dynamic : 5504
    cpu Mhz static : 5504

    cpu : 3
    cpu Mhz dynamic : 5504
    cpu Mhz static : 5504

    Right now each block contains only the dynamic and static cpu mhz,
    but it can be easily extended like on every other architecture.

    This extension is supposed to be compatible with the old format.

    Signed-off-by: Heiko Carstens
    Acked-by: Sascha Silbe
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Change the code to print all the current output during the first
    iteration. This is a preparation patch for the upcoming per cpu block
    extension to /proc/cpuinfo.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Ensure that we always have __stringify().

    Signed-off-by: Jason Baron
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Jason Baron
     
  • Add statistics that show how memory is mapped within the kernel
    identity mapping. This is more or less the same as git
    commit ce0c0e50f94e ("x86, generic: CPA add statistics about state
    of direct mapping v4") for x86.

    I also intentionally copied the lower case "k" within DirectMap4k vs
    the upper case "M" and "G" within the two other lines. Let's have
    consistent inconsistencies across architectures.

    The output of /proc/meminfo now contains these additional lines:

    DirectMap4k: 2048 kB
    DirectMap1M: 3991552 kB
    DirectMap2G: 4194304 kB

    The implementation on s390 is lockless unlike the x86 version, since I
    assume changes to the kernel mapping are a very rare event. Therefore
    it really doesn't matter if these statistics could potentially be
    inconsistent if read while kernel page tables are being changed.
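
    The three lines relate to per-size mapping counts by simple
    multiplication. A minimal sketch, assuming one counter per mapping size;
    the enum, kb_per_entry and direct_map_kb() are hypothetical names:

```c
#include <stdint.h>

enum { MAP_4K, MAP_1M, MAP_2G, MAP_NR };

/* Size in kB covered by one mapping entry of each kind. */
static const uint64_t kb_per_entry[MAP_NR] = { 4, 1024, 2097152 };

/* kB value reported for one mapping size, given the entry counters. */
static uint64_t direct_map_kb(const uint64_t count[MAP_NR], int level)
{
    return count[level] * kb_per_entry[level];
}
```

    The sample output above corresponds to 512 4k pages, 3898 1M segments
    and 2 2G regions.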

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • For the kernel identity mapping map everything read-writeable and
    subsequently call set_memory_ro() to make the ro section read-only.
    This simplifies the code a lot.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • set_memory_ro() and set_memory_rw() currently only work on 4k
    mappings, which is good enough for module code aka the vmalloc area.

    However we stumbled already twice into the need to make this also work
    on larger mappings:
    - the ro after init patch set
    - the crash kernel resize code

    Therefore this patch implements automatic kernel page table splitting
    if e.g. set_memory_ro() would be called on parts of a 2G mapping.
    This works quite the same as the x86 code, but is much simpler.

    In order to make this work and to be architecturally compliant we now
    always use the csp, cspg or crdte instructions to replace valid page
    table entries. This means that set_memory_ro() and set_memory_rw()
    will be much more expensive than before. In order to avoid huge
    latencies the code contains a couple of cond_resched() calls.

    The current code only splits page tables, but does not merge them if
    it would be possible. The reason for this is that currently there is
    no real life scenario where this would really happen. All current use
    cases that I know of only change access rights once during the life
    time. If that should change we can still implement kernel page table
    merging at a later time.
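
    The splitting idea can be modelled in userspace. The sketch below is a
    toy, not the pageattr code: it uses a 1M segment split into 256 4k
    pages instead of a 2G mapping, replaces entries with plain stores
    rather than csp/cspg/crdte, and all type and function names are
    hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PTE 256     /* 4k pages per 1M segment */
#define PAGE_SIZE    4096UL

struct toy_pte { uint64_t addr; bool ro; };

struct toy_segment {
    bool large;              /* true: one entry maps the whole 1M */
    uint64_t addr;           /* start address of the 1M range */
    bool ro;                 /* protection of the large mapping */
    struct toy_pte *ptes;    /* page table, valid once split */
};

/* Replace a large mapping by a page table that inherits its protection. */
static int split_segment(struct toy_segment *seg)
{
    if (!seg->large)
        return 0;
    struct toy_pte *ptes = malloc(PTRS_PER_PTE * sizeof(*ptes));
    if (!ptes)
        return -1;
    for (size_t i = 0; i < PTRS_PER_PTE; i++) {
        ptes[i].addr = seg->addr + i * PAGE_SIZE;
        ptes[i].ro = seg->ro;
    }
    seg->ptes = ptes;
    seg->large = false;
    return 0;
}

/* Mark nr_pages pages read-only, splitting only if the range is partial. */
static int toy_set_memory_ro(struct toy_segment *seg, size_t first, size_t nr_pages)
{
    if (first > PTRS_PER_PTE || nr_pages > PTRS_PER_PTE - first)
        return -1;
    if (seg->large && first == 0 && nr_pages == PTRS_PER_PTE) {
        seg->ro = true;      /* whole mapping: no split needed */
        return 0;
    }
    if (split_segment(seg))
        return -1;
    for (size_t i = first; i < first + nr_pages; i++)
        seg->ptes[i].ro = true;
    return 0;
}
```

    As in the patch, the toy only splits and never merges: once a segment
    has a page table it keeps it.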

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Make pmd_wrprotect() and pmd_mkwrite() available independently from
    CONFIG_TRANSPARENT_HUGEPAGE and CONFIG_HUGETLB_PAGE so these can be
    used on the kernel mapping.

    Also introduce a couple of pud helper functions, namely pud_pfn(),
    pud_wrprotect(), pud_mkwrite(), pud_mkdirty() and pud_mkclean()
    which only work on the kernel mapping.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Always use PAGE_KERNEL when re-enabling pages within the kernel
    mapping due to debug pagealloc. Without using this pgprot value
    pte_mkwrite() and pte_wrprotect() won't work on such mappings after an
    unmap -> map cycle anymore.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Use pte_clear() instead of open-coding it.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • _REGION3_ENTRY_RO is a duplicate of _REGION_ENTRY_PROTECT.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Instead of open-coded SEGMENT_KERNEL and REGION3_KERNEL assignments use
    defines. Also to make e.g. pmd_wrprotect() work on the kernel mapping
    a couple more flags must be set. Therefore add the missing flags also.

    In order to make everything symmetrical this patch also adds software
    dirty, young, read and write bits for region 3 table entries.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Usually segment and region tables are 16k aligned due to the way the
    buddy allocator works. This is not true for the vmem code which only
    asks for a 4k alignment. In order to be consistent enforce a 16k
    alignment here as well.

    This alignment will be assumed and therefore is required by the
    pageattr code.

    Signed-off-by: Heiko Carstens
    Acked-by: Christian Borntraeger
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • We have already two inline assemblies which make use of the csp
    instruction. Since I need a third instance let's introduce a generic
    inline assembly which can be used by everyone.

    Signed-off-by: Heiko Carstens
    Acked-by: Martin Schwidefsky
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     
  • Use memdup_user_nul to duplicate a memory region from user-space
    to kernel-space and terminate it with a NUL byte, instead of open coding
    this with kmalloc + copy_from_user and an explicit NUL termination.
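
    A userspace analogue of the pattern being replaced, for illustration
    only: memdup_nul() is a hypothetical helper in which malloc and memcpy
    stand in for kmalloc and copy_from_user.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate len bytes from src and append a terminating NUL byte. */
static char *memdup_nul(const void *src, size_t len)
{
    if (len == SIZE_MAX)        /* len + 1 would overflow */
        return NULL;
    char *p = malloc(len + 1);
    if (!p)
        return NULL;
    memcpy(p, src, len);
    p[len] = '\0';
    return p;
}
```

    The caller owns the returned buffer and has to free it.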

    Signed-off-by: Muhammad Falak R Wani
    [heiko.carstens@de.ibm.com: remove comment]
    Signed-off-by: Heiko Carstens

    Signed-off-by: Martin Schwidefsky

    Muhammad Falak R Wani
     
  • commit 1e133ab296f ("s390/mm: split arch/s390/mm/pgtable.c") factored
    out the page table handling code from __gmap_zap and __s390_reset_cmma
    into ptep_zap_unused and added a simple flag that tells which of the
    two operations (reset or not) is to be performed. This also changed the
    behaviour, as it now also zaps unused page table entries on reset.
    Turns out that this is wrong as s390_reset_cmma uses the page walker,
    which DOES NOT take the ptl lock.

    The simplest fix is to not do the zapping part on reset (which uses
    the walker).

    Signed-off-by: Christian Borntraeger
    Fixes: 1e133ab296f ("s390/mm: split arch/s390/mm/pgtable.c")
    Cc: stable@vger.kernel.org # 4.6+
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     

12 Jun, 2016

7 commits

  • Linus Torvalds
     
  • Pull thermal management fixes from Zhang Rui:

    - fix an ordering issue in cpu cooling where the cooling device is
    registered before it is ready (that is, before freq_table is populated).
    (Lukasz Luba)

    - fix a missing comment update (Caesar Wang)

    * 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
    thermal: add the note for set_trip_temp
    thermal: cpu_cooling: fix improper order during initialization

    Linus Torvalds
     
  • Pull block layer fixes from Jens Axboe:
    "A small collection of fixes for the current series. This contains:

    - Two fixes for xen-blkfront, from Bob Liu.

    - A bug fix for NVMe, releasing only the specific resources we
    requested.

    - Fix for a debugfs flags entry for nbd, from Josef.

    - Plug fix from Omar, fixing up a case of code being switched between
    two functions.

    - A missing bio_put() for the new discard callers of
    submit_bio_wait(), fixing a regression causing a leak of the bio.
    From Shaun.

    - Improve dirty limit calculation precision in the writeback code,
    fixing a case where setting a limit lower than 1% of memory would
    end up being zero. From Tejun"
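
    The precision problem mentioned in the last item can be reproduced with
    integer arithmetic. This is an illustrative sketch and not the
    writeback code: the function names and RATIO_SCALE are hypothetical,
    and limits are expressed in pages.

```c
#include <stdint.h>

#define RATIO_SCALE 4096u   /* finer unit than percent; value is an assumption */

/* Limit derived via a whole-percent ratio: rounds to 0 below 1%. */
static uint64_t limit_via_percent(uint64_t wanted, uint64_t total)
{
    uint64_t ratio = wanted * 100 / total;
    return ratio * total / 100;
}

/* Limit derived via a ratio in 1/RATIO_SCALE units: small values survive. */
static uint64_t limit_via_fine_ratio(uint64_t wanted, uint64_t total)
{
    uint64_t ratio = wanted * RATIO_SCALE / total;
    return ratio * total / RATIO_SCALE;
}
```

    For a wanted limit of 5000 pages out of 1000000 (0.5%), the percent
    based version yields 0 while the finer ratio yields 4882.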

    * 'for-linus' of git://git.kernel.dk/linux-block:
    NVMe: Only release requested regions
    xen-blkfront: fix resume issues after a migration
    xen-blkfront: don't call talk_to_blkback when already connected to blkback
    nbd: pass the nbd pointer for flags debugfs
    block: missing bio_put following submit_bio_wait
    blk-mq: really fix plug list flushing for nomerge queues
    writeback: use higher precision calculation in domain_dirty_limits()

    Linus Torvalds
     
  • Pull GPIO fixes from Linus Walleij:
    "A new bunch of GPIO fixes for v4.7.

    This time I am very grateful that Ricardo Ribalda Delgado went in and
    fixed my stupid refcounting mistakes in the removal path for GPIO
    chips. I had a feeling something was wrong here and so it was. It
    exploded on OMAP and it fixes their problem. Now it should be (more)
    solid.

    The rest is compilation, Kconfig and driver fixes. Some tagged for
    stable.

    Summary:

    - Fix a NULL pointer dereference when we are searching the GPIO
    device list but one of the devices has been removed (struct
    gpio_chip pointer is NULL).

    - Fix unaligned reference counters: we were ending on +3 after all
    said and done. It should be 0. Remove an extraneous get_device(),
    and call cdev_del() followed by device_del() in gpiochip_remove()
    instead and the count goes to zero and calls the release() function
    properly.

    - Fix a compile warning due to a missing #include in the OF/device
    tree portions.

    - Select ANON_INODES for GPIOLIB, we're using that for our character
    device. Some randconfig tests disclosed the problem.

    - Make sure the Zynq driver clock runs also without CONFIG_PM enabled

    - Fix an off-by-one error in the 104-DIO-48E driver

    - Fix warnings in bcm_kona_gpio_reset()"

    * tag 'gpio-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
    gpio: bcm-kona: fix bcm_kona_gpio_reset() warnings
    gpio: select ANON_INODES
    gpio: include in gpiolib-of
    gpiolib: Fix unaligned used of reference counters
    gpiolib: Fix NULL pointer deference
    gpio: zynq: initialize clock even without CONFIG_PM
    gpio: 104-dio-48e: Fix control port offset computation off-by-one error

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Two current fixes:

    - one affects Qemu CD ROM emulation, which stopped working after the
    updates in SCSI to require VPD pages from all conformant devices.

    Fix temporarily by blacklisting Qemu (we can relax later when they
    come into compliance).

    - The other is a fix to the optimal transfer size. We set up a
    minefield for ourselves by being confused about whether the limits
    are in bytes or sectors (SCSI optimal is in blocks and the queue
    parameter is in bytes).

    This tries to fix the problem (wrong setting for queue limits
    max_sectors) and make the problem more obvious by introducing a
    wrapper function"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    sd: Fix rw_max for devices that report an optimal xfer size
    scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist

    Linus Torvalds
     
  • Pull i2c fixes from Wolfram Sang:

    - a bigger fix for i801 to finally be able to be loaded on some
    machines again

    - smaller driver fixes

    - documentation update because of a renamed file

    * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    i2c: mux: reg: Provide of_match_table
    i2c: mux: refer to i2c-mux.txt
    i2c: octeon: Avoid printk after too long SMBUS message
    i2c: octeon: Missing AAK flag in case of I2C_M_RECV_LEN
    i2c: i801: Allow ACPI SystemIO OpRegion to conflict with PCI BAR

    Linus Torvalds
     
  • Pull DeviceTree fixes from Rob Herring:

    - fix unflatten_dt_nodes when dad parameter is set.

    - add vendor prefixes for TechNexion and UniWest

    - documentation fix for Marvell BT

    - OF IRQ kerneldoc fixes

    - restrict CMA alignment adjustments to non dma-coherent

    - a couple of warning fixes in reserved-memory code

    - DT maintainers updates

    * tag 'devicetree-fixes-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
    drivers: of: add definition of early_init_dt_alloc_reserved_memory_arch
    drivers/of: Fix depth for sub-tree blob in unflatten_dt_nodes()
    drivers: of: Fix of_pci.h header guard
    dt-bindings: Add vendor prefix for TechNexion
    of: add vendor prefix for UniWest
    dt: bindings: fix documentation for MARVELL's bt-sd8xxx wireless device
    of: add missing const for of_parse_phandle_with_args() in !CONFIG_OF
    of: silence warnings due to max() usage
    drivers: of: of_reserved_mem: fixup the CMA alignment not to affect dma-coherent
    of: irq: fix of_irq_get[_byname]() kernel-doc
    MAINTAINERS: DeviceTree maintainer updates

    Linus Torvalds