19 Sep, 2017

1 commit

  • The Fujitsu-Siemens Lifebook S6120 misdetects the cable type for some
    drives. The problematic one in this case is an mSATA SSD hooked up via an
    mSATA->PATA bridge. With regular hard disks the detection seems to work
    correctly.

    Strangely, an older Lifebook model (S6020) detects the cable as 80c
    with the mSATA SSD, even when using the exact same flex cable.

    Cc: Tejun Heo
    Signed-off-by: Ville Syrjälä
    Signed-off-by: Tejun Heo

    Ville Syrjälä
     

08 Sep, 2017

10 commits

  • gcc-7 warns about the result of a constant multiplication used as
    a boolean:

    drivers/ata/libata-core.c: In function 'ata_timing_quantize':
    drivers/ata/libata-core.c:3164:30: warning: '*' in boolean context, suggest '&&' instead [-Wint-in-bool-context]

    This slightly rearranges the macro to simplify the code and avoid
    the warning at the same time (a minimal illustration of the pattern
    follows this entry).

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Tejun Heo

    Arnd Bergmann
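    Not the libata macro itself, but a minimal, hypothetical illustration of
    the same warning pattern and of the kind of rearrangement that avoids it
    (SCALE and the QUANT_* names are made up for the example):

      #include <stdio.h>

      #define SCALE 1000

      /* The product is used directly as a ?: condition; gcc-7 with -Wall
       * reports -Wint-in-bool-context for this shape of macro. */
      #define QUANT_WARNS(v)  (((v) * SCALE) ? ((((v) * SCALE) + 7) / 8) : 0)

      /* Testing the operand instead of the product avoids the boolean
       * context while computing the same value, since the product is zero
       * exactly when v is zero. */
      #define QUANT_CLEAN(v)  ((v) ? ((((v) * SCALE) + 7) / 8) : 0)

      int main(void)
      {
              printf("%d %d\n", QUANT_WARNS(3), QUANT_CLEAN(3));
              return 0;
      }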
     
  • Pull media updates from Mauro Carvalho Chehab:
    "Brazil's Independence Day pull request :-)

    This is one of the biggest media pull requests, with 625 patches
    affecting almost all parts of media (RC, DVB, V4L2, CEC, docs).

    This contains:

    - A lot of new drivers:
    * DVB frontends: mxl5xx, stv0910, stv6111;
    * camera flash: as3645a led driver;
    * HDMI receiver: adv748X;
    * camera sensor: Omnivision 6650 5M driver (ov6650);
    * HDMI CEC: ao-cec meson driver;
    * V4L2: Qualcomm camss driver;
    * Remote controller: gpio-ir-tx, pwm-ir-tx and zx-irdec drivers.

    - The DDbridge DVB driver got a massive update, which brings it in sync
    with modern hardware from that vendor;

    - There's an important milestone in this series: the DVB
    documentation was written in 2003, but only started to be updated
    in 2007. It also used to contain several gaps from the time it was
    kept out of tree, mentioning error codes and device nodes that
    never existed upstream. In this series, it received a massive
    update: all non-deprecated digital TV APIs are now in sync with the
    current implementation;

    - Some DVB APIs that aren't used by any upstream driver got removed;

    - Other parts of the media documentation also got updated, fixing
    some bugs in its PDF output and making it compatible with Sphinx
    version 1.6.

    As the number of hacks required to build the PDF output has been
    reduced, I hope we'll have fewer troubles as newer versions of our
    documentation toolchain are released (famous last words);

    - As usual, lots of driver cleanups and improvements"

    * tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (624 commits)
    media: leds: as3645a: add V4L2_FLASH_LED_CLASS dependency
    media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers
    media: Revert "[media] v4l: async: make v4l2 coexist with devicetree nodes in a dt overlay"
    media: staging: atomisp: sh_css_calloc shall return a pointer to the allocated space
    media: Revert "[media] lirc_dev: remove superfluous get/put_device() calls"
    media: add qcom_camss.rst to v4l-drivers rst file
    media: dvb headers: make checkpatch happier
    media: dvb uapi: move frontend legacy API to another part of the book
    media: pixfmt-srggb12p.rst: better format the table for PDF output
    media: docs-rst: media: Don't use \small for V4L2_PIX_FMT_SRGGB10 documentation
    media: index.rst: don't write "Contents:" on PDF output
    media: pixfmt*.rst: replace a two dots by a comma
    media: vidioc-g-fmt.rst: adjust table format
    media: vivid.rst: add a blank line to correct ReST format
    media: v4l2 uapi book: get rid of driver programming's chapter
    media: format.rst: use the right markup for important notes
    media: docs-rst: cardlists: change their format to flat-tables
    media: em28xx-cardlist.rst: update to reflect last changes
    media: v4l2-event.rst: adjust table to fit on PDF output
    media: docs: don't show ToC for each part on PDF output
    ...

    Linus Torvalds
     
  • Pull sound updates from Takashi Iwai:
    "We have touched quite a lot of files but with fewer changes at this
    cycle; as you can see, most of changes are trivial fixes, especially
    constification patches.

    Among the massive attacks by constification gangs, we had a few core
    changes (mostly for the ASoC core), as well as the fixes and the updates
    by major vendors.

    Some highlights:

    ALSA core:

    - Fix possible races in control API user-TLV codes

    - Small cleanup of PCM core

    ASoC:

    - Continued work for componentization; still half-baked, but we're
    certainly progressing

    - Use of devres for jack detection GPIOs, mostly as a cleanup

    - Jack detection support for Qualcomm MSM8916

    - Support for Allwinner H3, Cirrus Logic CS43130, Intel Kabylake
    systems with RT5663, Realtek RT274, TI TLV320AIC32x6 and Wolfson
    WM8523"

    * tag 'sound-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (512 commits)
    ALSA: hda/ca0132 - Fix memory leak at error path
    ALSA: hda: Fix forget to free resource in error handling code path in hda_codec_driver_probe
    ASoC: cs43130: Fix unused compiler warnings for PM runtime
    ASoC: cs43130: Fix possible Oops with invalid dev_id
    ASoC: cs43130: fix spelling mistake: "irq_occurrance" -> "irq_occurrence"
    ALSA: atmel: Remove leftovers of AVR32 removal
    ALSA: atmel: convert AC97c driver to GPIO descriptor API
    ALSA: hda/realtek - Enable jack detection function for Intel ALC700
    ALSA: hda: Fix regression of hdmi eld control created based on invalid pcm
    ASoC: Intel: Skylake: Add IPC to configure the copier secondary pins
    ASoC: add missing compile rule for max98371
    ASoC: add missing compile rule for sirf-audio-codec
    ASoC: add missing compile rule for max98371
    ASoC: cs43130: Add devicetree bindings for CS43130
    ASoC: cs43130: Add support for CS43130 codec
    ASoC: make clock direction configurable in asoc-simple
    ALSA: ctxfi: Remove null check before kfree
    ASoC: max98927: Changed device property read function
    ASoC: max98927: Modified DAPM widget and map to enable/disable VI sense path
    ASoC: max98927: Added PM suspend and resume function
    ...

    Linus Torvalds
     
  • Pull MD updates from Shaohua Li:
    "This update mainly fixes bugs:

    - Make raid5 ppl support several ppls, from Pawel

    - Several raid5-cache bug fixes from Song

    - Bitmap fixes from Neil and Me

    - One raid1/10 regression fix since 4.12 from Me

    - Other small fixes and cleanup"

    * tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    md/bitmap: disable bitmap_resize for file-backed bitmaps.
    raid5-ppl: Recovery support for multiple partial parity logs
    md: Runtime support for multiple ppls
    md/raid0: attach correct cgroup info in bio
    lib/raid6: align AVX512 constants to 512 bits, not bytes
    raid5: remove raid5_build_block
    md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
    md: replace seq_release_private with seq_release
    md: notify about new spare disk in the container
    md/raid1/10: reset bio allocated from mempool
    md/raid5: release/flush io in raid5_do_work()
    md/bitmap: copy correct data for bitmap super

    Linus Torvalds
     
  • Pull MMC updates from Ulf Hansson:
    "MMC core:
    - Continue to refactor the mmc block code to prepare for blkmq
    - Move mmc block debugfs into block module
    - Next step for eMMC CMDQ by adding a new mmc host interface for it
    - Move Kconfig option MMC_DEBUG from core to host
    - Some additional minor improvements

    MMC host:
    - Declare structs as const when applicable
    - Explicitly request exclusive reset control when applicable
    - Improve some error paths and other various cleanups
    - sdhci: Preparations to support SDHCI OMAP
    - sdhci: Improve some PM related code
    - sdhci: Re-factoring and modernizations
    - sdhci-xenon: Add runtime PM and system sleep support
    - sdhci-xenon: Add support for eMMC HS400 Enhanced Strobe
    - sdhci-cadence: Add system sleep support
    - sdhci-of-at91: Improve system sleep support
    - dw_mmc: Add support for Hisilicon hi3660
    - sunxi: Add support for A83T eMMC
    - sunxi: Add support for DDR52 mode
    - meson-gx: Add support for UHS-I SD-cards
    - meson-gx: Cleanups and improvements
    - tmio: Fix CMD12 (STOP) handling
    - tmio: Cleanups and improvements
    - renesas_sdhi: Add r8a7743/5 support
    - renesas-sdhi: Add support for R-Car Gen3 SDHI DMAC
    - renesas_sdhi: Cleanups and improvements"

    * tag 'mmc-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (145 commits)
    mmc: renesas_sdhi: Add r8a7743/5 support
    mmc: meson-gx: fix __ffsdi2 undefined on arm32
    mmc: sdhci-xenon: add runtime pm support and reimplement standby
    mmc: core: Move mmc_start_areq() declaration
    mmc: mmci: stop building qcom dml as module
    mmc: sunxi: Reset the device at probe time
    clk: sunxi-ng: Provide a default reset hook
    mmc: meson-gx: rework tuning function
    mmc: meson-gx: change default tx phase
    mmc: meson-gx: implement voltage switch callback
    mmc: meson-gx: use CCF to handle the clock phases
    mmc: meson-gx: implement card_busy callback
    mmc: meson-gx: simplify interrupt handler
    mmc: meson-gx: work around clk-stop issue
    mmc: meson-gx: fix dual data rate mode frequencies
    mmc: meson-gx: rework clock init function
    mmc: meson-gx: rework clk_set function
    mmc: meson-gx: rework set_ios function
    mmc: meson-gx: cfg init overwrite values
    mmc: meson-gx: initialize sane clk default before clock register
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     
  • Pull xen updates from Juergen Gross:

    - the new pvcalls backend for routing socket calls from a guest to dom0

    - some cleanups of Xen code

    - a fix for wrong usage of {get,put}_cpu()

    * tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (27 commits)
    xen/mmu: set MMU_NORMAL_PT_UPDATE in remap_area_mfn_pte_fn
    xen: Don't try to call xen_alloc_p2m_entry() on autotranslating guests
    xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()
    xen/pvcalls: use WARN_ON(1) instead of __WARN()
    xen: remove not used trace functions
    xen: remove unused function xen_set_domain_pte()
    xen: remove tests for pvh mode in pure pv paths
    xen-platform: constify pci_device_id.
    xen: cleanup xen.h
    xen: introduce a Kconfig option to enable the pvcalls backend
    xen/pvcalls: implement write
    xen/pvcalls: implement read
    xen/pvcalls: implement the ioworker functions
    xen/pvcalls: disconnect and module_exit
    xen/pvcalls: implement release command
    xen/pvcalls: implement poll command
    xen/pvcalls: implement accept command
    xen/pvcalls: implement listen command
    xen/pvcalls: implement bind command
    xen/pvcalls: implement connect command
    ...

    Linus Torvalds
     
  • Pull powerpc updates from Michael Ellerman:
    "Nothing really major this release, despite quite a lot of activity.
    Just lots of things all over the place.

    Some things of note include:

    - Access via perf to a new type of PMU (IMC) on Power9, which can
    count both core events as well as nest unit events (Memory
    controller etc).

    - Optimisations to the radix MMU TLB flushing, mostly to avoid
    unnecessary Page Walk Cache (PWC) flushes when the structure of the
    tree is not changing.

    - Reworks/cleanups of do_page_fault() to modernise it and bring it
    closer to other architectures where possible.

    - Rework of our page table walking so that THP updates only need to
    send IPIs to CPUs where the affected mm has run, rather than all
    CPUs.

    - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
    systems. This avoids problems with the percpu allocator on systems
    with very sparse NUMA layouts.

    - STRICT_KERNEL_RWX support on PPC32.

    - A new sched domain topology for Power9, to capture the fact that
    pairs of cores may share an L2 cache.

    - Power9 support for VAS, which is a new mechanism for accessing
    coprocessors, and initial support for using it with the NX
    compression accelerator.

    - Major work on the instruction emulation support, adding support for
    many new instructions, and reworking it so it can be used to
    implement the emulation needed to fixup alignment faults.

    - Support for guests under PowerVM to use the Power9 XIVE interrupt
    controller.

    And probably that many things again that are almost as interesting,
    but I had to keep the list short. Plus the usual fixes and cleanups as
    always.

    Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
    Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
    Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
    Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
    Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
    Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
    LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
    Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
    Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
    Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
    Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
    Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

    * tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
    powerpc/xive: Fix section __init warning
    powerpc: Fix kernel crash in emulation of vector loads and stores
    powerpc/xive: improve debugging macros
    powerpc/xive: add XIVE Exploitation Mode to CAS
    powerpc/xive: introduce H_INT_ESB hcall
    powerpc/xive: add the HW IRQ number under xive_irq_data
    powerpc/xive: introduce xive_esb_write()
    powerpc/xive: rename xive_poke_esb() in xive_esb_read()
    powerpc/xive: guest exploitation of the XIVE interrupt controller
    powerpc/xive: introduce a common routine xive_queue_page_alloc()
    powerpc/sstep: Avoid used uninitialized error
    axonram: Return directly after a failed kzalloc() in axon_ram_probe()
    axonram: Improve a size determination in axon_ram_probe()
    axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
    powerpc/powernv/npu: Move tlb flush before launching ATSD
    powerpc/macintosh: constify wf_sensor_ops structures
    powerpc/iommu: Use permission-specific DEVICE_ATTR variants
    powerpc/eeh: Delete an error out of memory message at init time
    powerpc/mm: Use seq_putc() in two functions
    macintosh: Convert to using %pOF instead of full_name
    ...

    Linus Torvalds
     
  • Pull EFI updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Transparently fall back to other poweroff method(s) if EFI poweroff
    fails (and returns)

    - Use separate PE/COFF section headers for the RX and RW parts of the
    ARM stub loader so that the firmware can use strict mapping
    permissions

    - Add support for requesting the firmware to wipe RAM at warm reboot

    - Increase the size of the random seed obtained from UEFI so CRNG
    fast init can complete earlier

    - Update the EFI framebuffer address if it points to a BAR that gets
    moved by the PCI resource allocation code

    - Enable "reset attack mitigation" of TPM environments: this is
    enabled if the kernel is configured with
    CONFIG_RESET_ATTACK_MITIGATION=y.

    - Clang related fixes

    - Misc cleanups, constification, refactoring, etc"

    * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    efi/bgrt: Use efi_mem_type()
    efi: Move efi_mem_type() to common code
    efi/reboot: Make function pointer orig_pm_power_off static
    efi/random: Increase size of firmware supplied randomness
    efi/libstub: Enable reset attack mitigation
    firmware/efi/esrt: Constify attribute_group structures
    firmware/efi: Constify attribute_group structures
    firmware/dcdbas: Constify attribute_group structures
    arm/efi: Split zImage code and data into separate PE/COFF sections
    arm/efi: Replace open coded constants with symbolic ones
    arm/efi: Remove pointless dummy .reloc section
    arm/efi: Remove forbidden values from the PE/COFF header
    drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it
    efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns
    efi/arm/arm64: Add missing assignment of efi.config_table
    efi/libstub/arm64: Set -fpie when building the EFI stub
    efi/libstub/arm64: Force 'hidden' visibility for section markers
    efi/libstub/arm64: Use hidden attribute for struct screen_info reference
    efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP

    Linus Torvalds
     
  • Pull x86 platform updates from Ingo Molnar:
    "The main changes include various Hyper-V optimizations such as faster
    hypercalls and faster/better TLB flushes - and there's also some
    Intel-MID cleanups"

    * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()
    x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls
    x86/platform/intel-mid: Make several arrays static, to make code smaller
    MAINTAINERS: Add missed file for Hyper-V
    x86/hyper-v: Use hypercall for remote TLB flush
    hyper-v: Globalize vp_index
    x86/hyper-v: Implement rep hypercalls
    hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT
    x86/hyper-v: Introduce fast hypercall implementation
    x86/hyper-v: Make hv_do_hypercall() inline
    x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set
    x86/platform/intel-mid: Make 'bt_sfi_data' const
    x86/platform/intel-mid: Make IRQ allocation a bit more flexible
    x86/platform/intel-mid: Group timers callbacks together

    Linus Torvalds
     

07 Sep, 2017

29 commits

  • Pull libata updates from Tejun Heo:
    "Except for the ahci fix that fixes a boot issue, nothing major in this
    pull request. Some new platform controller support and device specific
    changes"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    libata: zpodd: make arrays cdb static, reduces object code size
    ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme
    dt-bindings: ata: add DT bindings for MediaTek SATA controller
    ata: mediatek: add support for MediaTek SATA controller
    pata_octeon_cf: use of_property_read_{bool|u32}()
    cs5536: add support for IDE controller variant
    ata: sata_gemini: Introduce explicit IDE pin control
    ata: sata_gemini: Retire custom pin control
    ata: ahci_platform: Add shutdown handler
    ata: sata_gemini: explicitly request exclusive reset control
    ata: Drop unnecessary static
    ata: Convert to using %pOF instead of full_name

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "Several notable changes this cycle:

    - Thread mode was merged. This will be used for cgroup2 support for
    CPU and possibly other controllers. Unfortunately, CPU controller
    cgroup2 support didn't make this pull request but most contentions
    have been resolved and the support is likely to be merged before
    the next merge window.

    - cgroup.stat now shows the number of descendant cgroups.

    - cpuset now can enable the easier-to-configure v2 behavior on v1
    hierarchy"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cpuset: Allow v2 behavior in v1 cgroup
    cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
    cgroup: remove unneeded checks
    cgroup: misc changes
    cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
    cgroup: re-use the parent pointer in cgroup_destroy_locked()
    cgroup: add cgroup.stat interface with basic hierarchy stats
    cgroup: implement hierarchy limits
    cgroup: keep track of number of descent cgroups
    cgroup: add comment to cgroup_enable_threaded()
    cgroup: remove unnecessary empty check when enabling threaded mode
    cgroup: update debug controller to print out thread mode information
    cgroup: implement cgroup v2 thread support
    cgroup: implement CSS_TASK_ITER_THREADED
    cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
    cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
    cgroup: reorganize cgroup.procs / task write path
    cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
    cgroup: distinguish local and children populated states
    cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "Nothing major. I introduced a flag collsion bug during v4.13 cycle
    which is fixed in this pull request. Fortunately, the flag is for
    debugging / verification and the bug isn't critical"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Fix flag collision
    workqueue: Use TASK_IDLE
    workqueue: fix path to documentation
    workqueue: doc change for ST behavior on NUMA systems

    Linus Torvalds
     
  • Pull percpu updates from Tejun Heo:
    "A lot of changes for percpu this time around. percpu inherited the
    same area allocator from the original pre-virtual-address-mapped
    implementation. This was from the time when percpu allocator wasn't
    used all that much and the implementation was focused on simplicity,
    with the unfortunate computational complexity of O(number of areas
    allocated from the chunk) per alloc / free.

    With the increase in percpu usage, we're hitting cases where the lack
    of scalability is hurting. The most prominent one right now is bpf
    percpu map creation / destruction which may allocate and free a lot of
    entries consecutively and it's likely that the problem will become
    more prominent in the future.

    To address the issue, Dennis replaced the area allocator with a hinted
    bitmap allocator, which is more consistent. While the new allocator
    does perform a bit worse in some cases, it outperforms the old
    allocator by way more than an order of magnitude in other, more common
    scenarios while staying mostly flat in CPU overhead and completely
    flat in memory consumption"

    (A toy sketch of the bitmap-with-hint idea follows this entry.)

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits)
    percpu: update header to contain bitmap allocator explanation.
    percpu: update pcpu_find_block_fit to use an iterator
    percpu: use metadata blocks to update the chunk contig hint
    percpu: update free path to take advantage of contig hints
    percpu: update alloc path to only scan if contig hints are broken
    percpu: keep track of the best offset for contig hints
    percpu: skip chunks if the alloc does not fit in the contig hint
    percpu: add first_bit to keep track of the first free in the bitmap
    percpu: introduce bitmap metadata blocks
    percpu: replace area map allocator with bitmap
    percpu: generalize bitmap (un)populated iterators
    percpu: increase minimum percpu allocation size and align first regions
    percpu: introduce nr_empty_pop_pages to help empty page accounting
    percpu: change the number of pages marked in the first_chunk pop bitmap
    percpu: combine percpu address checks
    percpu: modify base_addr to be region specific
    percpu: setup_first_chunk rename schunk/dchunk to chunk
    percpu: end chunk area maps page aligned for the populated bitmap
    percpu: unify allocation of schunk and dchunk
    percpu: setup_first_chunk remove dyn_size and consolidate logic
    ...

    Linus Torvalds
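    The merge text above stays at the design level. Purely as a toy
    illustration of the bitmap-with-hint idea (not the kernel's
    implementation, and with a byte map standing in for the bitmap), an
    allocator of this shape looks roughly like:

      #include <string.h>

      #define UNITS 1024                      /* allocation units per chunk */

      struct chunk {
              unsigned char used[UNITS];      /* toy byte map instead of a real bitmap */
              int first_free;                 /* hint: no free unit exists below this index */
      };

      /* First-fit allocation of n contiguous units; returns start index or -1. */
      static int chunk_alloc(struct chunk *c, int n)
      {
              for (int i = c->first_free; i + n <= UNITS; i++) {
                      int run = 0;

                      while (run < n && !c->used[i + run])
                              run++;
                      if (run == n) {
                              memset(&c->used[i], 1, n);
                              if (i == c->first_free)
                                      c->first_free = i + n;  /* advance the hint */
                              return i;
                      }
                      i += run;               /* skip past the busy unit we hit */
              }
              return -1;
      }

      static void chunk_free(struct chunk *c, int start, int n)
      {
              memset(&c->used[start], 0, n);
              if (start < c->first_free)      /* freeing below the hint lowers it */
                      c->first_free = start;
      }

    Judging from the shortlog, the real allocator additionally keeps
    per-block contiguous-run ("contig") hints so scans can skip chunks and
    blocks that cannot possibly fit a request.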
     
  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton : (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • While debugging a problem, I thought that using
    cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be
    helpful. It turns out to be counterproductive.

    Add a comment documenting how this works.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • When Linux brings a CPU down and back up, it switches to init_mm and then
    loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
    of masking off the ASID bits in CR3.

    This can result in some confusion in the TLB handling code. If we
    bring a CPU down and back up with any ASID other than 0, we end up
    with the wrong ASID active on the CPU after resume. This could
    cause our internal state to become corrupt, although major
    corruption is unlikely because init_mm doesn't have any user pages.
    More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
    in the next context switch. The result of *that* is a failure to
    resume from suspend with probability 1 - 1/6^(cpus-1).

    Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.

    Reported-by: Linus Torvalds
    Reported-by: Jiri Kosina
    Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically (a
    minimal userspace sketch of the new semantics follows this entry).

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
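    A minimal userspace sketch of the semantics described above; it assumes a
    kernel with this series applied, and defines MADV_WIPEONFORK locally in
    case the installed libc headers don't carry it yet:

      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #ifndef MADV_WIPEONFORK
      #define MADV_WIPEONFORK 18      /* uapi value introduced by this series */
      #endif

      int main(void)
      {
              size_t len = 4096;
              char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

              if (p == MAP_FAILED)
                      return 1;
              strcpy(p, "parent secret");

              /* Only private anonymous VMAs are accepted; on older kernels
               * this fails with EINVAL and the child sees the data as usual. */
              if (madvise(p, len, MADV_WIPEONFORK) != 0)
                      perror("madvise(MADV_WIPEONFORK)");

              if (fork() == 0) {
                      /* The range is still mapped, but reads back as zeroes. */
                      printf("child sees:  \"%s\" (first byte %d)\n", p, p[0]);
                      _exit(0);
              }
              wait(NULL);
              printf("parent sees: \"%s\"\n", p);
              return 0;
      }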
     
  • Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    This patch (of 2):

    MPX only seems to be available on 64 bit CPUs, starting with Skylake and
    Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
    order to free up a VMA flag.

    Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
    Signed-off-by: Rik van Riel
    Acked-by: Dave Hansen
    Cc: Mike Kravetz
    Cc: Florian Weimer
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Will Drewry
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Colm MacCártaigh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • /proc/pid/smaps_rollup is a new proc file that improves the performance
    of user programs that determine aggregate memory statistics (e.g., total
    PSS) of a process.

    Android regularly "samples" the memory usage of various processes in
    order to balance its memory pool sizes. This sampling process involves
    opening /proc/pid/smaps and summing certain fields. For very large
    processes, sampling memory use this way can take several hundred
    milliseconds, due mostly to the overhead of the seq_printf calls in
    task_mmu.c.

    smaps_rollup improves the situation. It contains most of the fields of
    /proc/pid/smaps, but instead of a set of fields for each VMA,
    smaps_rollup instead contains one synthetic smaps-format entry
    representing the whole process. In the single smaps_rollup synthetic
    entry, each field is the summation of the corresponding field in all of
    the real-smaps VMAs. Using a common format for smaps_rollup and smaps
    allows userspace parsers to repurpose parsers meant for use with
    non-rollup smaps for smaps_rollup, and it allows userspace to switch
    between smaps_rollup and smaps at runtime (say, based on the
    availability of smaps_rollup in a given kernel) with minimal fuss.

    By using smaps_rollup instead of smaps, a caller can avoid the
    significant overhead of formatting, reading, and parsing each of a large
    process's potentially very numerous memory mappings. For sampling
    system_server's PSS in Android, we measured a 12x speedup, representing
    a savings of several hundred milliseconds.

    One alternative to a new per-process proc file would have been including
    PSS information in /proc/pid/status. We considered this option but
    thought that PSS would be too expensive (by a few orders of magnitude)
    to collect relative to what's already emitted as part of
    /proc/pid/status, and slowing every user of /proc/pid/status for the
    sake of readers that happen to want PSS feels wrong.

    The code itself works by reusing the existing VMA-walking framework we
    use for regular smaps generation and keeping the mem_size_stats
    structure around between VMA walks instead of using a fresh one for each
    VMA. In this way, summation happens automatically. We let seq_file
    walk over the VMAs just as it does for regular smaps and simply emit
    nothing to the seq_file until we hit the last VMA. (A minimal userspace
    reader sketch follows this entry.)

    Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

    We're using the PSS samples we collect asynchronously for
    system-management tasks like fine-tuning oom_adj_score, memory use
    tracking for debugging, application-level memory-use attribution, and
    deciding whether we want to kill large processes during system idle
    maintenance windows. Android has been using PSS for these purposes for
    a long time; as the average process VMA count has increased and
    devices have become more efficiency-conscious, PSS-collection inefficiency
    has started to matter more. IMHO, it'd be a lot safer to optimize the
    existing PSS-collection model, which has been fine-tuned over the years,
    instead of changing the memory tracking approach entirely to work around
    smaps-generation inefficiency.

    Tim said:

    : There are two main reasons why Android gathers PSS information:
    :
    : 1. Android devices can show the user the amount of memory used per
    : application via the settings app. This is a less important use case.
    :
    : 2. We log PSS to help identify leaks in applications. We have found
    : an enormous number of bugs (in the Android platform, in Google's own
    : apps, and in third-party applications) using this data.
    :
    : To do this, system_server (the main process in Android userspace) will
    : sample the PSS of a process three seconds after it changes state (for
    : example, app is launched and becomes the foreground application) and about
    : every ten minutes after that. The net result is that PSS collection is
    : regularly running on at least one process in the system (usually a few
    : times a minute while the screen is on, less when screen is off due to
    : suspend). PSS of a process is an incredibly useful stat to track, and we
    : aren't going to get rid of it. We've looked at some very hacky approaches
    : using RSS ("take the RSS of the target process, subtract the RSS of the
    : zygote process that is the parent of all Android apps") to reduce the
    : accounting time, but it regularly overestimated the memory used by 20+
    : percent. Accordingly, I don't think that there's a good alternative to
    : using PSS.
    :
    : We started looking into PSS collection performance after we noticed random
    : frequency spikes while a phone's screen was off; occasionally, one of the
    : CPU clusters would ramp to a high frequency because there was 200-300ms of
    : constant CPU work from a single thread in the main Android userspace
    : process. The work causing the spike (which is reasonable governor
    : behavior given the amount of CPU time needed) was always PSS collection.
    : As a result, Android is burning more power than we should be on PSS
    : collection.
    :
    : The other issue (and why I'm less sure about improving smaps as a
    : long-term solution) is that the number of VMAs per process has increased
    : significantly from release to release. After trying to figure out why we
    : were seeing these 200-300ms PSS collection times on Android O but had not
    : noticed it in previous versions, we found that the number of VMAs in the
    : main system process increased by 50% from Android N to Android O (from
    : ~1800 to ~2700) and varying increases in every userspace process. Android
    : M to N also had an increase in the number of VMAs, although not as much.
    : I'm not sure why this is increasing so much over time, but thinking about
    : ASLR and ways to make ASLR better, I expect that this will continue to
    : increase going forward. I would not be surprised if we hit 5000 VMAs on
    : the main Android process (system_server) by 2020.
    :
    : If we assume that the number of VMAs is going to increase over time, then
    : doing anything we can do to reduce the overhead of each VMA during PSS
    : collection seems like the right way to go, and that means outputting an
    : aggregate statistic (to avoid whatever overhead there is per line in
    : writing smaps and in reading each line from userspace).

    Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
    Signed-off-by: Daniel Colascione
    Cc: Tim Murray
    Cc: Joel Fernandes
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sonny Rao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Colascione
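    A minimal sketch of a userspace consumer, assuming a kernel that exposes
    /proc/<pid>/smaps_rollup; it just pulls the aggregated Pss field out of
    the single synthetic entry:

      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/proc/self/smaps_rollup", "r");
              char line[256];

              if (!f) {
                      perror("/proc/self/smaps_rollup");
                      return 1;
              }
              /* One synthetic smaps-format entry; an existing smaps parser
               * can be pointed at it unchanged. */
              while (fgets(line, sizeof(line), f)) {
                      unsigned long kb;

                      if (sscanf(line, "Pss: %lu kB", &kb) == 1)
                              printf("total PSS: %lu kB\n", kb);
              }
              fclose(f);
              return 0;
      }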
     
  • Huge pages help to reduce the TLB miss rate, but they have a larger
    cache footprint, which can sometimes cause problems. For example, when
    clearing a huge page on an x86_64 platform, the cache footprint is 2M.
    But on a Xeon E5 v3 2699 CPU there are 18 cores, 36 threads, and only
    45M of LLC (last level cache). That is, on average, there is 2.5M of LLC
    for each core and 1.25M of LLC for each thread.

    If cache pressure is heavy when clearing the huge page, and we clear
    the huge page from beginning to end, it is possible that the beginning
    of the huge page has been evicted from the cache by the time we finish
    clearing its end. And it is possible for the application to access the
    beginning of the huge page right after it has been cleared.

    To help with this situation, this patch changes the order in which the
    sub-pages of a huge page are cleared. In many situations we can get the
    address that the application will access after we clear the huge page,
    for example in a page fault handler. Instead of clearing the huge page
    from beginning to end, we clear the sub-pages farthest from the sub-page
    to be accessed first, and clear that sub-page last. This makes the
    sub-page to be accessed most cache-hot and the sub-pages around it more
    cache-hot too. If we cannot know the address the application will
    access, the beginning of the huge page is assumed to be the address the
    application will access. (A simplified sketch of this ordering follows
    this entry.)

    With this patch, throughput increases by ~28.3% in the vm-scalability
    anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case creates 72 processes; each
    process mmaps a big anonymous memory area and writes to it from
    beginning to end. For each process, the other processes can be seen as a
    background workload that generates heavy cache pressure. At the same
    time, the cache miss rate is reduced from ~33.4% to ~31.7%, the IPC
    (instructions per cycle) increases from 0.56 to 0.74, and the time spent
    in user space is reduced by ~7.9%.

    Christopher Lameter suggested clearing the bytes inside a sub-page from
    end to beginning too, but tests show no visible performance difference,
    probably because the size of a page is small compared with the cache
    size.

    Thanks to Andi Kleen for proposing to use the address to be accessed to
    determine the order in which the sub-pages are cleared.

    The hugetlbfs access address handling could be improved; that will be
    done in another patch.

    [ying.huang@intel.com: improve readability of clear_huge_page()]
    Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
    Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
    Suggested-by: Andi Kleen
    Signed-off-by: "Huang, Ying"
    Acked-by: Jan Kara
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Nadia Yvette Chambers
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
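    A simplified, userspace-flavoured sketch of the clearing order described
    above (not the kernel's clear_huge_page(); 4K base pages and a 2M huge
    page are assumed): sub-pages are cleared farthest-from-target first and
    the target sub-page last, so it and its neighbours stay cache-hot:

      #include <string.h>

      #define SUBPAGES_PER_HUGE_PAGE 512      /* 2M huge page / 4K base pages */
      #define SUBPAGE_SIZE           4096

      /* Clear the sub-pages of a huge page starting with those farthest
       * from the sub-page the application is about to touch, finishing
       * with the target itself. */
      static void clear_huge_page_towards(void *huge_page, int target)
      {
              char *base = huge_page;

              for (int d = SUBPAGES_PER_HUGE_PAGE - 1; d >= 0; d--) {
                      int left  = target - d;
                      int right = target + d;

                      if (left >= 0)
                              memset(base + (size_t)left * SUBPAGE_SIZE,
                                     0, SUBPAGE_SIZE);
                      if (right < SUBPAGES_PER_HUGE_PAGE && right != left)
                              memset(base + (size_t)right * SUBPAGE_SIZE,
                                     0, SUBPAGE_SIZE);
              }
      }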
     
  • This is purely required because exit_aio() may block and exit_mmap() may
    never start, if the oom_reap_task cannot start running on a mm with
    mm_users == 0.

    At the same time if the OOM reaper doesn't wait at all for the memory of
    the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
    generate a spurious OOM kill.

    If it weren't for exit_aio or similar blocking functions in the last
    mmput, it would be enough to change oom_reap_task(), in the case where
    it finds mm_users == 0, to wait for a timeout or to wait for __mmput to
    set MMF_OOM_SKIP itself. But the problem here is not just exit_mmap, so
    the concurrency of exit_mmap and oom_reap_task is apparently warranted.

    It's a non-standard runtime: exit_mmap() runs without mmap_sem, and
    oom_reap_task runs with the mmap_sem for reading as usual (kind of like
    MADV_DONTNEED).

    The race between the two is solved with a combination of
    tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
    (serialized by a dummy down_write/up_write cycle on the same lines of
    the ksm_exit method).

    If oom_reap_task() may be running concurrently during exit_mmap,
    exit_mmap will wait for it to finish in down_write (before taking down
    mm structures that would make oom_reap_task fail with a use-after-free).

    If exit_mmap comes first, oom_reap_task() will skip the mm if
    MMF_OOM_SKIP is already set and in turn all memory is already freed and
    furthermore the mm data structures may already have been taken down by
    free_pgtables.

    [aarcange@redhat.com: incremental one liner]
    Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
    [rientjes@google.com: remove unused mmput_async]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
    [aarcange@redhat.com: microoptimization]
    Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
    Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
    Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Reported-by: David Rientjes
    Tested-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If the system has more than one swap device and the swap devices have
    node information, we can make use of this information to decide which
    swap device to use in get_swap_pages() to get better performance.

    The current code uses a priority based list, swap_avail_list, to decide
    which swap device to use, and if multiple swap devices share the same
    priority, they are used round robin. This patch changes the previously
    single global swap_avail_list into a per-numa-node list, i.e. each
    numa node sees its own priority based list of available swap devices. A
    swap device's priority can be promoted on its matching node's
    swap_avail_list.

    A swap device's priority is currently set as follows: the user can set a
    value >= 0, or the system will pick one starting from -1 and going
    downwards. The priority value stored in the swap_avail_list is the
    negated value of the swap device's priority, because plists are sorted
    from low to high. The new policy doesn't change the semantics for the
    priority >= 0 cases; the previous "starting from -1 then downwards" now
    becomes "starting from -2 then downwards", and -1 is reserved as the
    promoted value. (A small sketch that reproduces the numbers below
    follows this entry.)

    Take a 4-node EX machine as an example; suppose 4 swap devices are
    available, each sitting on a different node:
    swapA on node 0
    swapB on node 1
    swapC on node 2
    swapD on node 3

    After they are all swapped on in the sequence A, B, C, D.

    Current behaviour:
    their priorities will be:
    swapA: -1
    swapB: -2
    swapC: -3
    swapD: -4
    And their position in the global swap_avail_list will be:
    swapA -> swapB -> swapC -> swapD
    prio:1 prio:2 prio:3 prio:4

    New behaviour:
    their priorities will be (note that -1 is skipped):
    swapA: -2
    swapB: -3
    swapC: -4
    swapD: -5
    And their positions in the 4 swap_avail_lists[nid] will be:
    swap_avail_lists[0]: /* node 0's available swap device list */
    swapA -> swapB -> swapC -> swapD
    prio:1 prio:3 prio:4 prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    swapB -> swapA -> swapC -> swapD
    prio:1 prio:2 prio:4 prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    swapC -> swapA -> swapB -> swapD
    prio:1 prio:2 prio:3 prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    swapD -> swapA -> swapB -> swapC
    prio:1 prio:2 prio:3 prio:4

    To see the effect of the patch, a test is used that starts N processes,
    each of which mmaps a region of anonymous memory and then continually
    writes to it at random positions to trigger both swap-in and swap-out.

    On a 2-node Skylake EP machine with 64GiB of memory, two 170GB SSD
    drives are used as swap devices, each attached to a different node. The
    result is:

    runtime=30m/processes=32/total test size=128G/each process mmap region=4G
    kernel throughput
    vanilla 13306
    auto-binding 15169 +14%

    runtime=30m/processes=64/total test size=128G/each process mmap region=2G
    kernel throughput
    vanilla 11885
    auto-binding 14879 +25%

    [aaron.lu@intel.com: v2]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    [akpm@linux-foundation.org: use kmalloc_array()]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    Signed-off-by: Aaron Lu
    Cc: "Chen, Tim C"
    Cc: Huang Ying
    Cc: Andi Kleen
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
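    A small standalone sketch that reproduces the per-node lists shown above
    under the described policy (system-assigned priorities start at -2 and
    go downwards; the per-node plist stores the negated value, and a device
    is promoted to priority 1 on its own node):

      #include <stdio.h>

      #define NR_NODES 4
      #define NR_DEVS  4

      struct swap_dev {
              const char *name;
              int node;       /* node the device is attached to */
              int prio;       /* system-assigned priority: -2, -3, ... */
      };

      int main(void)
      {
              /* Devices swapped on in the order A, B, C, D, one per node,
               * with no user-supplied priority. */
              struct swap_dev devs[NR_DEVS] = {
                      { "swapA", 0, -2 },
                      { "swapB", 1, -3 },
                      { "swapC", 2, -4 },
                      { "swapD", 3, -5 },
              };

              for (int nid = 0; nid < NR_NODES; nid++) {
                      int eff[NR_DEVS], order[NR_DEVS];

                      for (int i = 0; i < NR_DEVS; i++) {
                              /* Negate the device priority; promote to 1 on
                               * the device's own node. */
                              eff[i] = (devs[i].node == nid) ? 1 : -devs[i].prio;
                              order[i] = i;
                      }
                      /* The plist keeps entries sorted, lowest value first. */
                      for (int i = 0; i < NR_DEVS; i++)
                              for (int j = i + 1; j < NR_DEVS; j++)
                                      if (eff[order[j]] < eff[order[i]]) {
                                              int t = order[i];

                                              order[i] = order[j];
                                              order[j] = t;
                                      }
                      printf("swap_avail_lists[%d]:", nid);
                      for (int i = 0; i < NR_DEVS; i++)
                              printf("  %s prio:%d",
                                     devs[order[i]].name, eff[order[i]]);
                      printf("\n");
              }
              return 0;
      }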
     
  • TIF_MEMDIE is set only on tasks which were either directly selected
    by the OOM killer or passed through mark_oom_victim from the allocator
    path. tsk_is_oom_victim is more generic and allows identifying all
    tasks (threads) which share the mm with the oom victim.

    Please note that the freezer still needs to check TIF_MEMDIE, because we
    cannot thaw tasks which do not participate in oom_victims counting;
    otherwise a !TIF_MEMDIE task could interfere after oom_disable returns.

    Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings to this implementation,
    though.

    First of all, and most seriously, the full access to memory reserves is
    quite dangerous because we leave no safety room for the system to
    operate and potentially take last emergency steps to move on.

    Secondly, this flag is per task_struct while the OOM killer operates at
    mm_struct granularity, so all processes sharing the given mm are killed.
    Giving full access to all these task_structs could lead to quick
    depletion of the memory reserves. We have tried to reduce this risk by
    giving TIF_MEMDIE only to the main thread and the currently allocating
    task, but that doesn't really solve the problem, while it surely opens
    up room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside
    the allocator without access to memory reserves because a particular
    thread was not the group leader.

    Now that we have the oom reaper, and all oom victims are reapable
    after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads"), we can be more conservative and grant only partial access to
    memory reserves, because there is a reasonable chance of parallel memory
    freeing. We still want some access to the reserves because we do not
    want other consumers to eat up the victim's freed memory. oom victims
    will still contend with __GFP_HIGH users, but those shouldn't be so
    aggressive as to starve oom victims completely.

    Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
    to half of the reserves. This makes access to the reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it with
    tsk_is_oom_victim as well, which will finally make page_alloc.c
    completely TIF_MEMDIE-free.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.

    There is a demand to make the oom killer memcg-aware, which will imply
    many tasks being killed at once. This change will allow such a use case
    without worrying about completely depleting the memory reserves.

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • It's been noted that z3fold doesn't scale well when it's run with a
    large number of threads on many cores, which can easily be reproduced
    with the fio 'randrw' test with --numjobs=32. E.g. the result for 1
    cluster (4 cores) is:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
    WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...

    While for 8 cores (2 clusters) the result is:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
    WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...

    The bottleneck here is the pool lock, which many threads end up waiting
    on. To reduce that spin lock contention, z3fold can operate only on
    the lists local to the current CPU whenever possible. Due to the nature
    of z3fold unbuddied list handling (it only takes the first entry off the
    list on a hot path), if the z3fold pool is big enough and balanced well
    enough, limiting the search to the local unbuddied list only doesn't
    lead to a significant compression ratio degradation (2.57x vs 2.65x in
    our measurements).

    This patch also introduces two worker threads: one for async in-page
    object layout optimization and one for releasing freed pages. This is
    done to speed up z3fold_free() which is often on a hot path.

    The fio results for 8-core case are now the following:

    Run status group 0 (all jobs):
    READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
    WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...

    So we're in for an almost 6x performance increase.

    Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com
    Signed-off-by: Vitaly Wool
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • VMA based swap readahead reads ahead virtual pages that are contiguous
    in the virtual address space, while the original swap readahead reads
    ahead swap slots that are contiguous on the swap device. Although VMA
    based swap readahead chooses more relevant swap slots to read ahead, it
    triggers more small random reads, which may degrade the performance of
    an HDD (hard disk) heavily and may ultimately outweigh the benefit.

    To avoid this issue, this patch disables VMA based swap readahead when
    an HDD is used as the swap device and falls back to the original swap
    readahead instead.
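
    A sketch of what the policy switch might look like; the counter and
    knob names below are illustrative:

        /* Sketch: use VMA based readahead only when it is enabled and no
         * rotational (HDD) swap device is active. */
        static atomic_t nr_rotate_swap = ATOMIC_INIT(0);
        static bool enable_vma_readahead = true;

        static inline bool swap_use_vma_readahead(void)
        {
                return enable_vma_readahead && !atomic_read(&nr_rotate_swap);
        }

    The rotational-device counter would be bumped when a swap device that
    reports itself as non-SSD is activated, and dropped again at swapoff
    time.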

    Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • A sysfs interface to control VMA based swap readahead is added as
    follows:

    /sys/kernel/mm/swap/vma_ra_enabled

    Enables the VMA based swap readahead algorithm; when disabled, the
    original global swap readahead algorithm is used.

    /sys/kernel/mm/swap/vma_ra_max_order

    Sets the maximum order of the readahead window size for the VMA based
    swap readahead algorithm.

    The corresponding ABI documentation is added too.
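
    A sketch of how a boolean knob like vma_ra_enabled is typically wired
    up under /sys/kernel/mm/swap/; the parsing below is simplified and
    illustrative:

        static bool vma_ra_enabled = true;

        static ssize_t vma_ra_enabled_show(struct kobject *kobj,
                                           struct kobj_attribute *attr,
                                           char *buf)
        {
                return sprintf(buf, "%s\n", vma_ra_enabled ? "true" : "false");
        }

        static ssize_t vma_ra_enabled_store(struct kobject *kobj,
                                            struct kobj_attribute *attr,
                                            const char *buf, size_t count)
        {
                if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
                        vma_ra_enabled = true;
                else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
                        vma_ra_enabled = false;
                else
                        return -EINVAL;
                return count;
        }

        static struct kobj_attribute vma_ra_enabled_attr =
                __ATTR(vma_ra_enabled, 0644,
                       vma_ra_enabled_show, vma_ra_enabled_store);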

    Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Swap readahead is an important mechanism for reducing swap-in latency.
    Although a purely sequential memory access pattern isn't very common
    for anonymous memory, spatial locality is still considered valid.

    In the original swap readahead implementation, consecutive blocks in
    the swap device are read ahead based on a global spatial locality
    estimation. But consecutive blocks in the swap device merely reflect
    the order of page reclaim and don't necessarily reflect the access
    pattern in virtual memory. And different tasks in the system may have
    different access patterns, which makes the global spatial locality
    estimation inaccurate.

    In this patch, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the
    faulting swap slot in the swap device. This avoids reading ahead
    unrelated swap slots. At the same time, swap readahead is changed to
    work per-VMA instead of globally, so that the different access
    patterns of different VMAs can be distinguished and different
    readahead policies applied accordingly. The original core readahead
    detection and scaling algorithm is reused, because it is an effective
    algorithm for detecting spatial locality.
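
    A sketch of the windowing idea: the readahead window is centred on the
    faulting address and clamped to the VMA, and every swapped-out PTE in
    that window is then read asynchronously. The helper below is
    illustrative, not the actual code:

        /* Sketch only: compute a VMA-clamped window of `win` pages around
         * the faulting address without underflowing below vm_start. */
        static void vma_ra_window(struct vm_area_struct *vma,
                                  unsigned long faddr, unsigned int win,
                                  unsigned long *start, unsigned long *end)
        {
                unsigned long half = ((unsigned long)win / 2) << PAGE_SHIFT;

                *start = (faddr - vma->vm_start > half) ?
                                faddr - half : vma->vm_start;
                *end = (vma->vm_end - faddr > half) ?
                                faddr + half : vma->vm_end;
        }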

    The tests and results are as follows.

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space, repeating the sequential memory writes for 300
    seconds. The first round of writing triggers swap-out; the following
    rounds trigger sequential swap-in and swap-out.

    At the same time, the vm-scalability random swap test case runs in the
    background: 8 processes eat 30G of virtual memory space, repeating the
    random memory writes for 300 seconds. This triggers random swap-in in
    the background.

    This is a combined workload with sequential and random memory accesses
    at the same time. The result (for the sequential workload) is as
    follows:

                        Base          Optimized
                        ----          ---------
    throughput          345413 KB/s   414029 KB/s (+19.9%)
    latency.average     97.14 us      61.06 us (-37.1%)
    latency.50th        2 us          1 us
    latency.60th        2 us          1 us
    latency.70th        98 us         2 us
    latency.80th        160 us        2 us
    latency.90th        260 us        217 us
    latency.95th        346 us        369 us
    latency.99th        1.34 ms       1.09 ms
    ra_hit%             52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                    Base        Optimized
                    ----        ---------
    elapsed_time    393.49 s    329.88 s (-16.2%)
    ra_hit%         86.21%      98.82%

    The scores of the base and optimized kernels show no visible
    difference, but the elapsed time is reduced and the readahead hit rate
    is improved, so the optimized kernel performs better during the
    startup and teardown stages. And the absolute readahead hit rate is
    high, which shows that spatial locality is still valid in some
    practical workloads.

    Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In the original implementation, it is possible that existing pages in
    the swap cache (not newly read ahead) could be marked as readahead
    pages. This makes the swap readahead statistics wrong and influences
    the swap readahead algorithm too.

    This is fixed by marking a page as a readahead page only if it is
    newly allocated and read from disk.

    When testing with Linpack, this fix increased the swap readahead hit
    rate from ~66% to ~86%.
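
    A sketch of the distinction being drawn; treat the snippet as
    illustrative rather than the exact patch, including the helper names
    and their arguments:

        static struct page *read_one_slot(swp_entry_t entry, gfp_t gfp,
                                          struct vm_area_struct *vma,
                                          unsigned long addr, bool is_ra)
        {
                bool page_allocated;
                struct page *page;

                page = __read_swap_cache_async(entry, gfp, vma, addr,
                                               &page_allocated);
                if (page && page_allocated) {
                        swap_readpage(page, false);
                        /* Only pages newly read from disk are marked, so
                         * existing swap-cache pages no longer skew the
                         * readahead statistics. */
                        if (is_ra)
                                SetPageReadahead(page);
                }
                return page;
        }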

    Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "mm, swap: VMA based swap readahead", v4.

    Swap readahead is an important mechanism for reducing swap-in latency.
    Although a purely sequential memory access pattern isn't very common
    for anonymous memory, spatial locality is still considered valid.

    In the original swap readahead implementation, consecutive blocks in
    the swap device are read ahead based on a global spatial locality
    estimation. But consecutive blocks in the swap device merely reflect
    the order of page reclaim and don't necessarily reflect the access
    pattern in the virtual memory space. And different tasks in the system
    may have different access patterns, which makes the global spatial
    locality estimation inaccurate.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the
    faulting swap slot in the swap device. This avoids reading ahead
    unrelated swap slots. At the same time, swap readahead is changed to
    work per-VMA instead of globally, so that the different access
    patterns of different VMAs can be distinguished and different
    readahead policies applied accordingly. The original core readahead
    detection and scaling algorithm is reused, because it is an effective
    algorithm for detecting spatial locality.

    In addition to the swap readahead changes, some new sysfs interfaces
    are added to show the efficiency of the readahead algorithm and some
    other swap statistics.

    This new implementation incurs more small random reads. On SSD, the
    improved correctness of the estimation and readahead targets should
    outweigh the potential increase in overhead, as also illustrated in
    the test results below. But on HDD, the overhead may outweigh the
    benefit, so the original implementation is used by default there.

    The tests and results are as follows.

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space, repeating the sequential memory writes for 300
    seconds. The first round of writing triggers swap-out; the following
    rounds trigger sequential swap-in and swap-out.

    At the same time, the vm-scalability random swap test case runs in the
    background: 8 processes eat 30G of virtual memory space, repeating the
    random memory writes for 300 seconds. This triggers random swap-in in
    the background.

    This is a combined workload with sequential and random memory accesses
    at the same time. The result (for the sequential workload) is as
    follows:

                        Base          Optimized
                        ----          ---------
    throughput          345413 KB/s   414029 KB/s (+19.9%)
    latency.average     97.14 us      61.06 us (-37.1%)
    latency.50th        2 us          1 us
    latency.60th        2 us          1 us
    latency.70th        98 us         2 us
    latency.80th        160 us        2 us
    latency.90th        260 us        217 us
    latency.95th        346 us        369 us
    latency.99th        1.34 ms       1.09 ms
    ra_hit%             52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                    Base        Optimized
                    ----        ---------
    elapsed_time    393.49 s    329.88 s (-16.2%)
    ra_hit%         86.21%      98.82%

    The scores of the base and optimized kernels show no visible
    difference, but the elapsed time is reduced and the readahead hit rate
    is improved, so the optimized kernel performs better during the
    startup and teardown stages. And the absolute readahead hit rate is
    high, which shows that spatial locality is still valid in some
    practical workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interfaces:

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of swap readahead can be measured, so that
    the swap readahead algorithm and its parameters can be tuned
    accordingly.
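
    A sketch of how such counters can be kept and exposed; whether the
    final implementation uses plain atomics, per-CPU counters or vm events
    is an implementation detail, and the names below are illustrative:

        static atomic64_t swap_ra_total = ATOMIC64_INIT(0);
        static atomic64_t swap_ra_hits = ATOMIC64_INIT(0);

        /* Called when a readahead page is submitted / when it is hit. */
        static inline void count_swap_ra(void)
        {
                atomic64_inc(&swap_ra_total);
        }

        static inline void count_swap_ra_hit(void)
        {
                atomic64_inc(&swap_ra_hits);
        }

        static ssize_t ra_hits_show(struct kobject *kobj,
                                    struct kobj_attribute *attr, char *buf)
        {
                return sprintf(buf, "%lld\n",
                               (long long)atomic64_read(&swap_ra_hits));
        }

    The hit rate reported in the tables above is then simply
    ra_hits / ra_total.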

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Although llist provides proper APIs, they are not used here. Convert
    the code to use them.
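
    A hedged example of the kind of usage the llist API is meant for; the
    context structure below is made up for illustration:

        #include <linux/llist.h>
        #include <linux/slab.h>

        struct deferred_item {
                struct llist_node node;
                void *payload;
        };

        static LLIST_HEAD(deferred_list);

        static void defer_one(struct deferred_item *item)
        {
                llist_add(&item->node, &deferred_list);   /* lock-less push */
        }

        static void drain_deferred(void)
        {
                struct llist_node *list = llist_del_all(&deferred_list);
                struct deferred_item *item, *tmp;

                /* _safe variant because entries are freed while walking. */
                llist_for_each_entry_safe(item, tmp, list, node)
                        kfree(item);
        }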

    Link: http://lkml.kernel.org/r/1502095374-16112-1-git-send-email-byungchul.park@lge.com
    Signed-off-by: Byungchul Park
    Cc: zijun_hu
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Joel Fernandes
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Byungchul Park
     
  • The comment for pagetypeinfo_showblockcount() was mistakenly duplicated
    from pagetypeinfo_show_free()'s comment. This commit fixes it.

    Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: SeongJae Park
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • With the addition of hugetlbfs support in memfd_create, the memfd
    selftests should verify correct functionality with hugetlbfs.

    Instead of writing a separate memfd hugetlbfs test, modify the
    memfd_test program to take an optional argument 'hugetlbfs'. If the
    hugetlbfs argument is specified, basic memfd_create functionality will
    be exercised on hugetlbfs. If hugetlbfs is not specified, the current
    functionality of the test is unchanged.

    Note that many of the tests in memfd_test exercise file sealing
    operations. hugetlbfs does not support file sealing, therefore all
    sealing-related tests are skipped for hugetlbfs.

    In order to test on hugetlbfs, huge pages need to be preallocated. A
    new script (run_tests) is added. This script first runs the existing
    memfd_create tests. It then attempts to allocate the required number
    of huge pages before running the hugetlbfs test. At the end of
    testing, it releases any huge pages that were allocated for testing
    purposes.

    Link: http://lkml.kernel.org/r/1502495772-24736-3-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • This patch came out of discussions in this e-mail thread:
    http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com

    The Oracle JVM team is developing a new garbage collection model. This
    new model requires multiple mappings of the same anonymous memory. One
    straightforward way to accomplish this is with memfd_create. They can
    use the returned fd to create multiple mappings of the same memory.

    The JVM today has an option to use (static hugetlb) huge pages. If this
    option is specified, they would like to use the same garbage collection
    model requiring multiple mappings to the same memory. Using hugetlbfs,
    it is possible to explicitly mount a filesystem and specify file paths
    in order to get an fd that can be used for multiple mappings. However,
    this introduces additional system admin work and coordination.

    Ideally they would like to get a hugetlbfs fd without requiring explicit
    mounting of a filesystem. Today, mmap and shmget can make use of
    hugetlbfs without explicitly mounting a filesystem. The patch adds this
    functionality to memfd_create.

    Add a new flag MFD_HUGETLB to memfd_create() that specifies that the
    file to be created resides in the hugetlbfs filesystem. This is the
    generic hugetlbfs filesystem, not associated with any specific mount
    point. As with other system calls that request hugetlbfs backed pages,
    there is the ability to encode a huge page size in the flags argument.

    hugetlbfs does not support sealing operations, therefore specifying
    MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.

    Of course, the memfd_create man page would need updating if this
    functionality moves forward.
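
    A small user-space usage sketch, assuming a kernel with MFD_HUGETLB, a
    glibc new enough to expose memfd_create() and the flag through
    <sys/mman.h>, and preallocated 2MB huge pages:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 2 * 1024 * 1024;   /* one 2MB huge page */

                /* No MFD_ALLOW_SEALING: combining it with MFD_HUGETLB
                 * would fail with EINVAL. */
                int fd = memfd_create("jvm-heap", MFD_HUGETLB);
                if (fd < 0 || ftruncate(fd, len) < 0) {
                        perror("memfd_create/ftruncate");
                        return 1;
                }

                /* Two independent mappings of the same hugetlbfs memory. */
                char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (a == MAP_FAILED || b == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                strcpy(a, "shared");
                printf("second mapping sees \"%s\"\n", b);
                return 0;
        }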

    Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
    per section's worth of memory (128MB). The key for each of those
    entries is a section number.

    This leads to false positives when devm_memremap_pages() is passed a
    section-unaligned range, as lookups in the misaligned portion fail to
    return NULL. We can close this hole by using the pfn as the key for
    entries in the tree. The number of entries required to describe a
    remapped range is reduced by leveraging multi-order entries.

    In practice this approach usually yields just one entry in the tree if
    the size and starting address are of the same power-of-2 alignment.
    Previously we always needed nr_entries = mapping_size / 128MB.
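
    A sketch of the key/order calculation implied here: cover the remapped
    pfn range with the fewest power-of-2 entries allowed by the alignment
    of the current pfn and the remaining size. The helper below is
    illustrative, not the actual kernel code:

        /* Returns how many multi-order entries a [start_pfn, +nr_pfns)
         * range needs; 1 when the start and the size share alignment. */
        static unsigned long count_range_entries(unsigned long start_pfn,
                                                 unsigned long nr_pfns)
        {
                unsigned long pfn = start_pfn, entries = 0;

                while (nr_pfns) {
                        unsigned int order = min_t(unsigned int,
                                        ilog2(nr_pfns),
                                        pfn ? __ffs(pfn) : ilog2(nr_pfns));

                        /* ...insert one order-`order` entry keyed at pfn... */
                        pfn += 1UL << order;
                        nr_pfns -= 1UL << order;
                        entries++;
                }
                return entries;
        }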

    Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
    Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Toshi Kani
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • pcpu_get_vm_areas() checks that the requested ranges do not overlap.
    To verify this, only (N^2)/2 comparisons are necessary, while the
    current code performs N^2 of them. By starting the inner loop from the
    next range, it achieves the same goal and the 'continue' can be
    removed.

    Also,

    - the overlap check of two ranges can be done with a single clause
      (see the sketch below)

    - a typo in a comment is fixed.
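
    The single-clause overlap test referred to above boils down to the
    usual half-open range intersection check; a generic sketch:

        /* [start, end) and [start2, end2) overlap iff each range begins
         * before the other one ends -- one clause. */
        static bool ranges_overlap(unsigned long start, unsigned long end,
                                   unsigned long start2, unsigned long end2)
        {
                return start2 < end && start < end2;
        }

        /* (N^2)/2 comparisons: the inner loop starts after the outer
         * index, so each pair is checked exactly once. */
        static void check_no_overlap(const unsigned long *start,
                                     const unsigned long *end, int nr)
        {
                int i, j;

                for (i = 0; i < nr; i++)
                        for (j = i + 1; j < nr; j++)
                                BUG_ON(ranges_overlap(start[i], end[i],
                                                      start[j], end[j]));
        }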

    Link: http://lkml.kernel.org/r/20170803063822.48702-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Tejun Heo
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When order is -1 or too big, *1UL << order* becomes 0, which causes a
    divide error. Although it seems that all callers of
    __fragmentation_index() only pass a valid order, this patch makes the
    function more robust.

    This should prevent recurrences of
    https://bugzilla.kernel.org/show_bug.cgi?id=196555
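
    A sketch of the kind of guard involved; the arithmetic below is
    simplified from the real helper and the parameters are illustrative:

        static int fragmentation_index_guarded(unsigned int order,
                                               unsigned long free_pages,
                                               unsigned long free_blocks_total)
        {
                unsigned long requested;

                /* An order of -1 arrives here as a huge unsigned value;
                 * anything >= BITS_PER_LONG would make the shift below
                 * unusable as a divisor and trigger a divide error. */
                if (WARN_ON_ONCE(order >= BITS_PER_LONG))
                        return 0;
                if (!free_blocks_total)
                        return 0;

                requested = 1UL << order;
                return 1000 - (1000 + free_pages * 1000 / requested)
                                / free_blocks_total;
        }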

    Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
    Signed-off-by: Wen Yang
    Reviewed-by: Jiang Biao
    Suggested-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • alloc_gigantic_page doesn't consider the movability of the gigantic
    hugetlb page when scanning eligible ranges for the allocation. As 1GB
    hugetlb pages are currently not movable, this can break the movable
    zone assumption that all allocations are migratable and, as such,
    break memory hotplug.

    Reorganize the code and use the standard zonelist allocation scheme
    that we use for standard hugetlb pages. htlb_alloc_mask will ensure
    that only migratable hugetlb pages will ever see a movable zone.

    Link: http://lkml.kernel.org/r/20170803083549.21407-1-mhocko@kernel.org
    Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: Luiz Capitulino
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko