17 May, 2019

1 commit

  • Pull ARM SoC-related driver updates from Olof Johansson:
    "Various driver updates for platforms and a couple of the small driver
    subsystems we merge through our tree:

    Among the larger pieces:

    - Power management improvements for TI am335x and am437x (RTC
    suspend/wake)

    - Misc new additions for Amlogic (socinfo updates)

    - ZynqMP FPGA manager

    - Nvidia improvements for reset/powergate handling

    - PMIC wrapper for Mediatek MT8516

    - Misc fixes/improvements for ARM SCMI, TEE, NXP i.MX SCU drivers"

    * tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (57 commits)
    soc: aspeed: fix Kconfig
    soc: add aspeed folder and misc drivers
    spi: zynqmp: Fix build break
    soc: imx: Add generic i.MX8 SoC driver
    MAINTAINERS: Update email for Qualcomm SoC maintainer
    memory: tegra: Fix a typos for "fdcdwr2" mc client
    Revert "ARM: tegra: Restore memory arbitration on resume from LP1 on Tegra30+"
    memory: tegra: Replace readl-writel with mc_readl-mc_writel
    memory: tegra: Fix integer overflow on tick value calculation
    memory: tegra: Fix missed registers values latching
    ARM: tegra: cpuidle: Handle tick broadcasting within cpuidle core on Tegra20/30
    optee: allow to work without static shared memory
    soc/tegra: pmc: Move powergate initialisation to probe
    soc/tegra: pmc: Remove reset sysfs entries on error
    soc/tegra: pmc: Fix reset sources and levels
    soc: amlogic: meson-gx-pwrc-vpu: Add support for G12A
    soc: amlogic: meson-gx-pwrc-vpu: Fix power on/off register bitmask
    fpga manager: Adding FPGA Manager support for Xilinx zynqmp
    dt-bindings: fpga: Add bindings for ZynqMP fpga driver
    firmware: xilinx: Add fpga API's
    ...

    Linus Torvalds
     

16 May, 2019

7 commits

  • Pull ARM Device-tree updates from Olof Johansson:
    "Besides new bindings and additional descriptions of hardware blocks
    for various SoCs and boards, the main new content here is:

    SoCs:
    - Intel Agilex (SoCFPGA)
    - NXP i.MX8MM (Quad Cortex-A53 with media/graphics focus)

    New boards:
    - Allwinner:
    + RerVision H3-DVK (H3)
    + Oceanic 5205 5inMFD (H6)
    + Beelink GS2 (H6)
    + Orange Pi 3 (H6)
    - Rockchip:
    + Orange Pi RK3399
    + Nanopi NEO4
    + Veyron-Mighty Chromebook variant
    - Amlogic:
    + SEI Robotics SEI510
    - ST Micro:
    + stm32mp157a discovery1
    + stm32mp157c discovery2
    - NXP:
    + Eckelmann ci4x10 (i.MX6DL)
    + i.MX8MM EVK (i.MX8MM)
    + ZII i.MX7 RPU2 (i.MX7)
    + ZII SPB4 (VF610)
    + Zii Ultra (i.MX8M)
    + TQ TQMa7S (i.MX7Solo)
    + TQ TQMa7D (i.MX7Dual)
    + Kobo Aura (i.MX50)
    + Menlosystems M53 (i.MX53)
    - Nvidia:
    + Jetson Nano (Tegra T210)"

    * tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (593 commits)
    arm64: dts: bitmain: Add UART pinctrl support for Sophon Edge
    arm64: dts: bitmain: Add pinctrl support for BM1880 SoC
    arm64: dts: bitmain: Add GPIO Line names for Sophon Edge board
    arm64: dts: bitmain: Add GPIO support for BM1880 SoC
    ARM: dts: gemini: Indent DIR-685 partition table
    dt-bindings: hwmon (pwm-fan) Remove dead "cooling-*-state" properties
    ARM: dts: qcom-apq8064: Set 'cxo_board' as ref clock of the DSI PHY
    arm64: dts: msm8998: thermal: Restrict thermal zone name length to under 20
    arm64: dts: msm8998: thermal: Fix number of supported sensors
    arm64: dts: msm8998-mtp: thermal: Remove skin and battery thermal zones
    arm64: dts: exynos: Move fixed-clocks out of soc
    arm64: dts: exynos: Move pmu and timer nodes out of soc
    ARM: dts: s5pv210: Fix camera clock provider on Goni board
    ARM: dts: exynos: Properly override node to use MDMA0 on Universal C210
    ARM: dts: exynos: Move fixed-clocks out of soc on Exynos3250
    ARM: dts: exynos: Remove unneeded address/size cells from fixed-clock on Exynos3250
    ARM: dts: exynos: Move pmu and timer nodes out of soc
    arm64: dts: rockchip: fix IO domain voltage setting of APIO5 on rockpro64
    arm64: dts: db820c: Add sound card support
    arm64: dts: apq8096-db820c: Add HDMI display support
    ...

    Linus Torvalds
     
  • Pull ARM SoC platform updates from Olof Johansson:
    "SoC updates, mostly refactorings and cleanups of old legacy platforms.

    Major themes this release:

    - Conversion of ixp4xx to a modern platform (drivers, DT, bindings)

    - Moving some of the ep93xx headers around to get it closer to
    multiplatform enabled.

    - Cleanups of Davinci

    This also contains a few patches that were queued up as fixes before
    5.1 but that I didn't get sent in before the release"

    * tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (123 commits)
    ARM: debug-ll: add default address for digicolor
    ARM: u300: regulator: add MODULE_LICENSE()
    ARM: ep93xx: move private headers out of mach/*
    ARM: ep93xx: move pinctrl interfaces into include/linux/soc
    ARM: ep93xx: keypad: stop using mach/platform.h
    ARM: ep93xx: move network platform data to separate header
    ARM: stm32: add AMBA support for stm32 family
    MAINTAINERS: update arch/arm/mach-davinci
    ARM: rockchip: add missing of_node_put in rockchip_smp_prepare_pmu
    ARM: dts: Add queue manager and NPE to the IXP4xx DTSI
    soc: ixp4xx: qmgr: Add DT probe code
    soc: ixp4xx: qmgr: Add DT bindings for IXP4xx qmgr
    soc: ixp4xx: npe: Add DT probe code
    soc: ixp4xx: Add DT bindings for IXP4xx NPE
    soc: ixp4xx: qmgr: Pass resources
    soc: ixp4xx: Remove unused functions
    soc: ixp4xx: Uninline several functions
    soc: ixp4xx: npe: Pass addresses as resources
    ARM: ixp4xx: Turn the QMGR into a platform device
    ARM: ixp4xx: Turn the NPE into a platform device
    ...

    Linus Torvalds
     
  • Pull thermal soc updates from Eduardo Valentin:

    - thermal core has a new devm_* API for registering cooling devices. I
    took the entire series, that is why you see changes on drivers/hwmon
    in this pull (Guenter Roeck)

    - rockchip thermal driver gains support for the PX30 SoC (Elaine Zhang)

    - the generic-adc thermal driver now considers the lookup table DT
    property as optional (Jean-Francois Dagenais)

    - Refactoring of tsens thermal driver (Amit Kucheria)

    - Cleanups on cpu cooling driver (Daniel Lezcano)

    - broadcom thermal driver dropped support for ACPI (Srinath Mannam)

    - tegra thermal driver gains support for OC hw throttle and GPU throttle
    (Wei Ni)

    - Fixes in several thermal drivers.

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal: (59 commits)
    hwmon: (pwm-fan) Use devm_thermal_of_cooling_device_register
    hwmon: (npcm750-pwm-fan) Use devm_thermal_of_cooling_device_register
    hwmon: (mlxreg-fan) Use devm_thermal_of_cooling_device_register
    hwmon: (gpio-fan) Use devm_thermal_of_cooling_device_register
    hwmon: (aspeed-pwm-tacho) Use devm_thermal_of_cooling_device_register
    thermal: rcar_gen3_thermal: Fix to show correct trip points number
    thermal: rcar_thermal: update calculation formula for R-Car Gen3 SoCs
    thermal: cpu_cooling: Actually trace CPU load in thermal_power_cpu_get_power
    thermal: rockchip: Support the PX30 SoC in thermal driver
    dt-bindings: rockchip-thermal: Support the PX30 SoC compatible
    thermal: rockchip: fix up the tsadc pinctrl setting error
    thermal: broadcom: Remove ACPI support
    thermal: Fix build error of missing devm_ioremap_resource on UM
    thermal/drivers/cpu_cooling: Remove pointless field
    thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
    thermal/drivers/cpu_cooling: Fixup the header and copyright
    thermal/drivers/cpu_cooling: Remove pointless test in power2state()
    thermal: rcar_gen3_thermal: disable interrupt in .remove
    thermal: rcar_gen3_thermal: fix interrupt type
    thermal: Introduce devm_thermal_of_cooling_device_register
    ...

    Linus Torvalds
     
  • Merge in a few pending fixes from pre-5.1 that didn't get sent in:

    MAINTAINERS: update arch/arm/mach-davinci
    ARM: dts: ls1021: Fix SGMII PCS link remaining down after PHY disconnect
    ARM: dts: imx6q-logicpd: Reduce inrush current on USBH1
    ARM: dts: imx6q-logicpd: Reduce inrush current on start
    ARM: dts: imx: Fix the AR803X phy-mode
    ARM: dts: sun8i: a33: Reintroduce default pinctrl muxing
    arm64: dts: allwinner: a64: Rename hpvcc-supply to cpvdd-supply
    ARM: sunxi: fix a leaked reference by adding missing of_node_put
    ARM: sunxi: fix a leaked reference by adding missing of_node_put

    Signed-off-by: Olof Johansson

    Olof Johansson
     
  • Pull power supply and reset updates from Sebastian Reichel:
    "Core:
    - Add over-current health state
    - Add standard, adaptive and custom charge types
    - Add new properties for start/end charge threshold

    New Drivers / Hardware:
    - UCS1002 Programmable USB Port Power Controller
    - Ingenic JZ47xx Battery Fuel Gauge
    - AXP20x USB Power: Add AXP813 support
    - AT91 poweroff: Add SAM9X60 support
    - OLPC battery: Add XO-1.5 and XO-1.75 support

    Misc Changes:
    - syscon-reboot: support mask property
    - AXP288 fuel gauge: Blacklist ACEPC T8/T11. Looks like some vendor
    thought it's a good idea to build a desktop system with a fuel
    gauge, that slowly "discharges"...
    - cpcap-battery: Fix calculation errors
    - misc fixes"

    * tag 'for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (54 commits)
    power: supply: olpc_battery: force the le/be casts
    power: supply: ucs1002: Fix build error without CONFIG_REGULATOR
    power: supply: ucs1002: Fix wrong return value checking
    power: supply: Add driver for Microchip UCS1002
    dt-bindings: power: supply: Add bindings for Microchip UCS1002
    power: supply: core: Add POWER_SUPPLY_HEALTH_OVERCURRENT constant
    power: supply: core: fix clang -Wunsequenced
    power: supply: core: Add missing documentation for CHARGE_CONTROL_* properties
    power: supply: core: Add CHARGE_CONTROL_{START_THRESHOLD,END_THRESHOLD} properties
    power: supply: core: Add Standard, Adaptive, and Custom charge types
    power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist
    power: supply: bq27xxx_battery: Notify also about status changes
    power: supply: olpc_battery: Have the framework register sysfs files for us
    power: supply: olpc_battery: Add OLPC XO 1.75 support
    power: supply: olpc_battery: Avoid using platform_info
    power: supply: olpc_battery: Use devm_power_supply_register()
    power: supply: olpc_battery: Move priv data to a struct
    power: supply: olpc_battery: Use DT to get battery version
    x86/platform/olpc: Use a correct version when making up a battery node
    x86/platform/olpc: Trivial code move in DT fixup
    ...

    Linus Torvalds
     
  • Pull nfsd updates from Bruce Fields:
    "This consists mostly of nfsd container work:

    Scott Mayhew revived an old api that communicates with a userspace
    daemon to manage some on-disk state that's used to track clients
    across server reboots. We've been using a usermode_helper upcall for
    that, but it's tough to run those with the right namespaces, so a
    daemon is much friendlier to container use cases.

    Trond fixed nfsd's handling of user credentials in user namespaces. He
    also contributed patches that allow containers to support different
    sets of NFS protocol versions.

    The only remaining container bug I'm aware of is that the NFS reply
    cache is shared between all containers. If anyone's aware of other
    gaps in our container support, let me know.

    The rest of this is miscellaneous bugfixes"

    * tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
    nfsd: update callback done processing
    locks: move checks from locks_free_lock() to locks_release_private()
    nfsd: fh_drop_write in nfsd_unlink
    nfsd: allow fh_want_write to be called twice
    nfsd: knfsd must use the container user namespace
    SUNRPC: rsi_parse() should use the current user namespace
    SUNRPC: Fix the server AUTH_UNIX userspace mappings
    lockd: Pass the user cred from knfsd when starting the lockd server
    SUNRPC: Temporary sockets should inherit the cred from their parent
    SUNRPC: Cache the process user cred in the RPC server listener
    nfsd: Allow containers to set supported nfs versions
    nfsd: Add custom rpcbind callbacks for knfsd
    SUNRPC: Allow further customisation of RPC program registration
    SUNRPC: Clean up generic dispatcher code
    SUNRPC: Add a callback to initialise server requests
    SUNRPC/nfs: Fix return value for nfs4_callback_compound()
    nfsd: handle legacy client tracking records sent by nfsdcld
    nfsd: re-order client tracking method selection
    nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
    nfsd: un-deprecate nfsdcld
    ...

    Linus Torvalds
     
  • Pull tracing updates from Steven Rostedt:
    "The major changes in this tracing update include:

    - Removal of non-DYNAMIC_FTRACE from 32bit x86

    - Removal of mcount support from x86

    - Emulating a call from int3 on x86_64, fixes live kernel patching

    - Consolidated Tracing Error logs file

    Minor updates:

    - Removal of klp_check_compiler_support()

    - kdb ftrace dumping output changes

    - Accessing and creating ftrace instances from inside the kernel

    - Clean up of #define if macro

    - Introduction of TRACE_EVENT_NOP() to disable trace events based on
    config options

    And other minor fixes and clean ups"

    * tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
    x86: Hide the int3_emulate_call/jmp functions from UML
    livepatch: Remove klp_check_compiler_support()
    ftrace/x86: Remove mcount support
    ftrace/x86_32: Remove support for non DYNAMIC_FTRACE
    tracing: Simplify "if" macro code
    tracing: Fix documentation about disabling options using trace_options
    tracing: Replace kzalloc with kcalloc
    tracing: Fix partial reading of trace event's id file
    tracing: Allow RCU to run between postponed startup tests
    tracing: Fix white space issues in parse_pred() function
    tracing: Eliminate const char[] auto variables
    ring-buffer: Fix mispelling of Calculate
    tracing: probeevent: Fix to make the type of $comm string
    tracing: probeevent: Do not accumulate on ret variable
    tracing: uprobes: Re-enable $comm support for uprobe events
    ftrace/x86_64: Emulate call function while updating in breakpoint handler
    x86_64: Allow breakpoints to emulate call instructions
    x86_64: Add gap to int3 to allow for call emulation
    tracing: kdb: Allow ftdump to skip all but the last few entries
    tracing: Add trace_total_entries() / trace_total_entries_cpu()
    ...

    Linus Torvalds
     

15 May, 2019

32 commits

  • Pull more ACPI updates from Rafael Wysocki:
    "These fix two regressions introduced during the 5.0 cycle, in ACPICA
    and in device PM, cause the values returned by _ADR to be stored in 64
    bits and fix two ACPI documentation issues.

    Specifics:

    - Update the ACPICA code in the kernel to upstream revision 20190509
    including one regression fix:
    * Prevent excessive ACPI debug messages from being printed by
    moving the ACPI_DEBUG_DEFAULT definition to the right place
    (Erik Schmauss).

    - Set the enable_for_wake bits for wakeup GPEs during suspend to idle
    to allow acpi_enable_all_wakeup_gpes() to enable them as
    appropriate and make wakeup devices signaling events through ACPI
    GPEs work with suspend-to-idle again (Rajat Jain).

    - Use 64 bits to store the return values of _ADR which are assumed to
    be 64-bit by some bus specs and may contain nonzero bits in the
    upper 32 bits part for some devices (Pierre-Louis Bossart).

    - Fix two minor issues with the ACPI documentation (Sakari Ailus)"

    * tag 'acpi-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle
    Documentation: ACPI: Direct references are allowed to devices only
    Documentation: ACPI: Use tabs for graph ASL indentation
    ACPICA: Update version to 20190509
    ACPICA: Linux: move ACPI_DEBUG_DEFAULT flag out of ifndef
    ACPI: bus: change _ADR representation to 64 bits

    Linus Torvalds
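
The motivation for the 64-bit _ADR change above can be illustrated with a small user-space model. This is only a sketch: the helper name `store_adr` and the sample value are hypothetical and do not come from the ACPI code.

```python
def store_adr(value, bits):
    """Keep only the low `bits` bits, as a fixed-width integer field would."""
    return value & ((1 << bits) - 1)

# Hypothetical _ADR value with nonzero bits in the upper 32-bit half.
adr = 0x0000001000030001

as_32 = store_adr(adr, 32)  # upper half is silently dropped
as_64 = store_adr(adr, 64)  # full value survives

print(hex(as_32), hex(as_64))
```

With a 32-bit field the upper half of the address is lost, which is the case the commit is meant to avoid.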
     
  • Pull more power management updates from Rafael Wysocki:
    "These fix a recent regression causing kernels built with CONFIG_PM
    unset to crash on systems that support the Performance and Energy Bias
    Hint (EPB), clean up the cpufreq core and some users of transition
    notifiers and introduce a new power domain flag into the generic power
    domains framework (genpd).

    Specifics:

    - Fix recent regression causing kernels built with CONFIG_PM unset to
    crash on systems that support the Performance and Energy Bias Hint
    (EPB) by not compiling the EPB-related code that depends on
    CONFIG_PM when it is unset (Rafael Wysocki).

    - Clean up the transition notifier invocation code in the cpufreq
    core and change some users of cpufreq transition notifiers
    accordingly (Viresh Kumar).

    - Change MAINTAINERS to cover the schedutil governor as part of
    cpufreq (Viresh Kumar).

    - Simplify cpufreq_init_policy() to avoid redundant computations (Yue
    Hu).

    - Add explanatory comment to the cpufreq core (Rafael Wysocki).

    - Introduce a new flag, GENPD_FLAG_RPM_ALWAYS_ON, to the generic
    power domains (genpd) framework along with the first user of it
    (Leonard Crestez)"

    * tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
    PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag
    cpufreq: Update MAINTAINERS to include schedutil governor
    cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
    cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
    cpufreq: Call transition notifier only once for each policy
    x86: intel_epb: Take CONFIG_PM into account

    Linus Torvalds
     
  • * pm-cpufreq:
    cpufreq: Update MAINTAINERS to include schedutil governor
    cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
    cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
    cpufreq: Call transition notifier only once for each policy

    * pm-domains:
    soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
    PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag

    Rafael J. Wysocki
     
  • * acpi-bus:
    ACPI: bus: change _ADR representation to 64 bits

    * acpi-doc:
    Documentation: ACPI: Direct references are allowed to devices only
    Documentation: ACPI: Use tabs for graph ASL indentation

    * acpi-pm:
    ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle

    Rafael J. Wysocki
     
  • Pull more rdma updates from Jason Gunthorpe:
    "This is being sent to get a fix for the gcc 9.1 build warnings, and
    I've also pulled in some bug fix patches that were posted in the last
    two weeks.

    - Avoid the gcc 9.1 warning about overflowing a union member

    - Fix the wrong callback type for a single response netlink to doit

    - Bug fixes from more usage of the mlx5 devx interface"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    net/mlx5: Set completion EQs as shared resources
    IB/mlx5: Verify DEVX general object type correctly
    RDMA/core: Change system parameters callback from dumpit to doit
    RDMA: Directly cast the sockaddr union to sockaddr

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - a couple of hotfixes

    - almost all of the rest of MM

    - lib/ updates

    - binfmt_elf updates

    - autofs updates

    - quite a lot of misc fixes and updates
    - reiserfs, fatfs
    - signals
    - exec
    - cpumask
    - rapidio
    - sysctl
    - pids
    - eventfd
    - gcov
    - panic
    - pps

    - gdb script updates

    - ipc updates

    * emailed patches from Andrew Morton: (126 commits)
    mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
    mm: memcontrol: fix recursive statistics correctness & scalabilty
    mm: memcontrol: move stat/event counting functions out-of-line
    mm: memcontrol: make cgroup stats and events query API explicitly local
    drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
    drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
    mm, memcg: rename ambiguously named memory.stat counters and functions
    arch: remove <asm/sizes.h> and <asm-generic/sizes.h>
    treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>
    fs/block_dev.c: Remove duplicate header
    fs/cachefiles/namei.c: remove duplicate header
    include/linux/sched/signal.h: replace `tsk' with `task'
    fs/coda/psdev.c: remove duplicate header
    ipc: do cyclic id allocation for the ipc object.
    ipc: conserve sequence numbers in ipcmni_extend mode
    ipc: allow boot time extension of IPCMNI from 32k to 16M
    ipc/mqueue: optimize msg_get()
    ipc/mqueue: remove redundant wq task assignment
    ipc: prevent lockup on alloc_msg and free_msg
    scripts/gdb: print cached rate in lx-clk-summary
    ...

    Linus Torvalds
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page state at ~1%.

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
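
The write-side propagation described in the commit message above can be modelled in user space roughly as follows. This is a sketch, not the kernel implementation: the class `Cgroup`, the method names and the `BATCH` threshold are assumptions chosen for illustration.

```python
BATCH = 32  # assumed spill threshold for the per-CPU cache


class Cgroup:
    def __init__(self, parent=None, ncpus=2):
        self.parent = parent
        self.percpu = [0] * ncpus  # per-CPU deltas not yet spilled
        self.local = 0             # this group's own spilled count
        self.recursive = 0         # this group plus all descendants

    def mod_state(self, cpu, delta):
        """Record a change on `cpu`; spill upward once the cache exceeds BATCH."""
        self.percpu[cpu] += delta
        if abs(self.percpu[cpu]) > BATCH:
            spill = self.percpu[cpu]
            self.percpu[cpu] = 0
            self.local += spill
            node = self
            while node is not None:  # propagate to every ancestor
                node.recursive += spill
                node = node.parent

    def read_recursive(self):
        """O(1) read: no subtree walk is needed."""
        return self.recursive


root = Cgroup()
mid = Cgroup(root)
leaf = Cgroup(mid)

leaf.mod_state(0, 40)  # exceeds BATCH, spills to leaf, mid and root
leaf.mod_state(1, 10)  # stays in the per-CPU cache for now
mid.mod_state(0, 33)   # spills to mid and root only
```

Reads become cheap at the price of a bounded error: up to `BATCH` per CPU can still sit in a cache that readers do not see.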
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state()
    __mod_lruvec_state() and
    __count_memcg_events() out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    By default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
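
The difference between local and recursive counts, and the way an on-demand subtree walk lets event counters go backwards, can be sketched in user space as below. All names here are hypothetical and only model the behaviour described in the cover letter above.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.events_local = 0  # events charged to this group only
        if parent is not None:
            parent.children.append(self)


def events_local(cg):
    """Local count: this cgroup only."""
    return cg.events_local


def events_recursive_walk(cg):
    """Recursive count computed on demand; cost grows with subtree size."""
    return cg.events_local + sum(events_recursive_walk(c) for c in cg.children)


root = Cgroup("root")
a = Cgroup("a", root)
b = Cgroup("b", root)
a.events_local = 5
b.events_local = 7

before = events_recursive_walk(root)
root.children.remove(b)  # cgroup "b" is deleted and takes its events with it
after = events_recursive_walk(root)

print(before, after)
```

Here the supposedly monotonic recursive event count at the root drops once `b` disappears, which is the second problem the series sets out to fix.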
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Now that all instances of #include <asm/sizes.h> have been replaced with
    #include <linux/sizes.h>, we can remove these.

    Link: http://lkml.kernel.org/r/1553267665-27228-2-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • This file uses "task" 85 times and "tsk" 25 times. It is better to be
    consistent.

    Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com
    Signed-off-by: Andrei Vagin
    Reviewed-by: Andrew Morton
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrei Vagin
     
  • Rewrite, based on the patch from Waiman Long:

    The mixing in of a sequence number into the IPC IDs is probably to avoid
    ID reuse in userspace as much as possible. With ipcmni_extend mode, the
    number of usable sequence numbers is greatly reduced, leading to a higher
    chance of ID reuse.

    To address this issue, we need to conserve the sequence number space as
    much as possible. Right now, the sequence number is incremented for
    every new ID created. In reality, we only need to increment the
    sequence number when the newly allocated ID is not greater than the last
    one allocated. Only in that case may the new ID collide with an
    existing one. This is being done irrespective of the ipcmni mode.

    In order to avoid any races, the index is first allocated and then the
    pointer is replaced.

    Changes compared to the initial patch:
    - Handle failures from idr_alloc().
    - Prevent concurrent operations from seeing the wrong sequence number.
    (This is achieved by using idr_replace()).
    - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
    ipcmni_seq_shift().
    - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

    Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
    Signed-off-by: Manfred Spraul
    Signed-off-by: Waiman Long
    Suggested-by: Matthew Wilcox
    Acked-by: Waiman Long
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: "Eric W . Biederman"
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: "Luis R. Rodriguez"
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
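
The allocation rule from the commit message above (bump the sequence number only when the new index is not greater than the last one handed out) can be sketched as a user-space model. The class `IpcIds`, the lowest-free-index allocator and the value of `SEQ_SHIFT` are assumptions for illustration; sequence wraparound and locking are left out.

```python
SEQ_SHIFT = 15  # assumed number of bits reserved for the index


class IpcIds:
    def __init__(self):
        self.slots = {}     # index -> object (None while only reserved)
        self.seq = 0
        self.last_idx = -1

    def _lowest_free_idx(self):
        idx = 0
        while idx in self.slots:
            idx += 1
        return idx

    def alloc(self, obj):
        idx = self._lowest_free_idx()
        self.slots[idx] = None      # reserve the index first
        if idx <= self.last_idx:    # only now can an old ID be repeated
            self.seq += 1
        self.last_idx = idx
        self.slots[idx] = obj       # then publish the object
        return (self.seq << SEQ_SHIFT) + idx

    def remove(self, ipc_id):
        del self.slots[ipc_id & ((1 << SEQ_SHIFT) - 1)]


ids = IpcIds()
a = ids.alloc("a")
b = ids.alloc("b")
c = ids.alloc("c")
ids.remove(b)
d = ids.alloc("d")  # reuses index 1, so the sequence number is bumped
e = ids.alloc("e")  # index 3 is above the last index, no bump needed
```

Three ascending allocations consume no sequence numbers at all; only the reuse of a lower index does.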
     
  • This patch implements the PPS ECHO functionality for pps-gpio, which
    sysfs already claims is available.

    Configuration is done via device tree bindings.

    No changes are made to userspace interfaces.

    This patch was originally written by Lukas Senger as part of a master's
    thesis project and modified for inclusion into the Linux kernel by Tom
    Burkart.

    Link: http://lkml.kernel.org/r/20190324043305.6627-4-tom@aussec.com
    Signed-off-by: Tom Burkart
    Acked-by: Rodolfo Giometti
    Signed-off-by: Lukas Senger
    Cc: Philipp Zabel
    Cc: Rob Herring
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Burkart
     
  • This patch changes the GPIO access for the pps-gpio driver from the
    integer based API to the descriptor based API.

    The integer based API is considered deprecated and the descriptor based
    API is the preferred way to access GPIOs as per
    Documentation/driver-api/gpio/intro.rst

    No changes are made to userspace interfaces.

    Link: http://lkml.kernel.org/r/20190324043305.6627-2-tom@aussec.com
    Signed-off-by: Tom Burkart
    Acked-by: Rodolfo Giometti
    Reviewed-by: Philipp Zabel
    Cc: Lukas Senger
    Cc: Rob Herring
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Burkart
     
  • Allow specifying reboot_mode for panic only. This is needed on systems
    where ramoops is used to store panic logs and the user wants to use a
    warm reset to preserve those, while still having a cold reset on normal
    reboots.

    Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
    Signed-off-by: Aaro Koskinen
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaro Koskinen
     
  • When a kernel panic happens, the kernel first prints the panic call
    stack, then an ending message like:

    [ 35.743249] ---[ end Kernel panic - not syncing: Fatal exception
    [ 35.749975] ------------[ cut here ]------------

    The above messages are very useful for debugging.

    But if the system is configured not to reboot on panic, say the
    "panic_timeout" parameter equals 0, it will likely print many noisy
    messages, such as a WARN() call stack for each and every CPU except
    the panicking one, like below:

    WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
    Call Trace:

    try_to_wake_up
    default_wake_function
    autoremove_wake_function
    __wake_up_common
    __wake_up_common_lock
    __wake_up
    wake_up_klogd_work_func
    irq_work_run_list
    irq_work_tick
    update_process_times
    tick_sched_timer
    __hrtimer_run_queues
    hrtimer_interrupt
    smp_apic_timer_interrupt
    apic_timer_interrupt

    For people working in console mode, the screen will first show the panic
    call stack, which is then immediately overwritten by these noisy extra
    messages. This makes debugging much more difficult, as the original
    context gets lost on screen.

    These noisy messages also confuse some users: I have seen many bug
    reporters post the noisy messages to bugzilla instead of the real
    panic call stack and context.

    Add a flag "suppress_printk", which gets set in panic(), to avoid those
    noisy messages. As suggested by Petr Mladek, this does not change the
    current kernel behavior: both panic blinking and the sysrq magic key
    still work as is.

    To verify this, make sure the kernel is not configured to reboot on
    panic, then run the following in a console
    # echo c > /proc/sysrq-trigger
    and check that the console only prints the panic call stack.
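
    As an illustration only (not the patch itself), the gating idea can be
    modelled in user space. model_printk() and model_panic() are
    hypothetical stand-ins for the kernel's logging and panic paths; only
    the flag name is taken from the description above.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* User-space stand-in for the flag described above. */
static bool suppress_printk;

/* Hypothetical printk-like logger: returns the number of characters
 * written, or 0 when output is suppressed. */
static int model_printk(const char *fmt, ...)
{
	va_list args;
	int written;

	if (suppress_printk)
		return 0;

	va_start(args, fmt);
	written = vfprintf(stderr, fmt, args);
	va_end(args);
	return written;
}

/* Hypothetical panic path: report the panic first, then mute everything
 * that would otherwise scroll the panic context off the screen. */
static void model_panic(const char *reason)
{
	model_printk("Kernel panic - not syncing: %s\n", reason);
	suppress_printk = true;
}
```

    In this model, anything logged after model_panic() is dropped, so the
    panic report stays the last thing on the console.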

    Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
    Signed-off-by: Feng Tang
    Suggested-by: Petr Mladek
    Reviewed-by: Petr Mladek
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Kees Cook
    Cc: Borislav Petkov
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Greg Kroah-Hartman
    Cc: Jiri Slaby
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • cpumask_parse() finds the first occurrence of either a newline or the
    end of the string by calling strchr() and strlen(). We can do better
    with a single call to strchrnul().
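
    A minimal user-space sketch of the two approaches follows. The helper
    names are hypothetical, and strchrnul_model() is a portable stand-in
    with the same semantics as strchrnul(), so the example does not depend
    on a particular C library.

```c
#include <stddef.h>
#include <string.h>

/* Portable stand-in for strchrnul(): returns a pointer to the first c in
 * s, or to the terminating NUL if c does not occur. */
static const char *strchrnul_model(const char *s, int c)
{
	while (*s && *s != (char)c)
		s++;
	return s;
}

/* Two lookups: strchr() for the newline, then strlen() as a fallback. */
static size_t line_len_two_calls(const char *buf)
{
	const char *nl = strchr(buf, '\n');

	return nl ? (size_t)(nl - buf) : strlen(buf);
}

/* One lookup: the returned pointer is always valid, so the length is a
 * plain pointer difference with no NULL check. */
static size_t line_len_one_call(const char *buf)
{
	return (size_t)(strchrnul_model(buf, '\n') - buf);
}
```

    Both helpers return the same length for any NUL-terminated input; the
    second one simply walks the string once.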

    [akpm@linux-foundation.org: remove unneeded cast]
    Link: http://lkml.kernel.org/r/20190409204208.12190-1-ynorov@marvell.com
    Signed-off-by: Yury Norov
    Acked-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yury Norov
     
  • struct linux_binprm::buf is the first field and it is exactly 128 bytes
    in size. It means that on x86_64 all accesses to other fields will go
    through [r64 + disp32] addressing mode, which is 3 bytes larger than
    [r64 + disp8] addressing mode. Given that accesses to other fields
    outnumber accesses to ->buf, move it down.
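
    The effect can be checked with offsetof() on two made-up layouts. The
    structs below are hypothetical and much smaller than the real
    linux_binprm; they only show that a 128-byte leading buffer pushes
    every later field past the signed 8-bit displacement range.

```c
#include <stddef.h>

#define BUF_SIZE 128 /* size of ->buf as stated above */

/* Hypothetical layout with the buffer first: every later field sits at
 * offset >= 128, outside the disp8 range [-128, 127]. */
struct buf_first {
	char buf[BUF_SIZE];
	void *file;
	int argc;
};

/* Same hypothetical fields with the buffer last: the frequently used
 * fields now sit at small offsets that fit in a one-byte displacement. */
struct buf_last {
	void *file;
	int argc;
	char buf[BUF_SIZE];
};

/* True when a non-negative field offset can be encoded as an x86 disp8. */
static int fits_disp8(size_t offset)
{
	return offset <= 127;
}
```

    A disp32 takes four bytes where a disp8 takes one, which is where the
    3 bytes per access come from.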

    Space savings (x86_64 defconfig) are listed below. Expect more on
    distro configs, because LSMs actively dereference "bprm" but do not
    care about the first 128 bytes of the executable itself.

    add/remove: 0/0 grow/shrink: 0/24 up/down: 0/-492 (-492)
    Function                                    old     new   delta
    selinux_bprm_committing_creds               552     549      -3
    finalize_exec                                94      91      -3
    __audit_log_bprm_fcaps                      283     280      -3
    __audit_bprm                                 39      36      -3
    perf_trace_sched_process_exec               347     341      -6
    install_exec_creds                          105      99      -6
    cap_bprm_set_creds.cold                      60      54      -6
    would_dump                                  137     128      -9
    load_script                                 637     628      -9
    bprm_change_interp                           61      52      -9
    trace_event_raw_event_sched_process_exec    260     250     -10
    search_binary_handler                       255     240     -15
    remove_arg_zero                             295     277     -18
    free_bprm                                   119     101     -18
    prepare_binprm                              379     360     -19
    setup_new_exec                              336     315     -21
    flush_old_exec                             1638    1617     -21
    copy_strings.isra                           746     724     -22
    setup_arg_pages                             559     530     -29
    load_misc_binary                           1151    1118     -33
    selinux_bprm_set_creds                      792     753     -39
    load_elf_binary                           11111   11072     -39
    cap_bprm_set_creds                         1496    1454     -42
    __do_execve_file.isra                      2395    2286    -109

    Link: http://lkml.kernel.org/r/20190421165025.GA26843@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The ror32 implementation (word >> shift) | (word << (32 - shift)) has
    undefined behaviour if shift is outside the [1, 31] range. Similarly
    for the 64 bit variants. Most callers pass a compile-time constant
    (naturally in that range), but there's an UBSAN report that these may
    actually be called with a shift count of 0.

    Instead of special-casing that, we can make them DTRT for all values of
    shift while also avoiding UB. For some reason, this was already partly
    done for rol32 (which was well-defined for [0, 31]). gcc 8 recognizes
    these patterns as rotates, so for example

    __u32 rol32(__u32 word, unsigned int shift)
    {
            return (word << (shift & 31)) | (word >> ((-shift) & 31));
    }

    compiles to

    0000000000000020 <rol32>:
      20:   89 f8                   mov    %edi,%eax
      22:   89 f1                   mov    %esi,%ecx
      24:   d3 c0                   rol    %cl,%eax
      26:   c3                      retq

    Older compilers unfortunately do not do as well, but this only affects
    the small minority of users that don't pass constants.

    Due to integer promotions, ro[lr]8 were already well-defined for shifts
    in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16]
    (only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set,
    word << 16 is undefined). For consistency, update those as well.
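
    For illustration, the same masking pattern applied to a few other
    variants might look as follows. This is a user-space sketch using
    <stdint.h> types rather than the kernel's __u32 and friends, and it
    assumes int is at least 32 bits wide.

```c
#include <stdint.h>

/* Rotate right, 32 bit: the masks keep both shift counts in [0, 31]. */
static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Rotate right, 64 bit: same idea with a 6-bit mask. */
static inline uint64_t ror64(uint64_t word, unsigned int shift)
{
	return (word >> (shift & 63)) | (word << ((-shift) & 63));
}

/* Rotate left, 16 bit: after masking, the largest left shift is 15, so
 * the promoted int value (at most 0xffff << 15) cannot overflow. */
static inline uint16_t rol16(uint16_t word, unsigned int shift)
{
	return (uint16_t)((word << (shift & 15)) | (word >> ((-shift) & 15)));
}
```

    With a shift of 0 (or any multiple of the width) both masked counts
    become 0, so the result is simply word | word, i.e. word itself.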

    Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reported-by: Ido Schimmel
    Tested-by: Ido Schimmel
    Reviewed-by: Will Deacon
    Cc: Vadim Pasternak
    Cc: Andrey Ryabinin
    Cc: Jacek Anaszewski
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Integer exponentiation is used in a few places and might be used by
    other call sites in the future. Move it to a common location for wider
    use.
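
    As a sketch of what such a helper typically looks like, here is integer
    exponentiation by squaring. The name int_pow_sketch() and its signature
    are assumptions made for this example, not a copy of the moved code.

```c
#include <stdint.h>

/* Computes base^exp with O(log exp) multiplications. Overflow wraps
 * modulo 2^64; the caller is assumed to pick inputs that fit. */
static uint64_t int_pow_sketch(uint64_t base, unsigned int exp)
{
	uint64_t result = 1;

	while (exp) {
		if (exp & 1)
			result *= base; /* this bit of exp contributes base^(2^k) */
		exp >>= 1;
		base *= base;           /* move on to the next power of two */
	}
	return result;
}
```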

    Link: http://lkml.kernel.org/r/20190323172531.80025-2-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Cc: Daniel Thompson
    Cc: Lee Jones
    Cc: Ray Jui
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Rather than a fixed-size array of pending sorted runs, use the ->prev
    links to keep track of things. This reduces stack usage, eliminates
    some ugly overflow handling, and reduces the code size.

    Also:
    * merge() no longer needs to handle NULL inputs, so simplify.
    * The same applies to merge_and_restore_back_links(), which is renamed
    to the less ponderous merge_final(). (It's a static helper function,
    so we don't need a super-descriptive name; comments will do.)
    * Document the actual return value requirements on the (*cmp)()
    function; some callers are already using this feature.
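
    To make the merge() and (*cmp)() points concrete, here is a simplified
    sketch on a singly linked list. The node type and cmp_key() are
    hypothetical, and the comparison contract assumed is: a result > 0
    means a must sort after b, anything else keeps a first.

```c
#include <stddef.h>

struct node {
	struct node *next;
	int key;
};

typedef int (*cmp_fn)(const struct node *a, const struct node *b);

/* A boolean "greater than" is enough to satisfy the assumed contract. */
static int cmp_key(const struct node *a, const struct node *b)
{
	return a->key > b->key;
}

/* Merge two non-empty, NULL-terminated sorted lists. Neither input may be
 * NULL, so no empty-list special cases are needed. Ties take from a first,
 * which keeps the merge stable. */
static struct node *merge(cmp_fn cmp, struct node *a, struct node *b)
{
	struct node *head = NULL, **tail = &head;

	for (;;) {
		if (cmp(a, b) <= 0) {
			*tail = a;
			tail = &a->next;
			a = a->next;
			if (!a) {
				*tail = b; /* a ran out: append the rest of b */
				break;
			}
		} else {
			*tail = b;
			tail = &b->next;
			b = b->next;
			if (!b) {
				*tail = a; /* b ran out: append the rest of a */
				break;
			}
		}
	}
	return head;
}
```

    Because only the sign of "greater than" matters, a callback that
    returns 0 or 1 works as well as a full three-way comparison.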

    x86-64 code size 1086 -> 739 bytes (-347)

    (Yes, I see checkpatch complaining about no space after comma in
    "__attribute__((nonnull(2,3,4,5)))". Checkpatch is wrong.)

    Feedback from Rasmus Villemoes, Andy Shevchenko and Geert Uytterhoeven.

    [akpm@linux-foundation.org: remove __pure usage due to mysterious warning]
    Link: http://lkml.kernel.org/r/f63c410e0ff76009c9b58e01027e751ff7fdb749.1552704200.git.lkml@sdf.org
    Signed-off-by: George Spelvin
    Acked-by: Andrey Abramov
    Acked-by: Rasmus Villemoes
    Reviewed-by: Andy Shevchenko
    Cc: Geert Uytterhoeven
    Cc: Daniel Wagner
    Cc: Dave Chinner
    Cc: Don Mullis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • This is a lot more appropriate than PI_LIST, which in the kernel one
    would assume has to do with priority inheritance, which it does not.
    Furthermore, futexes make use of plists, so this can be even more
    confusing, despite the debug nature of the config option.

    Link: http://lkml.kernel.org/r/20190317185434.1626-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The name clear_all_latency_tracing is misleading: it only clears the
    per-task latency_record[], and there is another function named
    clear_global_latency_tracing which clears the global latency_record[]
    buffer.

    Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
    Signed-off-by: Lin Feng
    Cc: Alexey Dobriyan
    Cc: Fabian Frederick
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     
  • Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
    CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.

    The idea is obviously arch-agnostic. This commit moves the config entry
    from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
    architectures can benefit from it.

    This can make a huge difference in kernel image size especially when
    CONFIG_OPTIMIZE_FOR_SIZE is enabled.

    For example, I got a 3.5% smaller arm64 kernel for v5.1-rc1.

    dec file
    18983424 arch/arm64/boot/Image.before
    18321920 arch/arm64/boot/Image.after

    This also slightly improves the "Kernel hacking" Kconfig menu as
    e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested;
    this config option would be a good fit in the "compiler option" menu.

    Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boris Brezillon
    Cc: Brian Norris
    Cc: Christophe Leroy
    Cc: David Woodhouse
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Marek Vasut
    Cc: Mark Rutland
    Cc: Mathieu Malaterre
    Cc: Miquel Raynal
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Stefan Agner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The "WITH Linux-syscall-note" should be added to headers exported to
    user space.

    Some kernel-space headers have "WITH Linux-syscall-note", which seems
    to be a mistake.

    [1] arch/x86/include/asm/hyperv-tlfs.h

    Commit 5a4858032217 ("x86/hyper-v: move hyperv.h out of uapi") moved
    this file out of uapi, but did not update the SPDX License tag.

    [2] include/asm-generic/shmparam.h

    Commit 76ce2a80a28e ("Rename include/{uapi => }/asm-generic/shmparam.h
    really") moved this file out of uapi, but did not update the SPDX
    License tag.

    [3] include/linux/qcom-geni-se.h

    Commit eddac5af0654 ("soc: qcom: Add GENI based QUP Wrapper driver")
    added this file, but I do not see a good reason why its license tag must
    include "WITH Linux-syscall-note".

    Link: http://lkml.kernel.org/r/1554196104-3522-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The select() implementation is carefully tuned to put a sensible amount
    of data on the stack for holding a copy of the user space fd_set, but
    not so much that it risks overflowing the kernel stack.

    When building a 32-bit kernel with clang, we need a little more space
    than with gcc, which often triggers a warning:

    fs/select.c:619:5: error: stack frame size of 1048 bytes in function 'core_sys_select' [-Werror,-Wframe-larger-than=]
    int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

    I experimentally found that for 32-bit ARM, reducing the maximum stack
    usage by 64 bytes keeps us reliably under the warning limit again.

    Link: http://lkml.kernel.org/r/20190307090146.1874906-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Andi Kleen
    Cc: Nick Desaulniers
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Eric Dumazet
    Cc: "Darrick J. Wong"
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • When freeing a page with an order >= shuffle_page_order, randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages this can
    tend to make the page allocator more predictable over time. Inject the
    front-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
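
    A simplified model of the front-or-back decision is sketched below.
    All type and function names are hypothetical, and the bit pool (one
    64-bit random word consumed a bit at a time) is an assumption about how
    the cost of the random source could be amortized, not a description of
    the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct page_node {
	struct page_node *prev, *next;
};

/* Circular doubly linked free list with a sentinel head. */
struct free_list {
	struct page_node head;
};

static void free_list_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
}

static void insert_between(struct page_node *n, struct page_node *prev,
			   struct page_node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* Insert at the back or the front depending on one random bit. */
static void free_list_add_random(struct free_list *fl, struct page_node *n,
				 bool to_tail)
{
	if (to_tail)
		insert_between(n, fl->head.prev, &fl->head);
	else
		insert_between(n, &fl->head, fl->head.next);
}

/* Hypothetical bit pool: rng() is only called once per 64 decisions. */
struct bit_pool {
	uint64_t bits;
	unsigned int left;
};

static bool next_random_bit(struct bit_pool *pool, uint64_t (*rng)(void))
{
	bool bit;

	if (pool->left == 0) {
		pool->bits = rng();
		pool->left = 64;
	}
	bit = pool->bits & 1;
	pool->bits >>= 1;
	pool->left--;
	return bit;
}
```

    A caller would combine the two, for example
    free_list_add_random(&fl, page, next_random_bit(&pool, rng)).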

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node
    they would fit and cause zero conflicts. While on separate emulated
    nodes without randomization they underutilized the cache and
    conflicted unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe a measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    impact may be negligible to negative for other workloads. For this
    reason the shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is a potential
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in
    order. However, it should be noted that the concrete security benefits
    are hard to quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
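
    For readers unfamiliar with the algorithm, a Fisher-Yates shuffle over
    an array of block indices can be sketched as below. This is a
    user-space illustration: shuffle_blocks() is a hypothetical name, the
    array stands in for the free lists, and xorshift64() is used only to
    keep the example deterministic; it is not a suitable entropy source
    for a security feature.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift generator; the state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Fisher-Yates: walk from the last slot down, swapping each slot with a
 * randomly chosen slot at or below it. The modulo introduces a slight
 * bias that is ignored here for brevity. */
static void shuffle_blocks(unsigned long *pfn, size_t n, uint64_t *state)
{
	size_t i, j;
	unsigned long tmp;

	if (n < 2)
		return;
	for (i = n - 1; i > 0; i--) {
		j = (size_t)(xorshift64(state) % (i + 1));
		tmp = pfn[i];
		pfn[i] = pfn[j];
		pfn[j] = tmp;
	}
}
```

    Each element is swapped at most once as the target slot, so the pass
    is linear in the number of blocks and the result is always a
    permutation of the input.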

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    10, or 4MB. This trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.

    This initial randomization can be undone over time so a follow-on patch is
    introduced to inject entropy on page free decisions. It is reasonable to
    ask if the page free entropy is sufficient, but it is not enough due to
    the in-order initial freeing of pages. At the start of that process
    putting page1 in front or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Pressure metrics are already recorded and exposed in procfs for the
    entire system, but any tool which monitors cgroup pressure has to
    special case the root cgroup to read from procfs. This patch exposes
    the already recorded pressure metrics on the root cgroup.

    Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
    Signed-off-by: Dan Schatzberg
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Schatzberg
     
  • Psi monitor aims to provide a low-latency short-term pressure detection
    mechanism configurable by users. It allows users to monitor psi metrics
    growth and trigger events whenever a metric rises above a user-defined
    threshold within a user-defined time window.

    Time window and threshold are both expressed in usecs. Multiple psi
    resources with different thresholds and window sizes can be monitored
    concurrently.

    Psi monitors activate when the system enters a stall state for the
    monitored psi metric and deactivate upon exit from the stall state.
    While the system is in the stall state, psi signal growth is monitored
    at a rate of 10 times per tracking window. The minimum window size is
    500ms, therefore the minimum monitoring interval is 50ms. The maximum
    window size is 10s, with a monitoring interval of 1s.
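
    The window limits and the resulting monitoring interval can be written
    down as a small helper. The function and macro names are hypothetical.
    The trigger string layout "<some|full> <threshold us> <window us>" is
    not stated in this message; it is an assumption based on the kernel's
    PSI documentation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define WINDOW_MIN_US 500000ULL    /* 500ms, per the limits above */
#define WINDOW_MAX_US 10000000ULL  /* 10s */
#define UPDATES_PER_WINDOW 10

/* Returns the monitoring interval in microseconds, or 0 if the window is
 * outside the accepted range. */
static uint64_t psi_poll_interval_us(uint64_t window_us)
{
	if (window_us < WINDOW_MIN_US || window_us > WINDOW_MAX_US)
		return 0;
	return window_us / UPDATES_PER_WINDOW;
}

/* Formats a trigger such as "some 150000 1000000" (assumed layout).
 * Returns the string length, or -1 if the window is out of range. */
static int psi_format_trigger(char *buf, size_t len, const char *type,
			      uint64_t threshold_us, uint64_t window_us)
{
	if (!psi_poll_interval_us(window_us))
		return -1;
	return snprintf(buf, len, "%s %llu %llu", type,
			(unsigned long long)threshold_us,
			(unsigned long long)window_us);
}
```

    For example, a 1s window with a 150ms threshold yields a monitoring
    interval of 100ms while the monitored metric is in a stall state.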

    When activated, a psi monitor stays active for at least the duration of
    one tracking window to avoid repeated activations/deactivations when the
    psi signal is bouncing.

    Notifications to the users are rate-limited to one per tracking window.

    Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Johannes Weiner
    Cc: Dennis Zhou
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan