Eric Lee / smarc-fsl-linux-kernel

17 May, 2019

1 commit

dc413a90e Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc ... Browse Code »

Pull ARM SoC-related driver updates from Olof Johansson:
"Various driver updates for platforms and a couple of the small driver
subsystems we merge through our tree:

Among the larger pieces:

- Power management improvements for TI am335x and am437x (RTC
suspend/wake)

- Misc new additions for Amlogic (socinfo updates)

- ZynqMP FPGA manager

- Nvidia improvements for reset/powergate handling

- PMIC wrapper for Mediatek MT8516

- Misc fixes/improvements for ARM SCMI, TEE, NXP i.MX SCU drivers"

* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (57 commits)
soc: aspeed: fix Kconfig
soc: add aspeed folder and misc drivers
spi: zynqmp: Fix build break
soc: imx: Add generic i.MX8 SoC driver
MAINTAINERS: Update email for Qualcomm SoC maintainer
memory: tegra: Fix a typos for "fdcdwr2" mc client
Revert "ARM: tegra: Restore memory arbitration on resume from LP1 on Tegra30+"
memory: tegra: Replace readl-writel with mc_readl-mc_writel
memory: tegra: Fix integer overflow on tick value calculation
memory: tegra: Fix missed registers values latching
ARM: tegra: cpuidle: Handle tick broadcasting within cpuidle core on Tegra20/30
optee: allow to work without static shared memory
soc/tegra: pmc: Move powergate initialisation to probe
soc/tegra: pmc: Remove reset sysfs entries on error
soc/tegra: pmc: Fix reset sources and levels
soc: amlogic: meson-gx-pwrc-vpu: Add support for G12A
soc: amlogic: meson-gx-pwrc-vpu: Fix power on/off register bitmask
fpga manager: Adding FPGA Manager support for Xilinx zynqmp
dt-bindings: fpga: Add bindings for ZynqMP fpga driver
firmware: xilinx: Add fpga API's
...

Linus Torvalds
2019-05-17 00:19:14 +0800

16 May, 2019

7 commits

e8a1d7011 Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc ... Browse Code »

Pull ARM Device-tree updates from Olof Johansson:
"Besides new bindings and additional descriptions of hardware blocks
for various SoCs and boards, the main new contents here is:

SoCs:
- Intel Agilex (SoCFPGA)
- NXP i.MX8MM (Quad Cortex-A53 with media/graphics focus)

New boards:
- Allwinner:
+ RerVision H3-DVK (H3)
+ Oceanic 5205 5inMFD (H6)
+ Beelink GS2 (H6)
+ Orange Pi 3 (H6)
- Rockchip:
+ Orange Pi RK3399
+ Nanopi NEO4
+ Veyron-Mighty Chromebook variant
- Amlogic:
+ SEI Robotics SEI510
- ST Micro:
+ stm32mp157a discovery1
+ stm32mp157c discovery2
- NXP:
+ Eckelmann ci4x10 (i.MX6DL)
+ i.MX8MM EVK (i.MX8MM)
+ ZII i.MX7 RPU2 (i.MX7)
+ ZII SPB4 (VF610)
+ Zii Ultra (i.MX8M)
+ TQ TQMa7S (i.MX7Solo)
+ TQ TQMa7D (i.MX7Dual)
+ Kobo Aura (i.MX50)
+ Menlosystems M53 (i.MX53)j
- Nvidia:
+ Jetson Nano (Tegra T210)"

* tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (593 commits)
arm64: dts: bitmain: Add UART pinctrl support for Sophon Edge
arm64: dts: bitmain: Add pinctrl support for BM1880 SoC
arm64: dts: bitmain: Add GPIO Line names for Sophon Edge board
arm64: dts: bitmain: Add GPIO support for BM1880 SoC
ARM: dts: gemini: Indent DIR-685 partition table
dt-bindings: hwmon (pwm-fan) Remove dead "cooling-*-state" properties
ARM: dts: qcom-apq8064: Set 'cxo_board' as ref clock of the DSI PHY
arm64: dts: msm8998: thermal: Restrict thermal zone name length to under 20
arm64: dts: msm8998: thermal: Fix number of supported sensors
arm64: dts: msm8998-mtp: thermal: Remove skin and battery thermal zones
arm64: dts: exynos: Move fixed-clocks out of soc
arm64: dts: exynos: Move pmu and timer nodes out of soc
ARM: dts: s5pv210: Fix camera clock provider on Goni board
ARM: dts: exynos: Properly override node to use MDMA0 on Universal C210
ARM: dts: exynos: Move fixed-clocks out of soc on Exynos3250
ARM: dts: exynos: Remove unneeded address/size cells from fixed-clock on Exynos3250
ARM: dts: exynos: Move pmu and timer nodes out of soc
arm64: dts: rockchip: fix IO domain voltage setting of APIO5 on rockpro64
arm64: dts: db820c: Add sound card support
arm64: dts: apq8096-db820c: Add HDMI display support
...

Linus Torvalds
2019-05-16 23:38:17 +0800
22c58fd70 Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc ... Browse Code »

Pull ARM SoC platform updates from Olof Johansson:
"SoC updates, mostly refactorings and cleanups of old legacy platforms.

Major themes this release:

- Conversion of ixp4xx to a modern platform (drivers, DT, bindings)

- Moving some of the ep93xx headers around to get it closer to
multiplatform enabled.

- Cleanups of Davinci

This also contains a few patches that were queued up as fixes before
5.1 but I didn't get sent in before release"

* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (123 commits)
ARM: debug-ll: add default address for digicolor
ARM: u300: regulator: add MODULE_LICENSE()
ARM: ep93xx: move private headers out of mach/*
ARM: ep93xx: move pinctrl interfaces into include/linux/soc
ARM: ep93xx: keypad: stop using mach/platform.h
ARM: ep93xx: move network platform data to separate header
ARM: stm32: add AMBA support for stm32 family
MAINTAINERS: update arch/arm/mach-davinci
ARM: rockchip: add missing of_node_put in rockchip_smp_prepare_pmu
ARM: dts: Add queue manager and NPE to the IXP4xx DTSI
soc: ixp4xx: qmgr: Add DT probe code
soc: ixp4xx: qmgr: Add DT bindings for IXP4xx qmgr
soc: ixp4xx: npe: Add DT probe code
soc: ixp4xx: Add DT bindings for IXP4xx NPE
soc: ixp4xx: qmgr: Pass resources
soc: ixp4xx: Remove unused functions
soc: ixp4xx: Uninline several functions
soc: ixp4xx: npe: Pass addresses as resources
ARM: ixp4xx: Turn the QMGR into a platform device
ARM: ixp4xx: Turn the NPE into a platform device
...

Linus Torvalds
2019-05-16 23:31:32 +0800
a455eda33 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal ... Browse Code »

Pull thermal soc updates from Eduardo Valentin:

- thermal core has a new devm_* API for registering cooling devices. I
took the entire series, that is why you see changes on drivers/hwmon
in this pull (Guenter Roeck)

- rockchip thermal driver gains support to PX30 SoC (Elaine Zhang)

- the generic-adc thermal driver now considers the lookup table DT
property as optional (Jean-Francois Dagenais)

- Refactoring of tsens thermal driver (Amit Kucheria)

- Cleanups on cpu cooling driver (Daniel Lezcano)

- broadcom thermal driver dropped support to ACPI (Srinath Mannam)

- tegra thermal driver gains support to OC hw throttle and GPU throtle
(Wei Ni)

- Fixes in several thermal drivers.

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal: (59 commits)
hwmon: (pwm-fan) Use devm_thermal_of_cooling_device_register
hwmon: (npcm750-pwm-fan) Use devm_thermal_of_cooling_device_register
hwmon: (mlxreg-fan) Use devm_thermal_of_cooling_device_register
hwmon: (gpio-fan) Use devm_thermal_of_cooling_device_register
hwmon: (aspeed-pwm-tacho) Use devm_thermal_of_cooling_device_register
thermal: rcar_gen3_thermal: Fix to show correct trip points number
thermal: rcar_thermal: update calculation formula for R-Car Gen3 SoCs
thermal: cpu_cooling: Actually trace CPU load in thermal_power_cpu_get_power
thermal: rockchip: Support the PX30 SoC in thermal driver
dt-bindings: rockchip-thermal: Support the PX30 SoC compatible
thermal: rockchip: fix up the tsadc pinctrl setting error
thermal: broadcom: Remove ACPI support
thermal: Fix build error of missing devm_ioremap_resource on UM
thermal/drivers/cpu_cooling: Remove pointless field
thermal/drivers/cpu_cooling: Add Software Package Data Exchange (SPDX)
thermal/drivers/cpu_cooling: Fixup the header and copyright
thermal/drivers/cpu_cooling: Remove pointless test in power2state()
thermal: rcar_gen3_thermal: disable interrupt in .remove
thermal: rcar_gen3_thermal: fix interrupt type
thermal: Introduce devm_thermal_of_cooling_device_register
...

Linus Torvalds
2019-05-16 22:56:57 +0800
7a0c4c170 Merge branch 'fixes' into arm/soc ... Browse Code »

Merge in a few pending fixes from pre-5.1 that didn't get sent in:

MAINTAINERS: update arch/arm/mach-davinci
ARM: dts: ls1021: Fix SGMII PCS link remaining down after PHY disconnect
ARM: dts: imx6q-logicpd: Reduce inrush current on USBH1
ARM: dts: imx6q-logicpd: Reduce inrush current on start
ARM: dts: imx: Fix the AR803X phy-mode
ARM: dts: sun8i: a33: Reintroduce default pinctrl muxing
arm64: dts: allwinner: a64: Rename hpvcc-supply to cpvdd-supply
ARM: sunxi: fix a leaked reference by adding missing of_node_put
ARM: sunxi: fix a leaked reference by adding missing of_node_put

Signed-off-by: Olof Johansson

Olof Johansson
2019-05-16 13:51:48 +0800
8649efb2f Merge tag 'for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply ... Browse Code »

Pull power supply and reset updates from Sebastian Reichel:
"Core:
- Add over-current health state
- Add standard, adaptive and custom charge types
- Add new properties for start/end charge threshold

New Drivers / Hardware:
- UCS1002 Programmable USB Port Power Controller
- Ingenic JZ47xx Battery Fuel Gauge
- AXP20x USB Power: Add AXP813 support
- AT91 poweroff: Add SAM9X60 support
- OLPC battery: Add XO-1.5 and XO-1.75 support

Misc Changes:
- syscon-reboot: support mask property
- AXP288 fuel gauge: Blacklist ACEPC T8/T11. Looks like some vendor
thought it's a good idea to build a desktop system with a fuel
gauge, that slowly "discharges"...
- cpcap-battery: Fix calculation errors
- misc fixes"

* tag 'for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (54 commits)
power: supply: olpc_battery: force the le/be casts
power: supply: ucs1002: Fix build error without CONFIG_REGULATOR
power: supply: ucs1002: Fix wrong return value checking
power: supply: Add driver for Microchip UCS1002
dt-bindings: power: supply: Add bindings for Microchip UCS1002
power: supply: core: Add POWER_SUPPLY_HEALTH_OVERCURRENT constant
power: supply: core: fix clang -Wunsequenced
power: supply: core: Add missing documentation for CHARGE_CONTROL_* properties
power: supply: core: Add CHARGE_CONTROL_{START_THRESHOLD,END_THRESHOLD} properties
power: supply: core: Add Standard, Adaptive, and Custom charge types
power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist
power: supply: bq27xxx_battery: Notify also about status changes
power: supply: olpc_battery: Have the framework register sysfs files for us
power: supply: olpc_battery: Add OLPC XO 1.75 support
power: supply: olpc_battery: Avoid using platform_info
power: supply: olpc_battery: Use devm_power_supply_register()
power: supply: olpc_battery: Move priv data to a struct
power: supply: olpc_battery: Use DT to get battery version
x86/platform/olpc: Use a correct version when making up a battery node
x86/platform/olpc: Trivial code move in DT fixup
...

Linus Torvalds
2019-05-16 09:50:40 +0800
700a800a9 Merge tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd updates from Bruce Fields:
"This consists mostly of nfsd container work:

Scott Mayhew revived an old api that communicates with a userspace
daemon to manage some on-disk state that's used to track clients
across server reboots. We've been using a usermode_helper upcall for
that, but it's tough to run those with the right namespaces, so a
daemon is much friendlier to container use cases.

Trond fixed nfsd's handling of user credentials in user namespaces. He
also contributed patches that allow containers to support different
sets of NFS protocol versions.

The only remaining container bug I'm aware of is that the NFS reply
cache is shared between all containers. If anyone's aware of other
gaps in our container support, let me know.

The rest of this is miscellaneous bugfixes"

* tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
nfsd: update callback done processing
locks: move checks from locks_free_lock() to locks_release_private()
nfsd: fh_drop_write in nfsd_unlink
nfsd: allow fh_want_write to be called twice
nfsd: knfsd must use the container user namespace
SUNRPC: rsi_parse() should use the current user namespace
SUNRPC: Fix the server AUTH_UNIX userspace mappings
lockd: Pass the user cred from knfsd when starting the lockd server
SUNRPC: Temporary sockets should inherit the cred from their parent
SUNRPC: Cache the process user cred in the RPC server listener
nfsd: Allow containers to set supported nfs versions
nfsd: Add custom rpcbind callbacks for knfsd
SUNRPC: Allow further customisation of RPC program registration
SUNRPC: Clean up generic dispatcher code
SUNRPC: Add a callback to initialise server requests
SUNRPC/nfs: Fix return value for nfs4_callback_compound()
nfsd: handle legacy client tracking records sent by nfsdcld
nfsd: re-order client tracking method selection
nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
nfsd: un-deprecate nfsdcld
...

Linus Torvalds
2019-05-16 09:21:43 +0800
d2d8b1460 Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing updates from Steven Rostedt:
"The major changes in this tracing update includes:

- Removal of non-DYNAMIC_FTRACE from 32bit x86

- Removal of mcount support from x86

- Emulating a call from int3 on x86_64, fixes live kernel patching

- Consolidated Tracing Error logs file

Minor updates:

- Removal of klp_check_compiler_support()

- kdb ftrace dumping output changes

- Accessing and creating ftrace instances from inside the kernel

- Clean up of #define if macro

- Introduction of TRACE_EVENT_NOP() to disable trace events based on
config options

And other minor fixes and clean ups"

* tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
x86: Hide the int3_emulate_call/jmp functions from UML
livepatch: Remove klp_check_compiler_support()
ftrace/x86: Remove mcount support
ftrace/x86_32: Remove support for non DYNAMIC_FTRACE
tracing: Simplify "if" macro code
tracing: Fix documentation about disabling options using trace_options
tracing: Replace kzalloc with kcalloc
tracing: Fix partial reading of trace event's id file
tracing: Allow RCU to run between postponed startup tests
tracing: Fix white space issues in parse_pred() function
tracing: Eliminate const char[] auto variables
ring-buffer: Fix mispelling of Calculate
tracing: probeevent: Fix to make the type of $comm string
tracing: probeevent: Do not accumulate on ret variable
tracing: uprobes: Re-enable $comm support for uprobe events
ftrace/x86_64: Emulate call function while updating in breakpoint handler
x86_64: Allow breakpoints to emulate call instructions
x86_64: Add gap to int3 to allow for call emulation
tracing: kdb: Allow ftdump to skip all but the last few entries
tracing: Add trace_total_entries() / trace_total_entries_cpu()
...

Linus Torvalds
2019-05-16 07:05:47 +0800

15 May, 2019

32 commits

fcdec1436 Merge tag 'acpi-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull more ACPI updates from Rafael Wysocki:
"These fix two regressions introduced during the 5.0 cycle, in ACPICA
and in device PM, cause the values returned by _ADR to be stored in 64
bits and fix two ACPI documentation issues.

Specifics:

- Update the ACPICA code in the kernel to upstream revision 20190509
including one regression fix:
* Prevent excessive ACPI debug messages from being printed by
moving the ACPI_DEBUG_DEFAULT definition to the right place
(Erik Schmauss).

- Set the enable_for_wake bits for wakeup GPEs during suspend to idle
to allow acpi_enable_all_wakeup_gpes() to enable them as
aproppriate and make wakeup devices sighaling events through ACPI
GPEs work with suspend-to-idle again (Rajat Jain).

- Use 64 bits to store the return values of _ADR which are assumed to
be 64-bit by some bus specs and may contain nonzero bits in the
upper 32 bits part for some devices (Pierre-Louis Bossart).

- Fix two minor issues with the ACPI documentation (Sakari Ailus)"

* tag 'acpi-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle
Documentation: ACPI: Direct references are allowed to devices only
Documentation: ACPI: Use tabs for graph ASL indentation
ACPICA: Update version to 20190509
ACPICA: Linux: move ACPI_DEBUG_DEFAULT flag out of ifndef
ACPI: bus: change _ADR representation to 64 bits

Linus Torvalds
2019-05-15 23:58:49 +0800
bfbfbf736 Merge tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull more power management updates from Rafael Wysocki:
"These fix a recent regression causing kernels built with CONFIG_PM
unset to crash on systems that support the Performance and Energy Bias
Hint (EPB), clean up the cpufreq core and some users of transition
notifiers and introduce a new power domain flag into the generic power
domains framework (genpd).

Specifics:

- Fix recent regression causing kernels built with CONFIG_PM unset to
crash on systems that support the Performance and Energy Bias Hint
(EPB) by avoiding to compile the EPB-related code depending on
CONFIG_PM when it is unset (Rafael Wysocki).

- Clean up the transition notifier invocation code in the cpufreq
core and change some users of cpufreq transition notifiers
accordingly (Viresh Kumar).

- Change MAINTAINERS to cover the schedutil governor as part of
cpufreq (Viresh Kumar).

- Simplify cpufreq_init_policy() to avoid redundant computations (Yue
Hu).

- Add explanatory comment to the cpufreq core (Rafael Wysocki).

- Introduce a new flag, GENPD_FLAG_RPM_ALWAYS_ON, to the generic
power domains (genpd) framework along with the first user of it
(Leonard Crestez)"

* tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag
cpufreq: Update MAINTAINERS to include schedutil governor
cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
cpufreq: Call transition notifier only once for each policy
x86: intel_epb: Take CONFIG_PM into account

Linus Torvalds
2019-05-15 23:46:44 +0800
2a8d69f61 Merge branches 'pm-cpufreq' and 'pm-domains' ... Browse Code »

* pm-cpufreq:
cpufreq: Update MAINTAINERS to include schedutil governor
cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
cpufreq: Call transition notifier only once for each policy

* pm-domains:
soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag

Rafael J. Wysocki
2019-05-15 17:04:08 +0800
e3e28670b Merge branches 'acpi-bus', 'acpi-doc' and 'acpi-pm' ... Browse Code »

* acpi-bus:
ACPI: bus: change _ADR representation to 64 bits

* acpi-doc:
Documentation: ACPI: Direct references are allowed to devices only
Documentation: ACPI: Use tabs for graph ASL indentation

* acpi-pm:
ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle

Rafael J. Wysocki
2019-05-15 17:03:16 +0800
5ac943322 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma ... Browse Code »

Pull more rdma updates from Jason Gunthorpe:
"This is being sent to get a fix for the gcc 9.1 build warnings, and
I've also pulled in some bug fix patches that were posted in the last
two weeks.

- Avoid the gcc 9.1 warning about overflowing a union member

- Fix the wrong callback type for a single response netlink to doit

- Bug fixes from more usage of the mlx5 devx interface"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
net/mlx5: Set completion EQs as shared resources
IB/mlx5: Verify DEVX general object type correctly
RDMA/core: Change system parameters callback from dumpit to doit
RDMA: Directly cast the sockaddr union to sockaddr

Linus Torvalds
2019-05-15 11:56:31 +0800
1064d8577 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge more updates from Andrew Morton:

- a couple of hotfixes

- almost all of the rest of MM

- lib/ updates

- binfmt_elf updates

- autofs updates

- quite a lot of misc fixes and updates
- reiserfs, fatfs
- signals
- exec
- cpumask
- rapidio
- sysctl
- pids
- eventfd
- gcov
- panic
- pps

- gdb script updates

- ipc updates

* emailed patches from Andrew Morton : (126 commits)
mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
mm: memcontrol: fix recursive statistics correctness & scalabilty
mm: memcontrol: move stat/event counting functions out-of-line
mm: memcontrol: make cgroup stats and events query API explicitly local
drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
mm, memcg: rename ambiguously named memory.stat counters and functions
arch: remove and
treewide: replace #include with #include
fs/block_dev.c: Remove duplicate header
fs/cachefiles/namei.c: remove duplicate header
include/linux/sched/signal.h: replace `tsk' with `task'
fs/coda/psdev.c: remove duplicate header
ipc: do cyclic id allocation for the ipc object.
ipc: conserve sequence numbers in ipcmni_extend mode
ipc: allow boot time extension of IPCMNI from 32k to 16M
ipc/mqueue: optimize msg_get()
ipc/mqueue: remove redundant wq task assignment
ipc: prevent lockup on alloc_msg and free_msg
scripts/gdb: print cached rate in lx-clk-summary
...

Linus Torvalds
2019-05-15 11:08:51 +0800
42a300353 mm: memcontrol: fix recursive statistics correctness & scalabilty ... Browse Code »

Right now, when somebody needs to know the recursive memory statistics
and events of a cgroup subtree, they need to walk the entire subtree and
sum up the counters manually.

There are two issues with this:

1. When a cgroup gets deleted, its stats are lost. The state counters
should all be 0 at that point, of course, but the events are not.
When this happens, the event counters, which are supposed to be
monotonic, can go backwards in the parent cgroups.

2. During regular operation, we always have a certain number of lazily
freed cgroups sitting around that have been deleted, have no tasks,
but have a few cache pages remaining. These groups' statistics do not
change until we eventually hit memory pressure, but somebody
watching, say, memory.stat on an ancestor has to iterate those every
time.

This patch addresses both issues by introducing recursive counters at
each level that are propagated from the write side when stats change.

Upward propagation happens when the per-cpu caches spill over into the
local atomic counter. This is the same thing we do during charge and
uncharge, except that the latter uses atomic RMWs, which are more
expensive; stat changes happen at around the same rate. In a sparse
file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
nesting levels, perf shows __mod_memcg_page state at ~1%.

Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
db9adbcbe mm: memcontrol: move stat/event counting functions out-of-line ... Browse Code »

These are getting too big to be inlined in every callsite. They were
stolen from vmstat.c, which already out-of-lines them, and they have
only been growing since. The callsites aren't that hot, either.

Move __mod_memcg_state()
__mod_lruvec_state() and
__count_memcg_events() out of line and add kerneldoc comments.

Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
205b20cc5 mm: memcontrol: make cgroup stats and events query API explicitly local ... Browse Code »

Patch series "mm: memcontrol: memory.stat cost & correctness".

The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.

1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.

In an application that periodically monitors memory.stat in our
fleet, we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5 level cgroup nesting, the additional cost
of the flushing was negligible (a little under 1% of CPU at 100% CPU
utilization, compared to the 5% of reading memory.stat during regular
operation).

This patch (of 4):

memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.

In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.

Per default, cgroups are considered recursive entities and generally we
expect more users of the recursive counters, with the local counts being
special cases. To reflect that in the name, add a _local suffix to the
current implementations.

The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.

[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-05-15 10:52:53 +0800
871789d4a mm, memcg: rename ambiguously named memory.stat counters and functions ... Browse Code »

I spent literally an hour trying to work out why an earlier version of
my memory.events aggregation code doesn't work properly, only to find
out I was calling memcg->events instead of memcg->memory_events, which
is fairly confusing.

This naming seems in need of reworking, so make it harder to do the
wrong thing by using vmevents instead of events, which makes it more
clear that these are vm counters rather than memcg-specific counters.

There are also a few other inconsistent names in both the percpu and
aggregated structs, so these are all cleaned up to be more coherent and
easy to understand.

This commit contains code cleanup only: there are no logic changes.

[akpm@linux-foundation.org: fix it for preceding changes]
Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
Signed-off-by: Chris Down
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Dennis Zhou
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chris Down
2019-05-15 10:52:52 +0800
b09e89366 arch: remove <asm/sizes.h> and <asm-generic/sizes.h> ... Browse Code »

Now that all instances of #include have been replaced with
#include , we can remove these.

Link: http://lkml.kernel.org/r/1553267665-27228-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masahiro Yamada
2019-05-15 10:52:52 +0800
9e9291c71 include/linux/sched/signal.h: replace `tsk' with `task' ... Browse Code »

This file uses "task" 85 times and "tsk" 25 times. It is better to be
consistent.

Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com
Signed-off-by: Andrei Vagin
Reviewed-by: Andrew Morton
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrei Vagin
2019-05-15 10:52:52 +0800
3278a2c20 ipc: conserve sequence numbers in ipcmni_extend mode ... Browse Code »

Rewrite, based on the patch from Waiman Long:

The mixing in of a sequence number into the IPC IDs is probably to avoid
ID reuse in userspace as much as possible. With ipcmni_extend mode, the
number of usable sequence numbers is greatly reduced leading to higher
chance of ID reuse.

To address this issue, we need to conserve the sequence number space as
much as possible. Right now, the sequence number is incremented for
every new ID created. In reality, we only need to increment the
sequence number when new allocated ID is not greater than the last one
allocated. It is in such case that the new ID may collide with an
existing one. This is being done irrespective of the ipcmni mode.

In order to avoid any races, the index is first allocated and then the
pointer is replaced.

Changes compared to the initial patch:
- Handle failures from idr_alloc().
- Avoid that concurrent operations can see the wrong sequence number.
(This is achieved by using idr_replace()).
- IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
ipcmni_seq_shift().
- IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
Signed-off-by: Manfred Spraul
Signed-off-by: Waiman Long
Suggested-by: Matthew Wilcox
Acked-by: Waiman Long
Cc: Al Viro
Cc: Davidlohr Bueso
Cc: "Eric W . Biederman"
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: "Luis R. Rodriguez"
Cc: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Manfred Spraul
2019-05-15 10:52:52 +0800
4c69add45 pps: pps-gpio PPS ECHO implementation ... Browse Code »

This patch implements the PPS ECHO functionality for pps-gpio, that
sysfs claims is available already.

Configuration is done via device tree bindings.

No changes are made to userspace interfaces.

This patch was originally written by Lukas Senger as part of a masters
thesis project and modified for inclusion into the linux kernel by Tom
Burkart.

Link: http://lkml.kernel.org/r/20190324043305.6627-4-tom@aussec.com
Signed-off-by: Tom Burkart
Acked-by: Rodolfo Giometti
Signed-off-by: Lukas Senger
Cc: Philipp Zabel
Cc: Rob Herring
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tom Burkart
2019-05-15 10:52:51 +0800
4461d6517 pps: descriptor-based gpio ... Browse Code »

This patch changes the GPIO access for the pps-gpio driver from the
integer based API to the descriptor based API.

The integer based API is considered deprecated and the descriptor based
API is the preferred way to access GPIOs as per
Documentation/driver-api/gpio/intro.rst

No changes are made to userspace interfaces.

Link: http://lkml.kernel.org/r/20190324043305.6627-2-tom@aussec.com
Signed-off-by: Tom Burkart
Acked-by: Rodolfo Giometti
Reviewed-by: Philipp Zabel
Cc: Lukas Senger
Cc: Rob Herring
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tom Burkart
2019-05-15 10:52:51 +0800
b287a25a7 panic/reboot: allow specifying reboot_mode for panic only ... Browse Code »

Allow specifying reboot_mode for panic only. This is needed on systems
where ramoops is used to store panic logs, and user wants to use warm
reset to preserve those, while still having cold reset on normal
reboots.

Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen
Reviewed-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aaro Koskinen
2019-05-15 10:52:51 +0800
c39ea0b9d panic: avoid the extra noise dmesg ... Browse Code »

When kernel panic happens, it will first print the panic call stack,
then the ending msg like:

[ 35.743249] ---[ end Kernel panic - not syncing: Fatal exception
[ 35.749975] ------------[ cut here ]------------

The above message are very useful for debugging.

But if system is configured to not reboot on panic, say the
"panic_timeout" parameter equals 0, it will likely print out many noisy
message like WARN() call stack for each and every CPU except the panic
one, messages like below:

WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
Call Trace:

try_to_wake_up
default_wake_function
autoremove_wake_function
__wake_up_common
__wake_up_common_lock
__wake_up
wake_up_klogd_work_func
irq_work_run_list
irq_work_tick
update_process_times
tick_sched_timer
__hrtimer_run_queues
hrtimer_interrupt
smp_apic_timer_interrupt
apic_timer_interrupt

For people working in console mode, the screen will first show the panic
call stack, but immediately overridden by these noisy extra messages,
which makes debugging much more difficult, as the original context gets
lost on screen.

Also these noisy messages will confuse some users, as I have seen many bug
reporters posted the noisy message into bugzilla, instead of the real
panic call stack and context.

Adding a flag "suppress_printk" which gets set in panic() to avoid those
noisy messages, without changing current kernel behavior that both panic
blinking and sysrq magic key can work as is, suggested by Petr Mladek.

To verify this, make sure kernel is not configured to reboot on panic and
in console
# echo c > /proc/sysrq-trigger
to see if console only prints out the panic call stack.

Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang
Suggested-by: Petr Mladek
Reviewed-by: Petr Mladek
Acked-by: Steven Rostedt (VMware)
Acked-by: Sergey Senozhatsky
Cc: Thomas Gleixner
Cc: Kees Cook
Cc: Borislav Petkov
Cc: Andi Kleen
Cc: Peter Zijlstra
Cc: Greg Kroah-Hartman
Cc: Jiri Slaby
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Feng Tang
2019-05-15 10:52:51 +0800
3713a4e1f include/linux/cpumask.h: fix double string traverse in cpumask_parse ... Browse Code »

cpumask_parse() finds first occurrence of either or strchr() and
strlen(). We can do it better with a single call of strchrnul().

[akpm@linux-foundation.org: remove unneeded cast]
Link: http://lkml.kernel.org/r/20190409204208.12190-1-ynorov@marvell.com
Signed-off-by: Yury Norov
Acked-by: Rasmus Villemoes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yury Norov
2019-05-15 10:52:50 +0800
a6231d199 exec: move struct linux_binprm::buf ... Browse Code »

struct linux_binprm::buf is the first field and it is exactly 128 bytes
in size. It means that on x86_64 all accesses to other fields will go
though [r64 + disp32] addressing mode which is 3 bytes bloatier than
[r64 + disp8] addressing mode. Given that accesses to other fields
outnumber accesses to ->buf, move it down.

Space savings (x86_64 defconfig):
more on distro configs because LSMs actively dereference "bprm"
but do not care about first 128 bytes of the executable itself.

add/remove: 0/0 grow/shrink: 0/24 up/down: 0/-492 (-492)
Function old new delta
selinux_bprm_committing_creds 552 549 -3
finalize_exec 94 91 -3
__audit_log_bprm_fcaps 283 280 -3
__audit_bprm 39 36 -3
perf_trace_sched_process_exec 347 341 -6
install_exec_creds 105 99 -6
cap_bprm_set_creds.cold 60 54 -6
would_dump 137 128 -9
load_script 637 628 -9
bprm_change_interp 61 52 -9
trace_event_raw_event_sched_process_exec 260 250 -10
search_binary_handler 255 240 -15
remove_arg_zero 295 277 -18
free_bprm 119 101 -18
prepare_binprm 379 360 -19
setup_new_exec 336 315 -21
flush_old_exec 1638 1617 -21
copy_strings.isra 746 724 -22
setup_arg_pages 559 530 -29
load_misc_binary 1151 1118 -33
selinux_bprm_set_creds 792 753 -39
load_elf_binary 11111 11072 -39
cap_bprm_set_creds 1496 1454 -42
__do_execve_file.isra 2395 2286 -109

Link: http://lkml.kernel.org/r/20190421165025.GA26843@avx2
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2019-05-15 10:52:50 +0800
ef4d6f6b2 include/linux/bitops.h: sanitize rotate primitives ... Browse Code »

The ror32 implementation (word >> shift) | (word << (32 - shift) has
undefined behaviour if shift is outside the [1, 31] range. Similarly
for the 64 bit variants. Most callers pass a compile-time constant
(naturally in that range), but there's an UBSAN report that these may
actually be called with a shift count of 0.

Instead of special-casing that, we can make them DTRT for all values of
shift while also avoiding UB. For some reason, this was already partly
done for rol32 (which was well-defined for [0, 31]). gcc 8 recognizes
these patterns as rotates, so for example

__u32 rol32(__u32 word, unsigned int shift)
{
return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

compiles to

0000000000000020 :
20: 89 f8 mov %edi,%eax
22: 89 f1 mov %esi,%ecx
24: d3 c0 rol %cl,%eax
26: c3 retq

Older compilers unfortunately do not do as well, but this only affects
the small minority of users that don't pass constants.

Due to integer promotions, ro[lr]8 were already well-defined for shifts
in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16]
(only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set,
word << 16 is undefined). For consistency, update those as well.

Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes
Reported-by: Ido Schimmel
Tested-by: Ido Schimmel
Reviewed-by: Will Deacon
Cc: Vadim Pasternak
Cc: Andrey Ryabinin
Cc: Jacek Anaszewski
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2019-05-15 10:52:49 +0800
9f6158946 lib/math: move int_pow() from pwm_bl.c for wider use ... Browse Code »

The integer exponentiation is used in few places and might be used in
the future by other call sites. Move it to wider use.

Link: http://lkml.kernel.org/r/20190323172531.80025-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko
Cc: Daniel Thompson
Cc: Lee Jones
Cc: Ray Jui
Cc: Thierry Reding
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2019-05-15 10:52:49 +0800
043b3f7b6 lib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS ... Browse Code »

Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things. This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
to the less ponderous merge_final(). (It's a static helper function,
so we don't need a super-descriptive name; comments will do.)
* Document the actual return value requirements on the (*cmp)()
function; some callers are already using this feature.

x86-64 code size 1086 -> 739 bytes (-347)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))". Checkpatch is wrong.)

Feedback from Rasmus Villemoes, Andy Shevchenko and Geert Uytterhoeven.

[akpm@linux-foundation.org: remove __pure usage due to mysterious warning]
Link: http://lkml.kernel.org/r/f63c410e0ff76009c9b58e01027e751ff7fdb749.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin
Acked-by: Andrey Abramov
Acked-by: Rasmus Villemoes
Reviewed-by: Andy Shevchenko
Cc: Geert Uytterhoeven
Cc: Daniel Wagner
Cc: Dave Chinner
Cc: Don Mullis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

George Spelvin
2019-05-15 10:52:49 +0800
8e18faeac lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST ... Browse Code »

This is a lot more appropriate than PI_LIST, which in the kernel one
would assume that it has to do with priority-inheritance; which is not
-- furthermore futexes make use of plists so this can be even more
confusing, albeit the debug nature of the config option.

Link: http://lkml.kernel.org/r/20190317185434.1626-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2019-05-15 10:52:49 +0800
e02c9b0d6 kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing ... Browse Code »

The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.

Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng
Cc: Alexey Dobriyan
Cc: Fabian Frederick
Cc: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lin Feng
2019-05-15 10:52:49 +0800
9012d0116 compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING ... Browse Code »

Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.

The idea is obviously arch-agnostic. This commit moves the config entry
from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
architectures can benefit from it.

This can make a huge difference in kernel image size especially when
CONFIG_OPTIMIZE_FOR_SIZE is enabled.

For example, I got 3.5% smaller arm64 kernel for v5.1-rc1.

dec file
18983424 arch/arm64/boot/Image.before
18321920 arch/arm64/boot/Image.after

This also slightly improves the "Kernel hacking" Kconfig menu as
e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen') suggested;
this config option would be a good fit in the "compiler option" menu.

Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada
Acked-by: Borislav Petkov
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Boris Brezillon
Cc: Brian Norris
Cc: Christophe Leroy
Cc: David Woodhouse
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Marek Vasut
Cc: Mark Rutland
Cc: Mathieu Malaterre
Cc: Miquel Raynal
Cc: Paul Mackerras
Cc: Ralf Baechle
Cc: Richard Weinberger
Cc: Russell King
Cc: Stefan Agner
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masahiro Yamada
2019-05-15 10:52:48 +0800
687a3e4d8 treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers ... Browse Code »

The "WITH Linux-syscall-note" should be added to headers exported to the
user-space.

Some kernel-space headers have "WITH Linux-syscall-note", which seems a
mistake.

[1] arch/x86/include/asm/hyperv-tlfs.h

Commit 5a4858032217 ("x86/hyper-v: move hyperv.h out of uapi") moved
this file out of uapi, but missed to update the SPDX License tag.

[2] include/asm-generic/shmparam.h

Commit 76ce2a80a28e ("Rename include/{uapi => }/asm-generic/shmparam.h
really") moved this file out of uapi, but missed to update the SPDX
License tag.

[3] include/linux/qcom-geni-se.h

Commit eddac5af0654 ("soc: qcom: Add GENI based QUP Wrapper driver")
added this file, but I do not see a good reason why its license tag must
include "WITH Linux-syscall-note".

Link: http://lkml.kernel.org/r/1554196104-3522-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masahiro Yamada
2019-05-15 10:52:48 +0800
ad312f95d fs/select: avoid clang stack usage warning ... Browse Code »

The select() implementation is carefully tuned to put a sensible amount
of data on the stack for holding a copy of the user space fd_set, but
not too large to risk overflowing the kernel stack.

When building a 32-bit kernel with clang, we need a little more space
than with gcc, which often triggers a warning:

fs/select.c:619:5: error: stack frame size of 1048 bytes in function 'core_sys_select' [-Werror,-Wframe-larger-than=]
int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

I experimentally found that for 32-bit ARM, reducing the maximum stack
usage by 64 bytes keeps us reliably under the warning limit again.

Link: http://lkml.kernel.org/r/20190307090146.1874906-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Reviewed-by: Andi Kleen
Cc: Nick Desaulniers
Cc: Alexander Viro
Cc: Christoph Hellwig
Cc: Eric Dumazet
Cc: "Darrick J. Wong"
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2019-05-15 10:52:48 +0800
97500a4a5 mm: maintain randomization of page free lists ... Browse Code »

When freeing a page with an order >= shuffle_page_order randomly select
the front or back of the list for insertion.

While the mm tries to defragment physical pages into huge pages this can
tend to make the page allocator more predictable over time. Inject the
front-back randomness to preserve the initial randomness established by
shuffle_free_memory() when the kernel was booted.

The overhead of this manipulation is constrained by only being applied
for MAX_ORDER sized pages by default.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Kees Cook
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
b03641af6 mm: move buddy list manipulations into helpers ... Browse Code »

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions. Provide a common control point for injecting
additional behavior when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Signed-off-by: Vlastimil Babka
Tested-by: Tetsuo Handa
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Dave Hansen
Cc: Kees Cook
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
e900a918b mm: shuffle initial free memory to improve memory-side-cache utilization ... Browse Code »

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

It's been a problem in the HPC space:
http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

A kernel module called zonesort is available to try to help:
https://software.intel.com/en-us/articles/xeon-phi-software

and this abandoned patch series proposed that for the kernel:
https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

Dan's patch series doesn't attempt to ensure buffers won't conflict, but
also reduces the chance that the buffers will. This will make performance
more consistent, albeit slower than "optimal" (which is near impossible
to attain in a general-purpose kernel). That's better than forcing
users to deploy remedies like:
"To eliminate this gradual degradation, we have added a Stream
measurement to the Node Health Check that follows each job;
nodes are rebooted whenever their measured memory bandwidth
falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability"). With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
3X speedup in a contrived case that tries to force cache conflicts.
The contrived cased used the numa_emulation capability to force an
instance of the benchmark to be run in two of the near-memory sized
numa nodes. If both instances were placed on the same emulated they
would fit and cause zero conflicts. While on separate emulated nodes
without randomization they underutilized the cache and conflicted
unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
size that exceeded cache size by 3X. The cache conflict rate was 8%
for the first run and degraded to 21% after page allocator aging. With
randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
cache-conflict rates, but the overall throughput dropped by 7% with
randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
enabled on platforms without a memory-side-cache and saw a mix of some
improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads. For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected. Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization. Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects. The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages. Pages are freed in physical address order when first
onlined.

Quoting Kees:
"While we already have a base-address randomization
(CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
memory layouts would certainly be using the predictability of
allocation ordering (i.e. for attacks where the base address isn't
important: only the relative positions between allocated memory).
This is common in lots of heap-style attacks. They try to gain
control over ordering by spraying allocations, etc.

I'd really like to see this because it gives us something similar
to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time. Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions. It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages. At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent. As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Signed-off-by: Qian Cai
Reviewed-by: Kees Cook
Acked-by: Michal Hocko
Cc: Dave Hansen
Cc: Keith Busch
Cc: Robert Elliott
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 10:52:48 +0800
df5ba5be7 kernel/sched/psi.c: expose pressure metrics on root cgroup ... Browse Code »

Pressure metrics are already recorded and exposed in procfs for the
entire system, but any tool which monitors cgroup pressure has to
special case the root cgroup to read from procfs. This patch exposes
the already recorded pressure metrics on the root cgroup.

Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
Signed-off-by: Dan Schatzberg
Acked-by: Johannes Weiner
Cc: Tejun Heo
Cc: Li Zefan
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Schatzberg
2019-05-15 10:52:48 +0800
0e94682b7 psi: introduce psi monitor ... Browse Code »

Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users. It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window. Min window size is 500ms, therefore the min
monitoring interval is 50ms. Max window size is 10s with monitoring
interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
Signed-off-by: Suren Baghdasaryan
Signed-off-by: Johannes Weiner
Cc: Dennis Zhou
Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Li Zefan
Cc: Peter Zijlstra
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Suren Baghdasaryan
2019-05-15 10:52:48 +0800