08 Aug, 2020

40 commits

  • Pull mount leak fix from Al Viro:
    "Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: fix a struct path leak in path_umount

    Linus Torvalds
     
  • Pull PCI updates from Bjorn Helgaas:
    "Enumeration:
    - Fix pci_cfg_wait queue locking problem (Bjorn Helgaas)
    - Convert PCIe capability PCIBIOS errors to errno (Bolarinwa Olayemi
    Saheed)
    - Align PCIe capability and PCI accessor return values (Bolarinwa
    Olayemi Saheed)
    - Fix pci_create_slot() reference count leak (Qiushi Wu)
    - Announce device after early fixups (Tiezhu Yang)

    PCI device hotplug:
    - Make rpadlpar functions static (Wei Yongjun)

    Driver binding:
    - Add device even if driver attach failed (Rajat Jain)

    Virtualization:
    - xen: Remove redundant initialization of irq (Colin Ian King)

    IOMMU:
    - Add pci_pri_supported() to check device or associated PF (Ashok Raj)
    - Release IVRS table in AMD ACS quirk (Hanjun Guo)
    - Mark AMD Navi10 GPU rev 0x00 ATS as broken (Kai-Heng Feng)
    - Treat "external-facing" devices themselves as internal (Rajat Jain)

    MSI:
    - Forward MSI-X error code in pci_alloc_irq_vectors_affinity() (Piotr
    Stankiewicz)

    Error handling:
    - Clear PCIe Device Status errors only if OS owns AER (Jonathan
    Cameron)
    - Log correctable errors as warning, not error (Matt Jolly)
    - Use 'pci_channel_state_t' instead of 'enum pci_channel_state' (Luc
    Van Oostenryck)

    Peer-to-peer DMA:
    - Allow P2PDMA on AMD Zen and newer CPUs (Logan Gunthorpe)

    ASPM:
    - Add missing newline in sysfs 'policy' (Xiongfeng Wang)

    Native PCIe controllers:
    - Convert to devm_platform_ioremap_resource_byname() (Dejin Zheng)
    - Convert to devm_platform_ioremap_resource() (Dejin Zheng)
    - Remove duplicate error message from devm_pci_remap_cfg_resource()
    callers (Dejin Zheng)
    - Fix runtime PM imbalance on error (Dinghao Liu)
    - Remove dev_err() when handling an error from platform_get_irq()
    (Krzysztof Wilczyński)
    - Use pci_host_bridge.windows list directly instead of splicing in a
    temporary list for cadence, mvebu, host-common (Rob Herring)
    - Use pci_host_probe() instead of open-coding all the pieces for
    altera, brcmstb, iproc, mobiveil, rcar, rockchip, tegra, v3,
    versatile, xgene, xilinx, xilinx-nwl (Rob Herring)
    - Default host bridge parent device to the platform device (Rob
    Herring)
    - Use pci_is_root_bus() instead of tracking root bus number
    separately in aardvark, designware (imx6, keystone,
    designware-host), mobiveil, xilinx-nwl, xilinx, rockchip, rcar (Rob
    Herring)
    - Set host bridge bus number in pci_scan_root_bus_bridge() instead of
    each driver for aardvark, designware-host, host-common, mediatek,
    rcar, tegra, v3-semi (Rob Herring)
    - Move DT resource setup into devm_pci_alloc_host_bridge() (Rob
    Herring)
    - Set bridge map_irq and swizzle_irq to default functions; drivers
    that don't support legacy IRQs (iproc) need to undo this (Rob
    Herring)

    ARM Versatile PCIe controller driver:
    - Drop flag PCI_ENABLE_PROC_DOMAINS (Rob Herring)

    Cadence PCIe controller driver:
    - Use "dma-ranges" instead of "cdns,no-bar-match-nbits" property
    (Kishon Vijay Abraham I)
    - Remove "mem" from reg binding (Kishon Vijay Abraham I)
    - Fix cdns_pcie_{host|ep}_setup() error path (Kishon Vijay Abraham I)
    - Convert all r/w accessors to perform only 32-bit accesses (Kishon
    Vijay Abraham I)
    - Add support to start link and verify link status (Kishon Vijay
    Abraham I)
    - Allow pci_host_bridge to have custom pci_ops (Kishon Vijay Abraham I)
    - Add new *ops* for CPU addr fixup (Kishon Vijay Abraham I)
    - Fix updating Vendor ID and Subsystem Vendor ID register (Kishon
    Vijay Abraham I)
    - Use bridge resources for outbound window setup (Rob Herring)
    - Remove private bus number and range storage (Rob Herring)

    Cadence PCIe endpoint driver:
    - Add MSI-X support (Alan Douglas)

    HiSilicon PCIe controller driver:
    - Remove non-ECAM HiSilicon hip05/hip06 driver (Rob Herring)

    Intel VMD host bridge driver:
    - Use Shadow MEMBAR registers for QEMU/KVM guests (Jon Derrick)

    Loongson PCIe controller driver:
    - Use DECLARE_PCI_FIXUP_EARLY for bridge_class_quirk() (Tiezhu Yang)

    Marvell Aardvark PCIe controller driver:
    - Indicate error in 'val' when config read fails (Pali Rohár)
    - Don't touch PCIe registers if no card connected (Pali Rohár)

    Marvell MVEBU PCIe controller driver:
    - Setup BAR0 in order to fix MSI (Shmuel Hazan)

    Microsoft Hyper-V host bridge driver:
    - Fix a timing issue which causes kdump to fail occasionally (Wei Hu)
    - Make some functions static (Wei Yongjun)

    NVIDIA Tegra PCIe controller driver:
    - Revert tegra124 raw_violation_fixup (Nicolas Chauvet)
    - Remove PLL power supplies (Thierry Reding)

    Qualcomm PCIe controller driver:
    - Change duplicate PCI reset to phy reset (Abhishek Sahu)
    - Add missing ipq806x clocks in PCIe driver (Ansuel Smith)
    - Add missing reset for ipq806x (Ansuel Smith)
    - Add ext reset (Ansuel Smith)
    - Use bulk clk API and assert on error (Ansuel Smith)
    - Add support for tx term offset for rev 2.1.0 (Ansuel Smith)
    - Define some PARF params needed for ipq8064 SoC (Ansuel Smith)
    - Add ipq8064 rev2 variant (Ansuel Smith)
    - Support PCI speed set for ipq806x (Sham Muthayyan)

    Renesas R-Car PCIe controller driver:
    - Use devm_pci_alloc_host_bridge() (Rob Herring)
    - Use struct pci_host_bridge.windows list directly (Rob Herring)
    - Convert rcar-gen2 to use modern host bridge probe functions (Rob
    Herring)

    TI J721E PCIe driver:
    - Add TI J721E PCIe host and endpoint driver (Kishon Vijay Abraham I)

    Xilinx Versal CPM PCIe controller driver:
    - Add Versal CPM Root Port driver and YAML schema (Bharat Kumar
    Gogada)

    MicroSemi Switchtec management driver:
    - Add missing __iomem and __user tags to fix sparse warnings (Logan
    Gunthorpe)

    Miscellaneous:
    - Replace http:// links with https:// (Alexander A. Klimov)
    - Replace lkml.org, spinics, gmane with lore.kernel.org (Bjorn
    Helgaas)
    - Remove unused pci_lost_interrupt() (Heiner Kallweit)
    - Move PCI_VENDOR_ID_REDHAT definition to pci_ids.h (Huacai Chen)
    - Fix kerneldoc warnings (Krzysztof Kozlowski)"

    * tag 'pci-v5.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (113 commits)
    PCI: Fix kerneldoc warnings
    PCI: xilinx-cpm: Add Versal CPM Root Port driver
    PCI: xilinx-cpm: Add YAML schemas for Versal CPM Root Port
    PCI: Set bridge map_irq and swizzle_irq to default functions
    PCI: Move DT resource setup into devm_pci_alloc_host_bridge()
    PCI: rcar-gen2: Convert to use modern host bridge probe functions
    PCI: Remove dev_err() when handing an error from platform_get_irq()
    MAINTAINERS: Add Kishon Vijay Abraham I for TI J721E SoC PCIe
    misc: pci_endpoint_test: Add J721E in pci_device_id table
    PCI: j721e: Add TI J721E PCIe driver
    PCI: switchtec: Add missing __iomem tag to fix sparse warnings
    PCI: switchtec: Add missing __iomem and __user tags to fix sparse warnings
    PCI: rpadlpar: Make functions static
    PCI/P2PDMA: Allow P2PDMA on AMD Zen and newer CPUs
    PCI: Release IVRS table in AMD ACS quirk
    PCI: Announce device after early fixups
    PCI: Mark AMD Navi10 GPU rev 0x00 ATS as broken
    PCI: Remove unused pci_lost_interrupt()
    dt-bindings: PCI: Add EP mode dt-bindings for TI's J721E SoC
    dt-bindings: PCI: Add host mode dt-bindings for TI's J721E SoC
    ...

    Linus Torvalds
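The "Convert PCIe capability PCIBIOS errors to errno" item above can be pictured with a small Python model. This is a sketch, not kernel code: the constant values and the errno mapping are written from memory of pcibios_err_to_errno() in include/linux/pci.h and should be checked against the tree before relying on them.

```python
import errno

# PCIBIOS status codes (assumed values; verify against include/linux/pci.h).
PCIBIOS_SUCCESSFUL = 0x00
PCIBIOS_FUNC_NOT_SUPPORTED = 0x81
PCIBIOS_BAD_VENDOR_ID = 0x83
PCIBIOS_DEVICE_NOT_FOUND = 0x86
PCIBIOS_BAD_REGISTER_NUMBER = 0x87
PCIBIOS_SET_FAILED = 0x88
PCIBIOS_BUFFER_TOO_SMALL = 0x89

# Positive PCIBIOS codes mapped to negative errno values (assumed mapping).
_PCIBIOS_TO_ERRNO = {
    PCIBIOS_FUNC_NOT_SUPPORTED: -errno.ENOENT,
    PCIBIOS_BAD_VENDOR_ID: -errno.ENOTTY,
    PCIBIOS_DEVICE_NOT_FOUND: -errno.ENODEV,
    PCIBIOS_BAD_REGISTER_NUMBER: -errno.EFAULT,
    PCIBIOS_SET_FAILED: -errno.EIO,
    PCIBIOS_BUFFER_TOO_SMALL: -errno.ENOSPC,
}


def pcibios_err_to_errno(err):
    """Model of the conversion: zero and negative values pass through."""
    if err <= PCIBIOS_SUCCESSFUL:
        return err  # success, or already a negative errno
    # Unknown positive codes fall back to -ERANGE in this model.
    return _PCIBIOS_TO_ERRNO.get(err, -errno.ERANGE)
```

With this shape, a caller only needs the usual "negative means failure" check, whichever accessor produced the value.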
     
  • Pull tracing updates from Steven Rostedt:

    - The biggest news is that the tracing ring buffer can now time events
    that interrupted other ring buffer events.

    Before this change, if an interrupt came in while recording another
    event, and that interrupt also had an event, those events would all
    have the same time stamp as the event it interrupted.

    Now, with the new design, those events will have a unique time stamp
    and rightfully display the time for those events that were recorded
    while interrupting another event.

    - Bootconfig now has an "override" operator that lets users have a
    default config, but then add options to override the default.

    - A fix was made to properly filter function graph tracing to the
    ftrace PIDs. This came in at the end of the -rc cycle, and needs to
    be backported.

    - Several clean ups, performance updates, and minor fixes as well.
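
The time stamp change described in the first item can be illustrated with a toy Python model. It only mirrors the observable behaviour stated above (nested events used to share the stamp of the event they interrupted, and now get their own); the function name, the event tuples and the clock values are all hypothetical, and the real ring buffer algorithm is considerably more involved.

```python
def stamp_events(events, unique_nested_stamps):
    """Assign time stamps to events given in recording order.

    Each event is (name, depth, clock): depth 0 means the event was not
    recorded from an interrupt of another event; clock is the reading
    that was available when the event was written.
    """
    stamped = []
    outer_stamp = None
    for name, depth, clock in events:
        if depth == 0 or outer_stamp is None:
            # A non-nested event always takes the current clock reading.
            outer_stamp = clock
            stamp = clock
        elif unique_nested_stamps:
            # New behaviour: the interrupting event gets its own stamp.
            stamp = clock
        else:
            # Old behaviour: it inherits the interrupted event's stamp.
            stamp = outer_stamp
        stamped.append((name, stamp))
    return stamped
```

Feeding the same sequence through both modes shows the difference: only the nested entries change, and they change from the interrupted event's stamp to their own clock readings.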

    * tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits)
    tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
    kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
    tracing: Use trace_sched_process_free() instead of exit() for pid tracing
    bootconfig: Fix to find the initargs correctly
    Documentation: bootconfig: Add bootconfig override operator
    tools/bootconfig: Add testcases for value override operator
    lib/bootconfig: Add override operator support
    kprobes: Remove show_registers() function prototype
    tracing/uprobe: Remove dead code in trace_uprobe_register()
    kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
    ftrace: Fix ftrace_trace_task return value
    tracepoint: Use __used attribute definitions from compiler_attributes.h
    tracepoint: Mark __tracepoint_string's __used
    trace : Have tracing buffer info use kvzalloc instead of kzalloc
    tracing: Remove outdated comment in stack handling
    ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
    ftrace: Setup correct FTRACE_FL_REGS flags for module
    tracing/hwlat: Honor the tracing_cpumask
    tracing/hwlat: Drop the duplicate assignment in start_kthread()
    tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
    ...

    Linus Torvalds
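The bootconfig "override" operator from this pull can be sketched as a tiny Python interpreter for three assignment forms. The semantics assumed here follow the bootconfig documentation as I understand it: "=" defines a key and may not redefine it, ":=" replaces an existing value, and "+=" appends. The parser handles flat "key op value" lines only, and the key names in any input are hypothetical.

```python
import re

# key, operator, value; the two-character operators must be tried first.
_LINE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(:=|\+=|=)\s*(.*?)\s*$")


def apply_bootconfig(text):
    """Apply flat bootconfig-style assignments and return {key: [values]}."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError("cannot parse line: %r" % raw)
        key, op, value = match.groups()
        values = [v.strip() for v in value.split(",") if v.strip()]
        if op == "=":
            if key in config:
                raise ValueError("%s re-defined without ':='" % key)
            config[key] = values
        elif op == ":=":
            config[key] = values  # override whatever the default was
        else:  # "+="
            config.setdefault(key, []).extend(values)
    return config
```

This is what makes a "default config plus overrides" layout workable: the override lines can be appended after the defaults without editing them.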
     
  • The merge resolution in commit 25d8d4eecace left ret no longer used,
    leading to:

    arch/powerpc/kernel/ptrace/ptrace-view.c: In function ‘pkey_get’:
    arch/powerpc/kernel/ptrace/ptrace-view.c:473:6: error: unused variable ‘ret’
    473 | int ret;

    Fix it by removing ret.

    Fixes: 25d8d4eecace ("Merge tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Make sure we also put the dentry and vfsmnt in the illegal flags
    and !may_umount cases.

    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Reported-by: Vikas Kumar
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
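The leak fixed here is the familiar "early return skips the release" pattern. The Python model below is only an illustration of that pattern, not the kernel code: Path, VALID_FLAGS and both functions are hypothetical stand-ins, with a single counter playing the role of the dentry and vfsmnt references.

```python
import errno

VALID_FLAGS = 0x0F  # hypothetical mask standing in for the accepted umount flags


class Path:
    """Stand-in for a 'struct path' holding one reference."""

    def __init__(self):
        self.refs = 1

    def put(self):
        self.refs -= 1


def umount_leaky(path, flags, may_umount):
    if flags & ~VALID_FLAGS:
        return -errno.EINVAL  # early return: the reference is never dropped
    if not may_umount:
        return -errno.EPERM  # same problem on this path
    path.put()
    return 0


def umount_fixed(path, flags, may_umount):
    if flags & ~VALID_FLAGS:
        ret = -errno.EINVAL
    elif not may_umount:
        ret = -errno.EPERM
    else:
        ret = 0  # the actual unmount work would happen here
    path.put()  # single exit: the reference is dropped in every case
    return ret
```

Funnelling every outcome through one release point is the design choice that keeps the illegal-flags and !may_umount cases from being forgotten again.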
     
  • As trace_array_printk() used with non-global instances will not add noise to
    the main buffer, such calls are OK to have in the kernel (unlike trace_printk()).
    This requires the subsystem to create its own tracing instance, and
    trace_array_printk() only writes into those instances.

    Add trace_array_init_printk() to initialize the trace_printk() buffers
    without printing out the WARNING message.

    Reported-by: Sean Paul
    Reviewed-by: Sean Paul
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Pull clk updates from Stephen Boyd:
    "It looks like a smaller batch of clk updates this time around.

    In the core framework we just have some minor tweaks and a debugfs
    feature, so not much to see there. The driver updates are fairly well
    split between AT91 and Qualcomm clk support. Adding those two drivers
    together equals about 50% of the diffstat.

    Otherwise, the big amount of work this time was on supporting
    Broadcom's Raspberry Pi firmware clks.

    Highlights:

    Core:
    - Document clk_hw_round_rate() so it gets some more use
    - Remove unused __clk_get_flags()
    - Add a prepare/enable debugfs feature similar to rate setting

    New Drivers:
    - Add support for SAMA7G5 SoC clks
    - Enable CPU clks on Qualcomm IPQ6018 SoCs
    - Enable CPU clks on Qualcomm MSM8996 SoCs
    - GPU clk support for Qualcomm SM8150 and SM8250 SoCs
    - Audio clks on Qualcomm SC7180 SoCs
    - Microchip Sparx5 DPLL clk
    - Add support for the new Renesas RZ/G2H (R8A774E1) SoC

    Updates:
    - Make defines for bcm63xx-gate clks to use in DT
    - Support BCM2711 SoC firmware clks
    - Add HDMI clks for BCM2711 SoCs
    - Add RTC related clks on Ingenic SoCs
    - Support USB PHY clks on Ingenic SoCs
    - Support gate clks on BCM6318 SoCs
    - RMU and DMAC/GPIO clock support for Actions Semi S500 SoCs
    - Use poll_timeout functions in Rockchip clk driver
    - Support Rockchip rk3288w SoC variant
    - Mark mac_lbtest critical on Rockchip rk3188
    - Add CAAM clock support for i.MX vf610 driver
    - Add MU root clock support for i.MX imx8mp driver
    - Amlogic g12: add neural network accelerator clock sources
    - Amlogic meson8: remove critical flag for main PLL divider
    - Amlogic meson8: add video decoder clock gates
    - Convert one more Renesas DT binding to json-schema
    - Enhance critical clock handling on Renesas platforms to only
    consider clocks that were enabled at boot time"

    * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (79 commits)
    clk: qcom: gcc: Make disp gpll0 branch aon for sc7180/sdm845
    ipq806x: gcc: add support for child probe
    clk: qcom: msm8996: Make symbol 'cpu_msm8996_clks' static
    clk: qcom: ipq8074: Add correct index for PCIe clocks
    clk: : drop a duplicated word
    clk: renesas: cpg-mssr: Add r8a774e1 support
    dt-bindings: clock: renesas,cpg-mssr: Document r8a774e1
    clk: Drop duplicate selection in Kconfig
    clk: qcom: smd: Add support for MSM8992/4 rpm clocks
    clk: qcom: ipq8074: Add missing clocks for pcie
    dt-bindings: clock: qcom: ipq8074: Add missing bindings for PCIe
    Replace HTTP links with HTTPS ones: Common CLK framework
    clk: qcom: Add CPU clock driver for msm8996
    dt-bindings: clk: qcom: Add bindings for CPU clock for msm8996
    soc: qcom: Separate kryo l2 accessors from PMU driver
    clk: meson: meson8b: add the vclk2_en gate clock
    clk: meson: meson8b: add the vclk_en gate clock
    clk: qcom: Fix return value check in apss_ipq6018_probe()
    clk: bcm: dvp: Add missing module informations
    clk: meson: meson8b: Drop CLK_IS_CRITICAL from fclk_div2
    ...

    Linus Torvalds
     
  • Pull fdpic coredump update from Al Viro:
    "Switches fdpic coredumps away from original aout dumping primitives to
    the same kind of regset use as regular elf coredumps do"

    * 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [elf-fdpic] switch coredump to regsets
    [elf-fdpic] use elf_dump_thread_status() for the dumper thread as well
    [elf-fdpic] move allocation of elf_thread_status into elf_dump_thread_status()
    [elf-fdpic] coredump: don't bother with cyclic list for per-thread objects
    kill elf_fpxregs_t
    take fdpic-related parts of elf_prstatus out
    unexport linux/elfcore.h

    Linus Torvalds
     
  • …ux/kernel/git/kees/linux

    Pull sysfs module section fix from Kees Cook:
    "Fix sysfs module section output overflow.

    About a month after my kallsyms_show_value() refactoring landed, 0day
    noticed that there was a path through the kernfs binattr read handlers
    that did not have PAGE_SIZEd buffers, and the module "sections" read
    handler made a bad assumption about this, resulting in it stomping on
    memory when reached through small-sized splice() calls.

    I've added a set of tests to find these kinds of regressions more
    quickly in the future as well"

    Selftests-acked-by: Shuah Khan <skhan@linuxfoundation.org>

    * tag 'kallsyms_show_value-fix-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    selftests: splice: Check behavior of full and short splices
    module: Correctly truncate sysfs sections output

    Linus Torvalds
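The overflow described in this pull comes from a read handler that formats its output as if the destination were always page-sized. A rough Python model of the before and after follows. The function names and the output format are hypothetical, and Python raises an exception where the C code would have written past the end of the buffer.

```python
def _format_address(address):
    # Hypothetical output format: a hex address followed by a newline.
    return ("0x%016x\n" % address).encode("ascii")


def show_assuming_page_size(address, buf):
    """Buggy shape: writes the whole string whatever the buffer size."""
    text = _format_address(address)
    if len(text) > len(buf):
        # In C this would stomp on memory beyond the buffer.
        raise OverflowError("output larger than destination buffer")
    buf[:len(text)] = text
    return len(text)


def show_truncating(address, buf):
    """Fixed shape: never copies more than the caller's buffer can hold."""
    text = _format_address(address)
    n = min(len(text), len(buf))
    buf[:n] = text[:n]
    return n
```

A short read then returns a truncated prefix instead of corrupting memory, which is the behaviour the new splice selftests are described as checking.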
     
  • Pull seccomp fix from Kees Cook:
    "This fixes my typo in the SCM_RIGHTS refactoring that broke compat
    handling.

    Thanks to Thadeu Lima de Souza Cascardo for tracking it down, and to
    Christian Zigotzky and Alex Xu for their reports"

    * tag 'seccomp-v5.9-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    net/scm: Fix typo in SCM_RIGHTS compat refactoring

    Linus Torvalds
     
  • Pull more power management updates from Rafael Wysocki:
    "These are mostly ARM cpufreq driver updates plus a cpufreq core
    cleanup, an ARM-wide change to make schedutil the default scaling
    governor, an intel_pstate driver fix and some runtime PM changes
    regarding kerneldoc comments.

    Specifics:

    - Add adaptive voltage scaling (AVS) support to the brcmstb cpufreq
    driver and clean it up (Florian Fainelli, Markus Mayer).

    - Add a new Tegra cpufreq driver and clean up the existing one (Jon
    Hunter, Sumit Gupta).

    - Add bandwidth level support to the Qcom cpufreq driver along with
    OPP changes (Sibi Sankar).

    - Clean up the sti, cpufreq-dt, ap806, CPPC cpufreq drivers (Viresh
    Kumar, Lee Jones, Ivan Kokshaysky, Sven Auhagen, Xin Hao).

    - Make schedutil the default governor for ARM (Valentin Schneider).

    - Fix dependency issues for the imx cpufreq driver (Walter Lozano).

    - Clean up cached_resolved_idx handling in the cpufreq core (Viresh
    Kumar).

    - Fix the intel_pstate driver to use the correct maximum frequency
    value when MSR_TURBO_RATIO_LIMIT is 0 (Srinivas Pandruvada).

    - Provide kerneldoc comments for multiple runtime PM helpers and
    improve the pm_runtime_get_if_active() kerneldoc (Rafael Wysocki)"

    * tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (22 commits)
    cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0
    PM: runtime: Improve kerneldoc of pm_runtime_get_if_active()
    PM: runtime: Add kerneldoc comments to multiple helpers
    cpufreq: make schedutil the default for arm and arm64
    cpufreq: cached_resolved_idx can not be negative
    cpufreq: Add Tegra194 cpufreq driver
    dt-bindings: arm: Add NVIDIA Tegra194 CPU Complex binding
    cpufreq: imx: Select NVMEM_IMX_OCOTP
    cpufreq: sti-cpufreq: Fix some formatting and misspelling issues
    cpufreq: tegra186: Simplify probe return path
    cpufreq: CPPC: Reuse caps variable in few routines
    cpufreq: ap806: fix cpufreq driver needs ap cpu clk
    cpufreq: cppc: Reorder code and remove apply_hisi_workaround variable
    cpufreq: dt: fix oops on armada37xx
    cpufreq: brcmstb-avs-cpufreq: send S2_ENTER / S2_EXIT commands to AVS
    cpufreq: brcmstb-avs-cpufreq: Support polling AVS firmware
    cpufreq: brcmstb-avs-cpufreq: more flexible interface for __issue_avs_command()
    cpufreq: qcom: Disable fast switch when scaling DDR/L3
    cpufreq: qcom: Update the bandwidth levels on frequency change
    OPP: Add and export helper to set bandwidth
    ...

    Linus Torvalds
     
  • …device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - DM multipath locking fixes around m->flags tests and improvements to
    bio-based code so that it follows patterns established by
    request-based code.

    - Request-based DM core improvement to eliminate unnecessary call to
    blk_mq_queue_stopped().

    - Add "panic_on_corruption" error handling mode to DM verity target.

    - DM bufio fix to perform buffer cleanup from a workqueue rather
    than wait for IO in reclaim context from shrinker.

    - DM crypt improvement to optionally avoid async processing via
    workqueues for reads and/or writes -- via "no_read_workqueue" and
    "no_write_workqueue" features. This more direct IO processing
    improves latency and throughput with faster storage. Avoiding
    workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a
    requirement for adding zoned block device support to DM crypt.

    - Add zoned block device support to DM crypt. Makes use of
    DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature
    (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for
    encryption to complete. This allows write ordering to be preserved,
    which is needed for zoned block devices.

    - Fix DM ebs target's check for REQ_OP_FLUSH.

    - Fix DM core's report zones support to not report more zones than were
    requested.

    - A few small compiler warning fixes.

    - DM dust improvements to return output directly to the user rather
    than require they scrape the system log for output.

    * tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm: don't call report zones for more than the user requested
    dm ebs: Fix incorrect checking for REQ_OP_FLUSH
    dm init: Set file local variable static
    dm ioctl: Fix compilation warning
    dm raid: Remove empty if statement
    dm verity: Fix compilation warning
    dm crypt: Enable zoned block device support
    dm crypt: add flags to optionally bypass kcryptd workqueues
    dm bufio: do buffer cleanup from a workqueue
    dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
    dm dust: add interface to list all badblocks
    dm dust: report some message results directly back to user
    dm verity: add "panic_on_corruption" error handling mode
    dm mpath: use double checked locking in fast path
    dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
    dm mpath: rework __map_bio()
    dm mpath: factor out multipath_queue_bio
    dm mpath: push locking down to must_push_back_rq()
    dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
    dm mpath: changes from initial m->flags locking audit

    Linus Torvalds
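As an illustration of how the new DM crypt features would be requested, here is a small Python helper that assembles a dm-crypt table line. The field order is assumed from the dm-crypt documentation (start, length, "crypt", cipher, key, IV offset, device, offset, then an optional count followed by that many feature arguments). The feature names are the ones quoted in the pull message, while the device path and the key in any call are placeholders.

```python
def crypt_table_line(start, length, cipher, key, iv_offset, device, offset,
                     features=()):
    """Build a dm-crypt table line, appending optional feature arguments."""
    fields = [start, length, "crypt", cipher, key, iv_offset, device, offset]
    if features:
        # Optional parameters are preceded by their count.
        fields.append(len(features))
        fields.extend(features)
    return " ".join(str(field) for field in fields)
```

For example, passing features=("no_read_workqueue", "no_write_workqueue") yields a line ending in "2 no_read_workqueue no_write_workqueue"; such a line is what would be handed to the device-mapper tooling when creating the mapping.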
     
  • Pull media updates from Mauro Carvalho Chehab:

    - Legacy soc_camera driver was removed from staging

    - New I2C sensor related drivers: dw9768, ch7322, max9271, rdacm20

    - TI vpe driver code was re-organized and had new features added

    - Added Xilinx MIPI CSI-2 Rx Subsystem driver

    - Added support for Infrared Toy and IR Droid devices

    - Lots of random driver fixes, new features and cleanups

    * tag 'media/v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (318 commits)
    media: camss: fix memory leaks on error handling paths in probe
    media: davinci: vpif_capture: fix potential double free
    media: radio: remove redundant assignment to variable retval
    media: allegro: fix potential null dereference on header
    media: mtk-mdp: Fix a refcounting bug on error in init
    media: allegro: fix an error pointer vs NULL check
    media: meye: fix missing pm_mchip_mode field
    media: cafe-driver: use generic power management
    media: saa7164: use generic power management
    media: v4l2-dev/ioctl: Fix document for VIDIOC_QUERYCAP
    media: v4l2: Correct kernel-doc inconsistency
    media: v4l2: Correct kernel-doc inconsistency
    media: dvbdev.h: keep * together with the type
    media: v4l2-subdev.h: keep * together with the type
    media: videobuf2: Print videobuf2 buffer state by name
    media: colorspaces-details.rst: fix V4L2_COLORSPACE_JPEG description
    media: tw68: use generic power management
    media: meye: use generic power management
    media: cx88: use generic power management
    media: cx25821: use generic power management
    ...

    Linus Torvalds
     
  • Pull mailbox updates from Jassi Brar:
    "mediatek:
    - add support for mt6779 gce
    - shutdown cleanup and address shift support

    qcom:
    - add msm8994 apcs and sdm660 hmss compatibility

    imx:
    - mark PM funcs __maybe_unused

    pcc:
    - put acpi table before bailout

    misc:
    - replace http with https links"

    * tag 'mailbox-v5.9' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
    mailbox: mediatek: cmdq: clear task in channel before shutdown
    mailbox: cmdq: support mt6779 gce platform definition
    mailbox: cmdq: variablize address shift in platform
    dt-binding: gce: add gce header file for mt6779
    mailbox: qcom: Add msm8994 apcs compatible
    mailbox: qcom: Add sdm660 hmss compatible
    mailbox: imx: Mark PM functions as __maybe_unused
    mailbox: pcc: Put the PCCT table for error path
    mailbox: Replace HTTP links with HTTPS ones

    Linus Torvalds
     
  • When refactoring the SCM_RIGHTS code, I accidentally mis-merged my
    native/compat diffs, which entirely broke using SCM_RIGHTS in compat
    mode. Use the correct helper.

    Reported-by: Christian Zigotzky
    Link: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216156.html
    Reported-by: "Alex Xu (Hello71)"
    Link: https://lore.kernel.org/lkml/1596812929.lz7fuo8r2w.none@localhost/
    Suggested-by: Thadeu Lima de Souza Cascardo
    Fixes: c0029de50982 ("net/scm: Regularize compat handling of scm_detach_fds()")
    Tested-by: Alex Xu (Hello71)
    Acked-by: Thadeu Lima de Souza Cascardo
    Signed-off-by: Kees Cook

    Kees Cook
     
  • Pull dmaengine updates from Vinod Koul:
    "Core:
    - Support out of order dma completion
    - Support for repeating transaction

    New controllers:
    - Support for Actions S700 DMA engine
    - Renesas R8A774E1, r8a7742 controller binding
    - New driver for Xilinx DPDMA controller

    Other:
    - Support of out of order dma completion in idxd driver
    - W=1 warning cleanup of subsystem
    - Updates to ti-k3-dma, dw, idxd drivers"

    * tag 'dmaengine-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (68 commits)
    dmaengine: dw: Don't include unneeded header to platform data header
    dmaengine: Actions: Add support for S700 DMA engine
    dmaengine: Actions: get rid of bit fields from dma descriptor
    dt-bindings: dmaengine: convert Actions Semi Owl SoCs bindings to yaml
    dmaengine: idxd: add missing invalid flags field to completion
    dmaengine: dw: Initialize max_sg_burst capability
    dmaengine: dw: Introduce max burst length hw config
    dmaengine: dw: Initialize min and max burst DMA device capability
    dmaengine: dw: Set DMA device max segment size parameter
    dmaengine: dw: Take HC_LLP flag into account for noLLP auto-config
    dmaengine: Introduce DMA-device device_caps callback
    dmaengine: Introduce max SG burst capability
    dmaengine: Introduce min burst length capability
    dt-bindings: dma: dw: Add max burst transaction length property
    dt-bindings: dma: dw: Convert DW DMAC to DT binding
    dmaengine: ti: k3-udma: Query throughput level information from hardware
    dmaengine: ti: k3-udma: Use defines for capabilities register parsing
    dmaengine: xilinx: dpdma: Fix kerneldoc warning
    dmaengine: xilinx: dpdma: add missing kernel doc
    dmaengine: xilinx: dpdma: remove comparison of unsigned expression
    ...

    Linus Torvalds
     
  • Merge misc updates from Andrew Morton:

    - a few MM hotfixes

    - kthread, tools, scripts, ntfs and ocfs2

    - some of MM

    Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
    ocfs2 and mm (hotfixes, pagealloc, slab-generic, slab, slub, kcsan,
    debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
    sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).

    * emailed patches from Andrew Morton: (162 commits)
    mm: vmscan: consistent update to pgrefill
    mm/vmscan.c: fix typo
    khugepaged: khugepaged_test_exit() check mmget_still_valid()
    khugepaged: retract_page_tables() remember to test exit
    khugepaged: collapse_pte_mapped_thp() protect the pmd lock
    khugepaged: collapse_pte_mapped_thp() flush the right range
    mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
    mm: thp: replace HTTP links with HTTPS ones
    mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
    mm/page_alloc.c: skip setting nodemask when we are in interrupt
    mm/page_alloc: fallbacks at most has 3 elements
    mm/page_alloc: silence a KASAN false positive
    mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
    mm/page_alloc.c: simplify pageblock bitmap access
    mm/page_alloc.c: extract the common part in pfn_to_bitidx()
    mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
    mm/shuffle: remove dynamic reconfiguration
    mm/memory_hotplug: document why shuffle_zone() is relevant
    mm/page_alloc: remove nr_free_pagecache_pages()
    mm: remove vm_total_pages
    ...

    Linus Torvalds
     
  • The vmstat pgrefill counter is useful together with the pgscan and pgsteal
    stats for measuring reclaim efficiency. However, vmstat's pgrefill is not
    updated consistently at the system level: it is updated for both global and
    memcg reclaim, whereas pgscan and pgsteal are updated only for global reclaim.
    So, update pgrefill only for global reclaim. Anyone interested in stats
    covering both system-level and memcg-level reclaim should consult the root
    memcg's memory.stat instead of /proc/vmstat.

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Acked-by: Yafang Shao
    Acked-by: Roman Gushchin
    Acked-by: Chris Down
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200711011459.1159929-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
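To show how these counters are typically combined, the Python sketch below parses /proc/vmstat-style text and computes reclaim efficiency as pages stolen over pages scanned. The counter names used (pgscan_kswapd, pgscan_direct, pgsteal_kswapd, pgsteal_direct) are assumed to match this kernel generation, and any numbers passed in by a caller are the caller's own; nothing here is taken from the patch itself.

```python
def parse_vmstat(text):
    """Parse 'name value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def reclaim_efficiency(stats):
    """Return pgsteal / pgscan over kswapd and direct reclaim, or None."""
    scanned = stats.get("pgscan_kswapd", 0) + stats.get("pgscan_direct", 0)
    stolen = stats.get("pgsteal_kswapd", 0) + stats.get("pgsteal_direct", 0)
    if scanned == 0:
        return None  # nothing scanned yet, so the ratio is undefined
    return stolen / scanned
```

Comparing this ratio with pgrefill is only meaningful when all of the counters cover the same kind of reclaim, which is the inconsistency the patch removes.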
     
  • Change "optizimation" to "optimization".

    Signed-off-by: dylan-meiners
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200609185144.10049-1-spacct.spacct@gmail.com
    Signed-off-by: Linus Torvalds

    dylan-meiners
     
  • Move collapse_huge_page()'s mmget_still_valid() check into
    khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
    only, and earned its mmget_still_valid() check because it inserts a huge
    pmd entry in place of the page table's pmd entry; whereas
    collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
    merely clears the page table's pmd entry. But core dumping without mmap
    lock must have been as open to mistaking a racily cleared pmd entry for a
    page table at physical page 0, as exit_mmap() was. And we certainly have
    no interest in mapping as a THP once dumping core.

    Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Song Liu
    Cc: Mike Kravetz
    Cc: Kirill A. Shutemov
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Only once have I seen this scenario (and forgot even to notice what forced
    the eventual crash): a sequence of "BUG: Bad page map" alerts from
    vm_normal_page(), from zap_pte_range() servicing exit_mmap();
    pmd:00000000, pte values corresponding to data in physical page 0.

    The pte mappings being zapped in this case were supposed to be from a huge
    page of ext4 text (but could as well have been shmem): my belief is that
    it was racing with collapse_file()'s retract_page_tables(), found *pmd
    pointing to a page table, locked it, but *pmd had become 0 by the time
    start_pte was decided.

    In most cases, that possibility is excluded by holding mmap lock; but
    exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
    checks khugepaged_test_exit() after acquiring mmap lock:
    khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
    for example. But retract_page_tables() did not: fix that.

    The fix is for retract_page_tables() to check khugepaged_test_exit(),
    after acquiring mmap lock, before doing anything to the page table.
    Getting the mmap lock serializes with __mmput(), which briefly takes and
    drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
    mm_users makes sure we don't touch the page table once exit_mmap() might
    reach it, since exit_mmap() will be proceeding without mmap lock, not
    expecting anyone to be racing with it.

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When retract_page_tables() removes a page table to make way for a huge
    pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
    pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the
    case when the original mmap_write_trylock had failed), only
    mmap_write_trylock and pmd lock are held.

    That's not enough. One machine has twice crashed under load, with "BUG:
    spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second
    crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving
    page_referenced() on a file THP, that had found a page table at *pmd)
    discovers that the page table page and its lock have already been freed by
    the time it comes to unlock.

    Follow the example of retract_page_tables(), but we only need one of huge
    page lock or i_mmap_lock_write to secure against this: because it's the
    narrower lock, and because it simplifies collapse_pte_mapped_thp() to know
    the hpage earlier, choose to rely on huge page lock here.

    Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • pmdp_collapse_flush() should be given the start address at which the huge
    page is mapped, haddr: it was given addr, which at that point has been
    used as a local variable, incremented to the end address of the extent.
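
    The mix-up between haddr and addr can be illustrated with a minimal
    userspace sketch. The constants (4 KiB pages, 2 MiB huge pages) and the
    helper name walk_extent_model() are assumptions for illustration only,
    not the kernel's code.

```c
#include <assert.h>

/* Assumed x86-64-like sizes; not taken from the commit. */
#define PAGE_SIZE_MODEL  4096ULL
#define HPAGE_SIZE_MODEL (2ULL * 1024 * 1024)
#define HPAGE_MASK_MODEL (~(HPAGE_SIZE_MODEL - 1))
#define HPAGE_NR_MODEL   (HPAGE_SIZE_MODEL / PAGE_SIZE_MODEL)

/*
 * Hypothetical stand-in for walking every pte of one huge-page extent.
 * Stores the extent start in *haddr_out and returns the final value of
 * the cursor variable "addr".
 */
static unsigned long long walk_extent_model(unsigned long long addr,
					    unsigned long long *haddr_out)
{
	unsigned long long haddr = addr & HPAGE_MASK_MODEL; /* extent start */
	unsigned long long i;

	assert(haddr_out);

	addr = haddr;
	for (i = 0; i < HPAGE_NR_MODEL; i++)
		addr += PAGE_SIZE_MODEL;   /* addr is reused as the cursor */

	*haddr_out = haddr;
	return addr;                       /* now the END of the extent */
}
```

    After the walk, the cursor equals haddr plus the huge page size, so a
    flush that is meant to cover the extent has to be given haddr.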

    Found by source inspection while chasing a hugepage locking bug, which I
    then could not explain by this. At first I thought this was very bad;
    then saw that all of the page translations that were not flushed would
    actually still point to the right pages afterwards, so harmless; then
    realized that I know nothing of how different architectures and models
    cache intermediate paging structures, so maybe it matters after all -
    particularly since the page table concerned is immediately freed.

    Much easier to fix than to think about.

    Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This is found by code observation only.

    Firstly, the worst case scenario should assume the whole range was covered
    by pmd sharing. The old algorithm might not work as expected for ranges
    like (1g-2m, 1g+2m), where the old code adjusts the range to (0, 1g+2m)
    but the expected range is (0, 2g).

    While at it, remove the loop, since it should not be required. With that,
    the new code should also be faster when the invalidated range is huge.

    Mike said:

    : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
    : adjust to (0, 1g+2m) which is incorrect.
    :
    : We should cc stable. The original reason for adjusting the range was to
    : prevent data corruption (getting wrong page). Since the range is not
    : always adjusted correctly, the potential for corruption still exists.
    :
    : However, I am fairly confident that adjust_range_if_pmd_sharing_possible
    : is only going to be called in two cases:
    :
    : 1) for a single page
    : 2) for range == entire vma
    :
    : In those cases, the current code should produce the correct results.
    :
    : To be safe, let's just cc stable.
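
    Mike's example can be worked through with a minimal userspace sketch of
    the worst-case adjustment: align the start down and the end up to the
    sharing granule, then clamp to the vma. The 1 GiB granule and all names
    here are assumptions for illustration, not the kernel's code.

```c
#include <assert.h>

/* Assumption: 1 GiB pmd-sharing granule, as on x86-64. */
#define PUD_SIZE_MODEL (1ULL << 30)

struct range_model {
	unsigned long long start;
	unsigned long long end;
};

/* Hypothetical helper: widen r to the worst case, then clamp to the vma. */
static struct range_model adjust_range_model(struct range_model vma,
					     struct range_model r)
{
	struct range_model out;
	unsigned long long a_start, a_end;

	assert(r.start < r.end && vma.start <= r.start && r.end <= vma.end);

	a_start = r.start & ~(PUD_SIZE_MODEL - 1);                    /* align down */
	a_end = (r.end + PUD_SIZE_MODEL - 1) & ~(PUD_SIZE_MODEL - 1); /* align up */

	/* pmd sharing never crosses a vma, so intersect with the vma. */
	out.start = a_start > vma.start ? a_start : vma.start;
	out.end = a_end < vma.end ? a_end : vma.end;
	return out;
}
```

    For the range (1g-2m, 1g+2m) inside a vma (0, 2g), this sketch yields
    (0, 2g), with no loop involved.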

    Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages")
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Cc:
    Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Rationale:
    Reduces attack surface on kernel devs opening the links for MITM
    as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
    For each file:
      If not .svg:
        For each line:
          If doesn't contain `xmlns`:
            For each link, `http://[^# ]*(?:\w|/)`:
              If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
                If both the HTTP and HTTPS versions
                return 200 OK and serve the same content:
                  Replace HTTP with HTTPS.

    [akpm@linux-foundation.org: fix amd.com URL, per Vlastimil]

    Signed-off-by: Alexander A. Klimov
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200713164345.36088-1-grandmaster@al2klimov.de
    Signed-off-by: Linus Torvalds

    Alexander A. Klimov
     
  • Currently, the memalloc_nocma_{save/restore} API, which keeps page
    allocation out of the CMA area, is implemented using
    current_gfp_context(). However, this implementation has two problems.

    First, it doesn't work for the allocation fastpath. The fastpath uses
    the original gfp_mask, because current_gfp_context() was introduced to
    control reclaim and is only applied on the slowpath. So pages from the
    CMA area can still be allocated through the fastpath even if the
    memalloc_nocma_{save/restore} APIs are used. Currently, there is just
    one user of these APIs, and it has a fallback method that prevents an
    actual problem.

    Second, clearing __GFP_MOVABLE in current_gfp_context() has the side
    effect of excluding ZONE_MOVABLE memory from the allocation targets.

    To fix these problems, this patch changes how the CMA area is excluded
    from page allocation. The main point of the change is to use
    alloc_flags. alloc_flags is mainly used to control allocation, so it is
    a good fit for excluding the CMA area.

    Fixes: d7fefcc8de91 ("mm/cma: add PF flag to force non cma alloc")
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Interrupt context is unrelated to the current task context. If we use
    the current task's mems_allowed there, the fast path may fail to
    allocate pages and fall back to slow path memory allocation when the
    current node (which is in the current task's mems_allowed) does not have
    enough memory. This slows down memory allocation in interrupt context.
    So skip setting the nodemask in interrupt context, which allows any node
    to satisfy the allocation, so that fast path allocation can succeed.

    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Cc: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • MIGRATE_TYPES is used as the end marker, and each one-dimensional array
    holds at most 3 elements.

    Reduce the size to 3 to save a little memory.
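
    The end-marker idea can be sketched in userspace as follows. The enum,
    the table contents and the names are hypothetical stand-ins chosen for
    illustration; the message itself does not spell out the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: three real types plus the count used as end marker. */
enum mt_model { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_TYPES };

/* Each row: up to two fallback types, terminated by MT_TYPES. */
static const int fallbacks_model[MT_TYPES][3] = {
	[MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE,   MT_TYPES },
	[MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE, MT_TYPES },
	[MT_RECLAIMABLE] = { MT_UNMOVABLE,   MT_MOVABLE,   MT_TYPES },
};

/* Count the fallback entries of one row by scanning up to the end marker. */
static size_t count_fallbacks_model(int mt)
{
	size_t n = 0;

	assert(mt >= 0 && mt < MT_TYPES);
	while (n < 3 && fallbacks_model[mt][n] != MT_TYPES)
		n++;
	return n;
}
```

    Since every row fits in three slots including the marker, a fourth
    column would only hold unused storage.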

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200625231022.18784-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • kernel_init_free_pages() will use memset() on s390 to clear all pages from
    kmalloc_order(), which will overwrite KASAN redzones, because a redzone
    was set up from the end of the allocation size to the end of the last
    page. Silence the report there. An example of the report is:

    BUG: KASAN: slab-out-of-bounds in __free_pages_ok
    Write of size 4096 at addr 000000014beaa000
    Call Trace:
    show_stack+0x152/0x210
    dump_stack+0x1f8/0x248
    print_address_description.isra.13+0x5e/0x4d0
    kasan_report+0x130/0x178
    check_memory_region+0x190/0x218
    memset+0x34/0x60
    __free_pages_ok+0x894/0x12f0
    kfree+0x4f2/0x5e0
    unpack_to_rootfs+0x60e/0x650
    populate_rootfs+0x56/0x358
    do_one_initcall+0x1f4/0xa20
    kernel_init_freeable+0x758/0x7e8
    kernel_init+0x1c/0x170
    ret_from_fork+0x24/0x28
    Memory state around the buggy address:
    000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    ^
    000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe

    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Tested-by: Vasily Gorbik
    Acked-by: Vasily Gorbik
    Cc: Dmitry Vyukov
    Cc: Christian Borntraeger
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Heiko Carstens
    Link: http://lkml.kernel.org/r/20200610052154.5180-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • After the previous cleanup, end_bitidx is no longer necessary.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Since commit e58469bafd05 ("mm: page_alloc: use word-based accesses for
    get/set pageblock bitmaps"), the pageblock bitmap is accessed with
    word-based accesses. This operation could be simplified a little.

    Intuitively, if we want to get a bit range [start_idx, end_idx] in a word,
    we can do it like this:

    mask = (1 << (end_idx - start_idx + 1)) - 1;
    ret = (word >> start_idx) & mask;

    Likewise, if we want to set a bit range [start_idx, end_idx] with flags,
    we can do the same by just shifting by start_idx.

    By doing so we reduce some instructions for these two helper functions:

                                Before   Patched
    set_pfnblock_flags_mask     209      198(-5%)
    get_pfnblock_flags_mask     101      87(-13%)

    Since the syntax is changed a little, we need to check the whole 4-bit
    migrate_type instead of part of it.
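
    The get and set operations described above can be sketched as a
    self-contained userspace example. The helper names are hypothetical, and
    the full-word special case is an addition of this sketch to avoid an
    undefined shift; it is not taken from the patch.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD_MODEL ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Mask with (end_idx - start_idx + 1) low bits set. */
static unsigned long range_mask_model(unsigned int start_idx,
				      unsigned int end_idx)
{
	unsigned int width;

	assert(start_idx <= end_idx && end_idx < BITS_PER_WORD_MODEL);
	width = end_idx - start_idx + 1;

	/* Shifting by the full word width is undefined, so special-case it. */
	return width == BITS_PER_WORD_MODEL ? ~0UL : (1UL << width) - 1;
}

/* Get the bit range [start_idx, end_idx] of word. */
static unsigned long get_bits_model(unsigned long word,
				    unsigned int start_idx,
				    unsigned int end_idx)
{
	return (word >> start_idx) & range_mask_model(start_idx, end_idx);
}

/* Set the bit range [start_idx, end_idx] of word to flags. */
static unsigned long set_bits_model(unsigned long word, unsigned long flags,
				    unsigned int start_idx,
				    unsigned int end_idx)
{
	unsigned long mask = range_mask_model(start_idx, end_idx) << start_idx;

	return (word & ~mask) | ((flags << start_idx) & mask);
}
```

    Both helpers need only the start index as a shift amount; the end index
    contributes solely to the width of the mask.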

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200623124201.8199-3-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The return value calculation is the same whether or not SPARSEMEM is
    enabled.

    Just move it out of the conditional.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • We already have the definition of PB_migratetype_bits, and the current
    NR_MIGRATETYPE_BITS looks like a cyclic definition.

    Using PB_migratetype_bits alone is enough.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200623124201.8199-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Commit e900a918b098 ("mm: shuffle initial free memory to improve
    memory-side-cache utilization") promised "autodetection of a
    memory-side-cache (to be added in a follow-on patch)" over a year ago.

    The original series included patches [1], however, they were dropped
    during review [2] to be followed-up later.

    Due to lack of platforms that publish an HMAT, autodetection is currently
    not implemented. However, manual activation is actively used [3]. Let's
    simplify for now and re-add when really (ever?) needed.

    [1] https://lkml.kernel.org/r/154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com
    [2] https://lkml.kernel.org/r/154690326478.676627.103843791978176914.stgit@dwillia2-desk3.amr.corp.intel.com
    [3] https://lkml.kernel.org/r/CAPcyv4irwGUU2x+c6b4L=KbB1dnasNKaaZd6oSpYjL9kfsnROQ@mail.gmail.com

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Acked-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Huang Ying
    Cc: Wei Yang
    Cc: Mel Gorman
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200624094741.9918-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • It's not completely obvious why we have to shuffle the complete zone -
    introduced in commit e900a918b098 ("mm: shuffle initial free memory to
    improve memory-side-cache utilization") - because some sort of shuffling
    is already performed when onlining pages via __free_one_page(), placing
    MAX_ORDER-1 pages either to the head or the tail of the freelist. Let's
    document why we have to shuffle the complete zone when exposing larger,
    contiguous physical memory areas to the buddy.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Dan Williams
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200624094741.9918-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
    the name does not really help to understand what's going on. Let's
    open-code it instead and add a comment.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Huang Ying
    Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The global variable "vm_total_pages" is a relic from older days. There is
    only a single user that reads the variable - build_all_zonelists() - and
    the first thing it does is update it.

    Use a local variable in build_all_zonelists() instead and remove the
    global variable.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Huang Ying
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • When boosting is enabled, the rate of atomic order-0 allocation failures
    is observed to be high, because free levels in the system are checked
    with the ->watermark_boost offset. This is not a problem for sleepable
    allocations, but for atomic allocations it looks like a regression.

    This problem is seen frequently on an Android kernel running on
    Snapdragon hardware with 4GB of RAM. When no extfrag event has occurred
    in the system, the ->watermark_boost factor is zero, so the watermark
    configuration in the system is:

    _watermark = (
    [WMARK_MIN] = 1272, --> ~5MB
    [WMARK_LOW] = 9067, --> ~36MB
    [WMARK_HIGH] = 9385), --> ~38MB
    watermark_boost = 0

    After launching some memory-hungry applications in Android, which can
    cause extfrag events in the system, ->watermark_boost can be set to its
    maximum, i.e. the default boost factor raises it to 150% of the high
    watermark:

    _watermark = (
    [WMARK_MIN] = 1272, --> ~5MB
    [WMARK_LOW] = 9067, --> ~36MB
    [WMARK_HIGH] = 9385), --> ~38MB
    watermark_boost = 14077, -->~57MB

    With the default system configuration, ~2MB of free memory suffices for
    an atomic order-0 allocation to succeed. But boosting raises min_wmark
    to ~61MB, so the system needs at least ~23MB of free memory for an
    atomic order-0 allocation to succeed (from the calculations in
    zone_watermark_ok(), min = 3/4(min/2)). Yet failures are observed even
    though the system has ~20MB of free memory. In testing, this is
    reproducible as early as the first 300 secs after boot and with further
    low-RAM configurations(watermark_boost in
    watermark calculations for atomic order-0 allocations.
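
    The figures in this message can be checked with a small sketch. It
    assumes 4 KiB pages (consistent with 1272 pages being annotated as ~5MB)
    and assumes that "min = 3/4(min/2)" is applied as two successive integer
    reductions; the function names are hypothetical.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. */
static long pages_to_kb_model(long pages)
{
	assert(pages >= 0);
	return pages * 4;
}

/* Maximum boost: 150% of the high watermark, in integer arithmetic. */
static long max_watermark_boost_model(long wmark_high)
{
	return wmark_high * 150 / 100;
}

/* min = 3/4 * (min / 2), applied as two successive integer reductions. */
static long atomic_order0_min_model(long wmark_min, long watermark_boost)
{
	long min = wmark_min + watermark_boost;

	min -= min / 2;   /* first reduction: halve */
	min -= min / 4;   /* second reduction: drop a further quarter */
	return min;
}
```

    With WMARK_MIN = 1272 and WMARK_HIGH = 9385, the sketch gives a maximum
    boost of 14077 pages, an effective minimum of 477 pages (about 1.9MB)
    without boost and 5757 pages (about 22.5MB) with it, in line with the
    ~2MB and ~23MB quoted above.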

    [akpm@linux-foundation.org: fix comment grammar, reflow comment]
    [charante@codeaurora.org: fix suggested by Mel Gorman]
    Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org

    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Vinayak Menon
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds

    Charan Teja Reddy
     
  • zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
    page_alloc: shortcut watermark checks for order-0 pages"). That commit
    simply checks whether the number of free pages exceeds the watermark,
    without additional calculation such as reducing the watermark.

    It accounted for free CMA pages but not for the highatomic reserve.
    This may exhaust all free pages other than the high-order atomic free
    pages.

    Assume that the reserved_highatomic pageblocks are bigger than watermark
    min, and that only a few free pages remain apart from the high-order
    atomic free pages. Because zone_watermark_fast passes the allocation
    without considering the high-order atomic free pages, a normal
    reclaimable allocation such as GFP_HIGHUSER will consume all the free
    pages. An order-0 atomic allocation may then fail.

    This means watermark min is not protected against non-atomic allocations.
    An order-0 atomic allocation with ALLOC_HARDER can fail unexpectedly.
    Additionally, a __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS can
    also fail.

    To avoid the problem, zone_watermark_fast should consider the highatomic
    reserve. If the actual size of the high-order atomic free pages were
    counted accurately, as free CMA pages are, we could use that. This patch
    just uses nr_reserved_highatomic. Additionally, introduce
    __zone_watermark_unusable_free to factor out the common parts of
    zone_watermark_fast and __zone_watermark_ok.
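
    The effect of counting the highatomic reserve can be sketched for the
    order-0 case. The struct, the function names and the boolean parameters
    are simplified stand-ins assumed for illustration; they are not the
    kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, all in pages. */
struct zone_model {
	long free_pages;
	long free_cma;
	long nr_reserved_highatomic;
	long lowmem_reserve;
};

/* Pages this order-0 allocation must not count on. */
static long unusable_free_model(const struct zone_model *z,
				bool alloc_harder, bool may_use_cma)
{
	long unusable = 0;

	assert(z);
	if (!alloc_harder)
		unusable += z->nr_reserved_highatomic;
	if (!may_use_cma)
		unusable += z->free_cma;
	return unusable;
}

/* Old fast check: only free CMA pages are discounted. */
static bool watermark_fast_old_model(const struct zone_model *z, long mark,
				     bool may_use_cma)
{
	long free = z->free_pages - (may_use_cma ? 0 : z->free_cma);

	return free > mark + z->lowmem_reserve;
}

/* New fast check: the highatomic reserve is discounted as well. */
static bool watermark_fast_new_model(const struct zone_model *z, long mark,
				     bool alloc_harder, bool may_use_cma)
{
	long usable = z->free_pages -
		      unusable_free_model(z, alloc_harder, may_use_cma);

	return usable > mark + z->lowmem_reserve;
}
```

    Taking the v4.14 report in this message as input (free:14040kB,
    min:7440kB, reserved_highatomic:32768KB, so 3510, 1860 and 8192 pages at
    an assumed 4 KiB page size), the old check passes a normal allocation
    while the new check rejects it on the fast path.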

    This is an example of an ALLOC_HARDER allocation failure on a v4.19-based
    kernel.

    Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
    Call trace:
    [] dump_stack+0xb8/0xf0
    [] warn_alloc+0xd8/0x12c
    [] __alloc_pages_nodemask+0x120c/0x1250
    [] new_slab+0x128/0x604
    [] ___slab_alloc+0x508/0x670
    [] __kmalloc+0x2f8/0x310
    [] context_struct_to_string+0x104/0x1cc
    [] security_sid_to_context_core+0x74/0x144
    [] security_sid_to_context+0x10/0x18
    [] selinux_secid_to_secctx+0x20/0x28
    [] security_secid_to_secctx+0x3c/0x70
    [] binder_transaction+0xe68/0x454c
    Mem-Info:
    active_anon:102061 inactive_anon:81551 isolated_anon:0
    active_file:59102 inactive_file:68924 isolated_file:64
    unevictable:611 dirty:63 writeback:0 unstable:0
    slab_reclaimable:13324 slab_unreclaimable:44354
    mapped:83015 shmem:4858 pagetables:26316 bounce:0
    free:2727 free_pcp:1035 free_cma:178
    Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
    lowmem_reserve[]: 0 0
    Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
    138826 total pagecache pages
    5460 pages in swap cache
    Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

    This is an example of an ALLOC_NO_WATERMARKS allocation failure on a
    v4.14-based kernel.

    kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
    kswapd0 cpuset=/ mems_allowed=0
    CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
    Call trace:
    [] dump_backtrace+0x0/0x248
    [] show_stack+0x18/0x20
    [] __dump_stack+0x20/0x28
    [] dump_stack+0x68/0x90
    [] warn_alloc+0x104/0x198
    [] __alloc_pages_nodemask+0xdc0/0xdf0
    [] zs_malloc+0x148/0x3d0
    [] zram_bvec_rw+0x410/0x798
    [] zram_rw_page+0x88/0xdc
    [] bdev_write_page+0x70/0xbc
    [] __swap_writepage+0x58/0x37c
    [] swap_writepage+0x40/0x4c
    [] shrink_page_list+0xc30/0xf48
    [] shrink_inactive_list+0x2b0/0x61c
    [] shrink_node_memcg+0x23c/0x618
    [] shrink_node+0x1c8/0x304
    [] kswapd+0x680/0x7c4
    [] kthread+0x110/0x120
    [] ret_from_fork+0x10/0x18
    Mem-Info:
    active_anon:111826 inactive_anon:65557 isolated_anon:0
    active_file:44260 inactive_file:83422 isolated_file:0
    unevictable:4158 dirty:117 writeback:0 unstable:0
    slab_reclaimable:13943 slab_unreclaimable:43315
    mapped:102511 shmem:3299 pagetables:19566 bounce:0
    free:3510 free_pcp:553 free_cma:0
    Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB dirty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768kB unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0
    Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB

    This trace log shows GFP_HIGHUSER allocations consuming free pages right
    before the ALLOC_NO_WATERMARKS allocation.

    -22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    -22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
    kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

    [jaewon31.kim@samsung.com: remove redundant code for high-order]
    Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com

    Reported-by: Yong-Taek Lee
    Suggested-by: Minchan Kim
    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yong-Taek Lee
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • Hugh noted that task_capc() could use unlikely(), as most of the time
    there is no capture in progress and we are in page freeing hot path.
    Indeed adding unlikely() produces assembly that better matches the
    assumption and moves all the tests away from the hot path.

    I have also noticed that we don't need to test for cc->direct_compaction
    as the only place we set current->task_capture is compact_zone_order()
    which also always sets cc->direct_compaction true.
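
    The use of unlikely() can be sketched in userspace with the GCC/Clang
    builtin __builtin_expect. The structs and the function below are
    simplified, hypothetical models, not the kernel's task_capc().

```c
#include <assert.h>
#include <stddef.h>

/* Branch hint in the style of the kernel macro; needs GCC or Clang. */
#define unlikely_model(x) __builtin_expect(!!(x), 0)

/* Hypothetical capture state: page stays NULL until something is captured. */
struct capture_model {
	void *page;
};

struct task_model {
	struct capture_model *capture;   /* NULL when no capture is in progress */
};

/*
 * Return the capture state only if a capture is in progress and has not
 * captured a page yet. The common case (capture == NULL) is hinted as the
 * expected path, so the remaining tests can be laid out off the hot path.
 */
static struct capture_model *task_capc_model(const struct task_model *t)
{
	struct capture_model *capc;

	assert(t);
	capc = t->capture;
	return unlikely_model(capc) && !capc->page ? capc : NULL;
}
```

    The hint does not change the result of the function; it only tells the
    compiler which outcome to optimise the code layout for.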

    Suggested-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Alex Shi
    Cc: Li Wang
    Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka