29 Feb, 2020

1 commit

  • commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream.

    Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
    which catches cases where the affinity setting is not done on the CPU which
    is the current target of the interrupt, triggers during CPU hotplug stress
    testing.

    It turns out that the warning which was added with the commit addressing
    the MSI affinity race unearthed yet another long standing bug.

    If user space writes a bogus affinity mask, i.e. it contains no online CPUs,
    then the kernel calls irq_select_affinity_usr(). This was introduced for ALPHA in

    eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")

    and subsequently made available for all architectures in

    18404756765c ("genirq: Expose default irq affinity mask (take 3)")

    which introduced the circumvention of the affinity setting restrictions for
    interrupts which cannot be moved in process context.

    The whole exercise is bogus in various aspects:

    1) If the interrupt is already started up then there is absolutely
    no point to honour a bogus interrupt affinity setting from user
    space. The interrupt is already assigned to an online CPU and it
    does not make any sense to reassign it to some other randomly
    chosen online CPU.

    2) If the interrupt is not yet started up then there is no point
    either. A subsequent startup of the interrupt will invoke
    irq_setup_affinity() anyway which will choose a valid target CPU.

    So the only correct solution is to just return -EINVAL in case user space
    wrote an affinity mask which does not contain any online CPUs, except for
    ALPHA which has its own magic sauce for this.
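
    A minimal sketch of the resulting check, assuming it sits in the user-space
    affinity write path; the variable names (new_mask, err) and the exact
    placement in kernel/irq/proc.c are illustrative, not the upstream diff:

        if (!cpumask_intersects(new_mask, cpu_online_mask)) {
                /*
                 * Bogus user-space mask without any online CPU: reject it
                 * instead of silently redirecting the interrupt to some
                 * other randomly chosen online CPU.
                 */
                err = -EINVAL;
        } else {
                err = irq_set_affinity(irq, new_mask);
        }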

    Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
    Reported-by: Qian Cai
    Signed-off-by: Thomas Gleixner
    Tested-by: Qian Cai
    Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

11 Feb, 2020

2 commits

  • commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.

    Evan tracked down a subtle race between the update of the MSI message and
    the device raising an interrupt internally on PCI devices which do not
    support MSI masking. The update of the MSI message is non-atomic and
    consists of either 2 or 3 sequential 32bit wide writes to the PCI config
    space.

    - Write address low 32bits
    - Write address high 32bits (If supported by device)
    - Write data

    When an interrupt is migrated then both address and data might change, so
    the kernel attempts to mask the MSI interrupt first. But MSI masking is
    optional, so there exist devices which do not provide it. That means that
    if the device raises an interrupt internally between the writes then an MSI
    message built from half-updated state is sent.

    On x86 this can lead to spurious interrupts on the wrong interrupt
    vector when the affinity setting changes both address and data. As a
    consequence the device interrupt can be lost causing the device to
    become stuck or malfunctioning.

    Evan tried to handle that by disabling MSI across an MSI message
    update. That's not feasible because disabling MSI has issues on its own:

    If MSI is disabled the PCI device is routing an interrupt to the legacy
    INTx mechanism. The INTx delivery can be disabled, but the disablement is
    not working on all devices.

    Some devices lose interrupts when both MSI and INTx delivery are disabled.

    Another way to solve this would be to enforce the allocation of the same
    vector on all CPUs in the system for this kind of screwed devices. That
    could be done, but it would bring back the vector space exhaustion problems
    which got solved a few years ago.

    Fortunately the high address (if supported by the device) is only relevant
    when X2APIC is enabled which implies interrupt remapping. In the interrupt
    remapping case the affinity setting is happening at the interrupt remapping
    unit and the PCI MSI message is programmed only once when the PCI device is
    initialized.

    That makes it possible to solve it with a two step update:

    1) Target the MSI msg to the new vector on the current target CPU

    2) Target the MSI msg to the new vector on the new target CPU

    In both cases writing the MSI message is only changing a single 32bit word
    which prevents the issue of inconsistency.

    After writing the final destination it is necessary to check whether the
    device issued an interrupt while the intermediate state #1 (new vector,
    current CPU) was in effect.

    This is possible because the affinity change is always happening on the
    current target CPU. The code runs with interrupts disabled, so the
    interrupt can be detected by checking the IRR of the local APIC. If the
    vector is pending in the IRR then the interrupt is retriggered on the new
    target CPU by sending an IPI for the associated vector on the target CPU.
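
    A hedged sketch of that sequence; update_msi_msg() and the *_apicid/new_cpu
    variables are illustrative placeholders, not the exact helpers added in
    arch/x86/kernel/apic/msi.c:

        /* Step 1: keep the current target CPU, switch to the new vector */
        update_msi_msg(irqd, cur_cpu_apicid, new_vector);

        /* Step 2: keep the new vector, switch to the new target CPU */
        update_msi_msg(irqd, new_cpu_apicid, new_vector);

        /*
         * If the device fired while step 1 was in effect, the new vector is
         * pending in the local APIC IRR. Retrigger it on the new target CPU.
         */
        if (apic_read(APIC_IRR + (new_vector / 32) * 0x10) &
            (1U << (new_vector % 32)))
                apic->send_IPI(new_cpu, new_vector);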

    This can cause spurious interrupts on both the local and the new target
    CPU.

    1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is suppressed once.

    2) If the new vector is in use already on the local CPU then the IRR check
    might see a pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt.

    3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

    These spurious invocations can expose issues in device driver interrupt
    handlers which are not prepared to handle a spurious interrupt correctly.
    This is not a regression, it's just exposing something which was already
    broken, as spurious interrupts can happen for a lot of reasons and all
    driver handlers need to be able to deal with them.

    Reported-by: Evan Green
    Debugged-by: Evan Green
    Signed-off-by: Thomas Gleixner
    Tested-by: Evan Green
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 0f394daef89b38d58c91118a2b08b8a1b316703b upstream.

    Fix a memory leak reported by kmemleak:
    unreferenced object 0xffff000bc6f50e80 (size 128):
    comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
    hex dump (first 32 bytes):
    00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00 ....A...........
    00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff ........@Q......
    backtrace:
    [] kmem_cache_alloc_trace+0x1a4/0x320
    [] irq_domain_push_irq+0x7c/0x188
    [] thunderx_gpio_probe+0x3ac/0x438
    [] pci_device_probe+0xe4/0x198
    [] really_probe+0xdc/0x320
    [] driver_probe_device+0x5c/0xf0
    [] __device_attach_driver+0x88/0xc0
    [] bus_for_each_drv+0x7c/0xc8
    [] __device_attach+0xe4/0x140
    [] device_initial_probe+0x18/0x20
    [] bus_probe_device+0x98/0xa0
    [] deferred_probe_work_func+0x74/0xa8
    [] process_one_work+0x1c8/0x470
    [] worker_thread+0x1f8/0x428
    [] kthread+0xfc/0x128
    [] ret_from_fork+0x10/0x18

    Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
    Signed-off-by: Kevin Hao
    Signed-off-by: Marc Zyngier
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Kevin Hao
     

05 Nov, 2019

1 commit


18 Sep, 2019

2 commits

  • Pull power management updates from Rafael Wysocki:
    "These include a rework of the main suspend-to-idle code flow (related
    to the handling of spurious wakeups), a switch over of several users
    of cpufreq notifiers to QoS-based limits, a new devfreq driver for
    Tegra20, a new cpuidle driver and governor for virtualized guests, an
    extension of the wakeup sources framework to expose wakeup sources as
    device objects in sysfs, and more.

    Specifics:

    - Rework the main suspend-to-idle control flow to avoid repeating
    "noirq" device resume and suspend operations in case of spurious
    wakeups from the ACPI EC and decouple the ACPI EC wakeups support
    from the LPS0 _DSM support (Rafael Wysocki).

    - Extend the wakeup sources framework to expose wakeup sources as
    device objects in sysfs (Tri Vo, Stephen Boyd).

    - Expose system suspend statistics in sysfs (Kalesh Singh).

    - Introduce a new haltpoll cpuidle driver and a new matching governor
    for virtualized guests wanting to do guest-side polling in the idle
    loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell).

    - Fix the menu and teo cpuidle governors to allow the scheduler tick
    to be stopped if PM QoS is used to limit the CPU idle state exit
    latency in some cases (Rafael Wysocki).

    - Increase the resolution of the play_idle() argument to microseconds
    for more fine-grained injection of CPU idle cycles (Daniel
    Lezcano).

    - Switch over some users of cpuidle notifiers to the new QoS-based
    frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
    policy notifier events (Viresh Kumar).

    - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).

    - Add support for MT8183 and MT8516 to the mediatek cpufreq driver
    (Andrew-sh.Cheng, Fabien Parent).

    - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
    Huang).

    - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).

    - Update the qcom cpufreq driver (among other things, to make it
    easier to extend and to use kryo cpufreq for other nvmem-based
    SoCs) and add qcs404 support to it (Niklas Cassel, Douglas
    RAILLARD, Sibi Sankar, Sricharan R).

    - Fix assorted issues and make assorted minor improvements in the
    cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
    Gustavo Silva, Hariprasad Kelam).

    - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
    Bergmann).

    - Add new Exynos PPMU events to devfreq events and extend that
    mechanism (Lukasz Luba).

    - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).

    - Improve devfreq documentation and governor code, fix spelling typos
    in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez,
    MyungJoo Ham, Gaël PORTAY).

    - Add regulators enable and disable to the OPP (operating performance
    points) framework (Kamil Konieczny).

    - Update the OPP framework to support multiple opp-suspend properties
    (Anson Huang).

    - Fix assorted issues and make assorted minor improvements in the OPP
    code (Niklas Cassel, Viresh Kumar, Yue Hu).

    - Clean up the generic power domains (genpd) framework (Ulf Hansson).

    - Clean up assorted pieces of power management code and documentation
    (Akinobu Mita, Amit Kucheria, Chuhong Yuan).

    - Update the pm-graph tool to version 5.5 including multiple fixes
    and improvements (Todd Brandt).

    - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
    Sébastien Szymanski)"

    * tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits)
    cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
    cpuidle-haltpoll: do not set an owner to allow modunload
    cpuidle-haltpoll: return -ENODEV on modinit failure
    cpuidle-haltpoll: set haltpoll as preferred governor
    cpuidle: allow governor switch on cpuidle_register_driver()
    PM: runtime: Documentation: add runtime_status ABI document
    pm-graph: make setVal unbuffered again for python2 and python3
    powercap: idle_inject: Use higher resolution for idle injection
    cpuidle: play_idle: Increase the resolution to usec
    cpuidle-haltpoll: vcpu hotplug support
    cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
    cpufreq: qcom: Add support for qcs404 on nvmem driver
    cpufreq: qcom: Refactor the driver to make it easier to extend
    cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
    dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
    dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
    Documentation: cpufreq: Update policy notifier documentation
    cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
    PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
    PM / Domains: Simplify genpd_lookup_dev()
    ...

    Linus Torvalds
     
  • Pull core irq updates from Thomas Gleixner:
    "Updates from the irq departement:

    - Update the interrupt spreading code so it handles NUMA nodes with
    different CPU counts properly.

    - A large overhaul of the ARM GICv3 driver to support new PPI and SPI
    ranges.

    - Conversion of all alloc_fwnode() users to use physical addresses
    instead of virtual addresses so the virtual addresses are not
    leaked. The physical address is sufficient to identify the
    associated interrupt chip.

    - Add support for the Marvell MMP3 and Amlogic Meson SM1 interrupt chips.

    - Enforce interrupt threading at compile time if RT is enabled.

    - Small updates and improvements all over the place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices
    irqchip/uniphier-aidet: Use devm_platform_ioremap_resource()
    irqdomain: Add the missing assignment of domain->fwnode for named fwnode
    irqchip/mmp: Coexist with GIC root IRQ controller
    irqchip/mmp: Mask off interrupts from other cores
    irqchip/mmp: Add missing chained_irq_{enter,exit}()
    irqchip/mmp: Do not use of_address_to_resource() to get mux regs
    irqchip/meson-gpio: Add support for meson sm1 SoCs
    dt-bindings: interrupt-controller: New binding for the meson sm1 SoCs
    genirq/affinity: Remove const qualifier from node_to_cpumask argument
    genirq/affinity: Spread vectors on node according to nr_cpu ratio
    genirq/affinity: Improve __irq_build_affinity_masks()
    irqchip: Remove dev_err() usage after platform_get_irq()
    irqchip: Add include guard to irq-partition-percpu.h
    irqchip/mmp: Do not call irq_set_default_host() on DT platforms
    irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map
    irqchip/gic-v3: Add quirks for HIP06/07 invalid GICD_TYPER erratum 161010803
    irqchip/gic: Skip DT quirks when evaluating IIDR-based quirks
    irqchip/gic-v3: Warn about inconsistent implementations of extended ranges
    irqchip/gic-v3: Add EPPI range support
    ...

    Linus Torvalds
     

17 Sep, 2019

3 commits

  • * pm-sleep: (29 commits)
    ACPI: PM: s2idle: Always set up EC GPE for system wakeup
    ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily
    PM / wakeup: Unexport wakeup_source_sysfs_{add,remove}()
    PM / wakeup: Register wakeup class kobj after device is added
    PM / wakeup: Fix sysfs registration error path
    PM / wakeup: Show wakeup sources stats in sysfs
    PM / wakeup: Use wakeup_source_register() in wakelock.c
    PM / wakeup: Drop wakeup_source_init(), wakeup_source_prepare()
    PM: sleep: Replace strncmp() with str_has_prefix()
    PM: suspend: Fix platform_suspend_prepare_noirq()
    intel-hid: Disable button array during suspend-to-idle
    intel-hid: intel-vbtn: Avoid leaking wakeup_mode set
    ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
    ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message
    ACPI: EC: PM: Consolidate some code depending on PM_SLEEP
    ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events()
    ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend
    ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter
    ACPI: PM: s2idle: Rearrange lps0_device_attach()
    PM/sleep: Expose suspend stats in sysfs
    ...

    Rafael J. Wysocki
     
  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler is getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     
  • Pull ia64 updates from Tony Luck:
    "The big change here is removal of support for SGI Altix"

    * tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits)
    genirq: remove the is_affinity_mask_valid hook
    ia64: remove CONFIG_SWIOTLB ifdefs
    ia64: remove support for machvecs
    ia64: move the screen_info setup to common code
    ia64: move the ROOT_DEV setup to common code
    ia64: rework iommu probing
    ia64: remove the unused sn_coherency_id symbol
    ia64: remove the SGI UV simulator support
    ia64: remove the zx1 swiotlb machvec
    ia64: remove CONFIG_ACPI ifdefs
    ia64: remove CONFIG_PCI ifdefs
    ia64: remove the hpsim platform
    ia64: remove now unused machvec indirections
    ia64: remove support for the SGI SN2 platform
    drivers: remove the SGI SN2 IOC4 base support
    drivers: remove the SGI SN2 IOC3 base support
    qla2xxx: remove SGI SN2 support
    qla1280: remove SGI SN2 support
    misc/sgi-xp: remove SGI SN2 support
    char/mspec: remove SGI SN2 support
    ...

    Linus Torvalds
     

06 Sep, 2019

2 commits

  • …-platforms into irq/core

    Pull irqchip updates for Linux 5.4 from Marc Zyngier:

    - Large GICv3 updates to support new PPI and SPI ranges
    - Convert all alloc_fwnode() users to use PAs instead of VAs
    - Add support for Marvell's MMP3 irqchip
    - Add support for Amlogic Meson SM1
    - Various cleanups and fixes

    Thomas Gleixner
     
  • The following crash was observed:

    Unable to handle kernel NULL pointer dereference at 0000000000000158
    Internal error: Oops: 96000004 [#1] SMP
    pc : resend_irqs+0x68/0xb0
    lr : resend_irqs+0x64/0xb0
    ...
    Call trace:
    resend_irqs+0x68/0xb0
    tasklet_action_common.isra.6+0x84/0x138
    tasklet_action+0x2c/0x38
    __do_softirq+0x120/0x324
    run_ksoftirqd+0x44/0x60
    smpboot_thread_fn+0x1ac/0x1e8
    kthread+0x134/0x138
    ret_from_fork+0x10/0x18

    The reason for this is that the interrupt resend mechanism happens in soft
    interrupt context, which is an asynchronous mechanism versus other
    operations on interrupts. free_irq() does not take resend handling into
    account. Thus, the irq descriptor might be already freed before the resend
    tasklet is executed. resend_irqs() does not check the return value of the
    interrupt descriptor lookup and dereferences the return value
    unconditionally.

    1) __setup_irq
         irq_startup
           check_irq_resend          // activate softirq to handle resend irq

    2) irq_domain_free_irqs
         irq_free_descs
           free_desc
             call_rcu(&desc->rcu, delayed_free_desc)

    3) __do_softirq
         tasklet_action
           resend_irqs
             desc = irq_to_desc(irq)
             desc->handle_irq(desc)  // desc is NULL --> Ooops

    Fix this by adding a NULL pointer check in resend_irqs() before dereferencing
    the irq descriptor.
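
    A sketch of the resulting loop, loosely following the described fix in
    kernel/irq/resend.c (paraphrased, not the exact upstream code):

        static void resend_irqs(unsigned long arg)
        {
                struct irq_desc *desc;
                int irq;

                while (!bitmap_empty(irqs_resend, nr_irqs)) {
                        irq = find_first_bit(irqs_resend, nr_irqs);
                        clear_bit(irq, irqs_resend);
                        desc = irq_to_desc(irq);
                        if (!desc)      /* descriptor may already be freed */
                                continue;
                        local_irq_disable();
                        desc->handle_irq(desc);
                        local_irq_enable();
                }
        }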

    Fixes: a4633adcdbc1 ("[PATCH] genirq: add genirq sw IRQ-retrigger")
    Signed-off-by: Yunfeng Ye
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Zhiqiang Liu
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.com

    Yunfeng Ye
     

03 Sep, 2019

1 commit

  • Recently, device pass-through stopped working for a Linux VM running on Hyper-V.

    git-bisect shows the regression is caused by the recent commit
    467a3bb97432 ("PCI: hv: Allocate a named fwnode ..."), but the root cause
    is that the commit d59f6617eef0 forgets to set the domain->fwnode for
    IRQCHIP_FWNODE_NAMED*, and as a result:

    1. The domain->fwnode remains NULL (see the sketch after this list).

    2. irq_find_matching_fwspec() returns NULL since "h->fwnode == fwnode" is
    false, and pci_set_bus_msi_domain() sets the Hyper-V PCI root bus's
    msi_domain to NULL.

    3. When the device is added onto the root bus, the device's dev->msi_domain
    is set to NULL in pci_set_msi_domain().

    4. When a device driver tries to enable MSI-X, pci_msi_setup_msi_irqs()
    calls arch_setup_msi_irqs(), which uses the native MSI chip (i.e.
    arch/x86/kernel/apic/msi.c: pci_msi_controller) to set up the irqs, but
    actually pci_msi_setup_msi_irqs() is supposed to call
    msi_domain_alloc_irqs() with the hbus->irq_domain, which is created in
    hv_pcie_init_irq_domain() and is associated with the Hyper-V chip
    hv_msi_irq_chip. Consequently, the irq line is not properly set up, and
    the device driver can not receive any interrupt.
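
    A minimal sketch of the missing assignment in __irq_domain_add(); the
    surrounding switch over the fwnode types is abbreviated and the error
    handling is omitted:

        switch (fwid->type) {
        case IRQCHIP_FWNODE_NAMED:
        case IRQCHIP_FWNODE_NAMED_ID:
                /*
                 * The missing piece: record the fwnode in the domain so that
                 * irq_find_matching_fwspec() can match it later.
                 */
                domain->fwnode = fwnode;
                domain->name = kstrdup(fwid->name, GFP_KERNEL);
                domain->flags |= IRQ_DOMAIN_NAME_ALLOCATED;
                break;
        default:
                /* other fwnode types keep their existing handling */
                break;
        }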

    Fixes: d59f6617eef0 ("genirq: Allow fwnode to carry name information only")
    Fixes: 467a3bb97432 ("PCI: hv: Allocate a named fwnode instead of an address-based one")
    Reported-by: Lili Deng
    Signed-off-by: Dexuan Cui
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/PU1P153MB01694D9AF625AC335C600C5FBFBE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM

    Dexuan Cui
     

28 Aug, 2019

1 commit

  • When CONFIG_CPUMASK_OFFSTACK isn't enabled, 'cpumask_var_t' is defined as

    'typedef struct cpumask cpumask_var_t[1]',

    so the 'node_to_cpumask' argument of alloc_nodes_vectors() can't be declared
    as 'const cpumask_var_t *'.

    Fixes the following warning:

    kernel/irq/affinity.c: In function '__irq_build_affinity_masks':
    alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
    ^
    kernel/irq/affinity.c:128:13: note: expected 'const struct cpumask (*)[1]' but argument is of type 'struct cpumask (*)[1]'
    static void alloc_nodes_vectors(unsigned int numvecs,
    ^
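
    For reference, the two definitions of cpumask_var_t (paraphrasing
    include/linux/cpumask.h) which make the const-qualified argument type
    differ between configurations:

        #ifdef CONFIG_CPUMASK_OFFSTACK
        typedef struct cpumask *cpumask_var_t;         /* pointer form */
        #else
        typedef struct cpumask cpumask_var_t[1];        /* array form */
        #endif

        /*
         * With the array form, 'const cpumask_var_t *' becomes
         * 'const struct cpumask (*)[1]', which does not match an argument of
         * type 'struct cpumask (*)[1]' without the warning above, hence the
         * const qualifier is dropped from the parameter.
         */
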
    Fixes: b1a5a73e64e9 ("genirq/affinity: Spread vectors on node according to nr_cpu ratio")
    Reported-by: kbuild test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190828085815.19931-1-ming.lei@redhat.com

    Ming Lei
     

27 Aug, 2019

2 commits

  • Now __irq_build_affinity_masks() spreads vectors evenly per node, but there
    is a case where not all vectors are spread when the NUMA nodes have
    different numbers of CPUs, which triggers the warning in the spreading code.

    Improve the spreading algorithm by

    - assigning vectors according to the ratio of the number of CPUs on a node
    to the number of remaining CPUs.

    - running the assignment from smaller nodes to bigger nodes to guarantee
    that every active node gets allocated at least one vector.

    This ensures that all vectors are spread out. Aside from that, the spread
    becomes fairer if the nodes have different numbers of CPUs.

    For example, on the following machine:
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 1
    Core(s) per socket: 8
    Socket(s): 2
    NUMA node(s): 2
    ...
    NUMA node0 CPU(s): 0,1,3,5-9,11,13-15
    NUMA node1 CPU(s): 2,4,10,12

    When a driver requests to allocate 8 vectors, the following spread results:

    irq 31, cpu list 2,4
    irq 32, cpu list 10,12
    irq 33, cpu list 0-1
    irq 34, cpu list 3,5
    irq 35, cpu list 6-7
    irq 36, cpu list 8-9
    irq 37, cpu list 11,13
    irq 38, cpu list 14-15

    So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original
    algorithm assigned 4 vectors on each node which was unfair versus Node 0.
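
    A hedged sketch of the per-node allotment used by the new algorithm; the
    function and variable names are illustrative, not the code in
    kernel/irq/affinity.c:

        /*
         * Nodes are walked from the smallest CPU count to the biggest. Each
         * node gets a share of the remaining vectors proportional to its
         * share of the remaining CPUs, but at least one vector.
         */
        static unsigned int vecs_for_node(unsigned int ncpus_on_node,
                                          unsigned int remaining_cpus,
                                          unsigned int remaining_vecs)
        {
                unsigned int nvecs;

                nvecs = DIV_ROUND_UP(ncpus_on_node * remaining_vecs,
                                     remaining_cpus);
                return max(1U, nvecs);
        }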

    [ tglx: Massaged changelog ]

    Reported-by: Jon Derrick
    Signed-off-by: Ming Lei
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Keith Busch
    Reviewed-by: Jon Derrick
    Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com

    Ming Lei
     
  • One invariant of __irq_build_affinity_masks() is that all CPUs in the
    specified masks (cpu_mask AND node_to_cpumask for each node) should be
    covered during the spread. Even though all requested vectors have been
    reached, it's still required to spread vectors among remained CPUs. A
    similar policy has been taken in case of 'numvecs
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190816022849.14075-2-ming.lei@redhat.com

    Ming Lei
     

20 Aug, 2019

1 commit

  • If alloc_descs() fails before irq_sysfs_init() has run, free_desc() in the
    cleanup path will call kobject_del() even though the kobject has not been
    added with kobject_add().

    Fix this by making the call to kobject_del() conditional on whether
    irq_sysfs_init() has run.

    This problem surfaced because commit aa30f47cf666 ("kobject: Add support
    for default attribute groups to kobj_type") makes kobject_del() stricter
    about pairing with kobject_add(). If the pairing is incorrect, a WARNING
    and backtrace occur in sysfs_remove_group() because there is no parent.
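
    A minimal sketch of the conditional removal, assuming the sysfs base
    kobject pointer in kernel/irq/irqdesc.c serves as the indicator that
    irq_sysfs_init() has run:

        static void irq_sysfs_del(struct irq_desc *desc)
        {
                /*
                 * Only undo kobject_add() if irq_sysfs_init() already ran and
                 * therefore actually added this descriptor's kobject.
                 */
                if (irq_kobj_base)
                        kobject_del(&desc->kobj);
        }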

    [ tglx: Add a comment to the code and make it work with CONFIG_SYSFS=n ]

    Fixes: ecb3f394c5db ("genirq: Expose interrupt information through sysfs")
    Signed-off-by: Michael Kelley
    Signed-off-by: Thomas Gleixner
    Acked-by: Greg Kroah-Hartman
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1564703564-4116-1-git-send-email-mikelley@microsoft.com

    Michael Kelley
     

19 Aug, 2019

1 commit

  • Switch force_irqthreads from a boot time modifiable variable to a compile
    time constant when CONFIG_PREEMPT_RT is enabled.
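
    A sketch of the change, roughly in the style of include/linux/interrupt.h;
    the exact config symbols and comments may differ:

        #ifdef CONFIG_IRQ_FORCED_THREADING
        # ifdef CONFIG_PREEMPT_RT
        #  define force_irqthreads      (true)  /* compile-time constant on RT */
        # else
        extern bool force_irqthreads;           /* still boot-time switchable */
        # endif
        #else
        # define force_irqthreads      (false)
        #endif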

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190816160923.12855-1-bigeasy@linutronix.de

    Thomas Gleixner
     

17 Aug, 2019

1 commit


08 Aug, 2019

1 commit

  • Since commit c66d4bd110a1f8 ("genirq/affinity: Add new callback for
    (re)calculating interrupt sets"), irq_create_affinity_masks() returns
    NULL in case of single vector. This change has caused regression on some
    drivers, such as lpfc.

    The problem is that single vector requests can happen in some generic cases:

    1) kdump kernel

    2) irq vectors resource is close to exhaustion.

    If in that situation the affinity mask for a single vector is not created,
    every caller has to handle the special case.

    There is no reason why the mask cannot be created, so remove the check for
    a single vector and create the mask.

    Fixes: c66d4bd110a1f8 ("genirq/affinity: Add new callback for (re)calculating interrupt sets")
    Signed-off-by: Ming Lei
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190805011906.5020-1-ming.lei@redhat.com

    Ming Lei
     

07 Aug, 2019

1 commit

  • Booting a large arm64 server (HiSi D05) leads to the following
    shouting at boot time:

    [ 20.722132] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.730851] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.739560] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.748267] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.756975] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.765683] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
    [ 20.774391] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!

    and many more... Evidently, we expect something a bit more informative
    than ____ptrval____, and certainly we want all of our domains, not just
    the first one.

    For that, turn the %p used to generate the fwnode name into something
    that won't be repainted (%pa). Given that we've now fixed all users to
    pass a pointer to a PA, it will actually do the right thing.
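
    A hedged sketch of the name generation after the change, assuming the
    (already converted) callers hand in a physical address; the exact call
    site in __irq_domain_alloc_fwnode() may differ:

        /*
         * "%p" hashes the pointer, so every domain rendered as
         * irqchip@(____ptrval____); "%pa" prints the stable physical address.
         */
        name = kasprintf(GFP_KERNEL, "irqchip@%pa", &pa);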

    Acked-by: Thomas Gleixner
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

25 Jul, 2019

1 commit


23 Jul, 2019

1 commit

  • Introduce a new function, rearm_wake_irq(), allowing a wakeup IRQ
    to be armed for system wakeup detection again without running any
    action handlers associated with it after it has been armed for
    wakeup detection and triggered.

    That is useful for IRQs, like ACPI SCI, that may deliver wakeup
    as well as non-wakeup interrupts when armed for system wakeup
    detection. In those cases, it may be possible to determine whether
    or not the delivered interrupt is a system wakeup one without
    running the entire action handler (or handlers, if the IRQ is
    shared) for the IRQ, and if the interrupt turns out to be a
    non-wakeup one, the IRQ can be rearmed with the help of the
    new function.
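
    A hedged usage sketch; wakeup_event_pending() is a made-up placeholder for
    whatever cheap check tells the caller that the delivered interrupt was not
    a real system wakeup event:

        /*
         * Armed for wakeup, an interrupt arrived, but it was not a wakeup
         * event: rearm the IRQ without running the full action handlers.
         */
        if (!wakeup_event_pending())
                rearm_wake_irq(wake_irq);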

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Thomas Gleixner

    Rafael J. Wysocki
     

09 Jul, 2019

2 commits

  • Pull x86 apic updates from Thomas Gleixner:
    "Updates for the x86 APIC interrupt handling and APIC timer:

    - Fix a long standing issue with spurious interrupts which was caused
    by the big vector management rework a few years ago. Robert Hodaszi
    provided finally enough debug data and an excellent initial failure
    analysis which allowed to understand the underlying issues.

    This contains a change to the core interrupt management code which
    is required to handle this correctly for the APIC/IO_APIC. The core
    changes are NOOPs for most architectures except ARM64. ARM64 is not
    impacted by the change as confirmed by Marc Zyngier.

    - Newer systems allow to disable the PIT clock for power saving
    causing panic in the timer interrupt delivery check of the IO/APIC
    when the HPET timer is not enabled either. While the clock could be
    turned on this would cause an endless whack a mole game to chase
    the proper register in each affected chipset.

    These systems provide the relevant frequencies for TSC, CPU and the
    local APIC timer via CPUID and/or MSRs, which allows to avoid the
    PIT/HPET based calibration. As the calibration code is the only
    usage of the legacy timers on modern systems and is skipped anyway
    when the frequencies are known already, there is no point in
    setting up the PIT and actually checking for the interrupt delivery
    via IO/APIC.

    To achieve this on a wide variety of platforms, the CPUID/MSR based
    frequency readout has been made more robust, which also allowed to
    remove quite some workarounds which turned out to be no longer
    required. Thanks to Daniel Drake for analysis, patches and
    verification"

    * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/irq: Seperate unused system vectors from spurious entry again
    x86/irq: Handle spurious interrupt after shutdown gracefully
    x86/ioapic: Implement irq_get_irqchip_state() callback
    genirq: Add optional hardware synchronization for shutdown
    genirq: Fix misleading synchronize_irq() documentation
    genirq: Delay deactivation in free_irq()
    x86/timer: Skip PIT initialization on modern chipsets
    x86/apic: Use non-atomic operations when possible
    x86/apic: Make apic_bsp_setup() static
    x86/tsc: Set LAPIC timer period to crystal clock frequency
    x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
    x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency

    Linus Torvalds
     
  • Pull irq updates from Thomas Gleixner:
    "The irq departement provides the usual mixed bag:

    Core:

    - Further improvements to the irq timings code which aims to predict
    the next interrupt for power state selection to achieve better
    latency/power balance

    - Add interrupt statistics to the core NMI handlers

    - The usual small fixes and cleanups

    Drivers:

    - Support for Renesas RZ/A1, Annapurna Labs FIC, Meson-G12A SoC and
    Amazon Graviton ARM/GIC interrupt controllers.

    - Rework of the Renesas INTC controller driver

    - ACPI support for Socionext SoCs

    - Enhancements to the CSKY interrupt controller

    - The usual small fixes and cleanups"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    irq/irqdomain: Fix comment typo
    genirq: Update irq stats from NMI handlers
    irqchip/gic-pm: Remove PM_CLK dependency
    irqchip/al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver
    dt-bindings: interrupt-controller: Add Amazon's Annapurna Labs FIC
    softirq: Use __this_cpu_write() in takeover_tasklets()
    irqchip/mbigen: Stop printing kernel addresses
    irqchip/gic: Add dependency for ARM_GIC_MAX_NR
    genirq/affinity: Remove unused argument from [__]irq_build_affinity_masks()
    genirq/timings: Add selftest for next event computation
    genirq/timings: Add selftest for irqs circular buffer
    genirq/timings: Add selftest for circular array
    genirq/timings: Encapsulate storing function
    genirq/timings: Encapsulate timings push
    genirq/timings: Optimize the period detection speed
    genirq/timings: Fix timings buffer inspection
    genirq/timings: Fix next event index function
    irqchip/qcom: Use struct_size() in devm_kzalloc()
    irqchip/irq-csky-mpintc: Remove unnecessary loop in interrupt handler
    dt-bindings: interrupt-controller: Update csky mpintc
    ...

    Linus Torvalds
     

06 Jul, 2019

2 commits

  • Fix typo in the comment on top of __irq_domain_add().

    Signed-off-by: Zenghui Yu
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1562388072-23492-1-git-send-email-yuzenghui@huawei.com

    Zenghui Yu
     
  • The NMI handlers handle_percpu_devid_fasteoi_nmi() and handle_fasteoi_nmi()
    do not update the interrupt counts. Due to that the NMI interrupt count
    does not show up correctly in /proc/interrupts.

    Add the statistics and treat the NMI handlers in the same way as per cpu
    interrupts and prevent them from updating irq_desc::tot_count as this might
    be corrupted due to concurrency.
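
    An abbreviated, hedged sketch of the accounting added to the NMI flow
    handlers in kernel/irq/chip.c, reusing the per-CPU statistics helper of
    the regular per-CPU handlers; the real handler bodies contain more than
    what is shown here:

        void handle_fasteoi_nmi(struct irq_desc *desc)
        {
                struct irqaction *action = desc->action;

                /*
                 * Per-CPU count only: irq_desc::tot_count is deliberately not
                 * touched, as it cannot be updated race-free from this context.
                 */
                __kstat_incr_irqs_this_cpu(desc);

                action->handler(irq_desc_get_irq(desc), action->dev_id);
                /* chip->irq_eoi() handling etc. is unchanged and omitted */
        }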

    [ tglx: Massaged changelog ]

    Fixes: 2dcf1fbcad35 ("genirq: Provide NMI handlers")
    Signed-off-by: Shijith Thotton
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1562313336-11888-1-git-send-email-sthotton@marvell.com

    Shijith Thotton
     

03 Jul, 2019

3 commits

  • free_irq() ensures that no hardware interrupt handler is executing on a
    different CPU before actually releasing resources and deactivating the
    interrupt completely in a domain hierarchy.

    But that does not catch the case where the interrupt is in flight at the
    hardware level but not yet serviced by the target CPU. That creates an
    interesting race condition:

    CPU 0                   CPU 1                   IRQ CHIP

                                                    interrupt is raised
                                                    sent to CPU1
                            Unable to handle
                            immediately
                            (interrupts off,
                             deep idle delay)
    mask()
    ...
    free()
      shutdown()
      synchronize_irq()
      release_resources()
                            do_IRQ()
                              -> resources are not available

    That might be harmless and just trigger a spurious interrupt warning, but
    some interrupt chips might get into a wedged state.

    Utilize the existing irq_get_irqchip_state() callback for the
    synchronization in free_irq().

    synchronize_hardirq() is not using this mechanism as it might actually
    deadlock under certain conditions, e.g. when called with interrupts
    disabled and the target CPU is the one on which the synchronization is
    invoked. synchronize_irq() uses it because that function cannot be called
    from non-preemptible contexts as it might sleep.

    No functional change intended and according to Marc the existing GIC
    implementations where the driver supports the callback should be able
    to cope with that core change. Famous last words.
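
    A hedged sketch of the additional wait, loosely following the description
    above; the real loop in kernel/irq/manage.c is structured differently:

        bool in_flight;

        do {
                cpu_relax();
                in_flight = irqd_irq_inprogress(&desc->irq_data);
                /*
                 * Additionally ask the irq chip whether the interrupt is
                 * still active at the hardware level, if it implements the
                 * irq_get_irqchip_state() callback.
                 */
                if (!in_flight)
                        irq_get_irqchip_state(irq, IRQCHIP_STATE_ACTIVE,
                                              &in_flight);
        } while (in_flight);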

    Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode")
    Reported-by: Robert Hodaszi
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Marc Zyngier
    Tested-by: Marc Zyngier
    Link: https://lkml.kernel.org/r/20190628111440.279463375@linutronix.de

    Thomas Gleixner
     
  • The function might sleep, so it cannot be called from interrupt
    context. Not even with care.

    Signed-off-by: Thomas Gleixner
    Cc: Marc Zyngier
    Link: https://lkml.kernel.org/r/20190628111440.189241552@linutronix.de

    Thomas Gleixner
     
  • When interrupts are shutdown, they are immediately deactivated in the
    irqdomain hierarchy. While this looks obviously correct there is a subtle
    issue:

    There might be an interrupt in flight when free_irq() is invoking the
    shutdown. This is properly handled at the irq descriptor / primary handler
    level, but the deactivation might completely disable resources which are
    required to acknowledge the interrupt.

    Split the shutdown code and deactivate the interrupt after synchronization
    in free_irq(). Fixup all other usage sites where this is not an issue to
    invoke the combined shutdown_and_deactivate() function instead.

    This still might be an issue if the interrupt in flight servicing is
    delayed on a remote CPU beyond the invocation of synchronize_irq(), but
    that cannot be handled at that level and needs to be handled in the
    synchronize_irq() context.
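
    A sketch of the combined helper along the lines described above (the
    upstream version in kernel/irq/chip.c may differ in detail):

        void irq_shutdown_and_deactivate(struct irq_desc *desc)
        {
                irq_shutdown(desc);
                /*
                 * Deactivation tears down domain resources. free_irq() does
                 * not use this combined helper; it shuts down, synchronizes
                 * and only then deactivates. All other shutdown sites can
                 * safely do both in one go.
                 */
                irq_domain_deactivate_irq(&desc->irq_data);
        }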

    Fixes: f8264e34965a ("irqdomain: Introduce new interfaces to support hierarchy irqdomains")
    Reported-by: Robert Hodaszi
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Marc Zyngier
    Link: https://lkml.kernel.org/r/20190628111440.098196390@linutronix.de

    Thomas Gleixner
     

21 Jun, 2019

1 commit

  • In the presence of any form of instrumentation, nmi_enter() should be
    done before calling any traceable code and any instrumentation code.

    Currently, nmi_enter() is done in handle_domain_nmi(), which is much
    too late as instrumentation code might get called before. Move the
    nmi_enter/exit() calls to the arch IRQ vector handler.

    On arm64, it is not possible to know if the IRQ vector handler was
    called because of an NMI before acknowledging the interrupt. However, it
    is possible to know whether normal interrupts could be taken in the
    interrupted context (i.e. if taking an NMI in that context could
    introduce a potential race condition).

    When interrupting a context with IRQs disabled, call nmi_enter() as soon
    as possible. In contexts with IRQs enabled, defer this to the interrupt
    controller, which is in a better position to know if an interrupt taken
    is an NMI.

    Fixes: bc3c03ccb464 ("arm64: Enable the support of pseudo-NMIs")
    Cc: # 5.1.x-
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Jason Cooper
    Cc: Mark Rutland
    Reviewed-by: Marc Zyngier
    Signed-off-by: Julien Thierry
    Signed-off-by: Catalin Marinas

    Julien Thierry
     

12 Jun, 2019

9 commits

  • The *affd argument is neither used in irq_build_affinity_masks() nor
    __irq_build_affinity_masks(). Remove it.

    Signed-off-by: Minwoo Im
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ming Lei
    Cc: Minwoo Im
    Cc: linux-block@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190602112117.31839-1-minwoo.im.dev@gmail.com

    Minwoo Im
     
  • The circular buffers are now validated with selftests. The next interrupt
    index algorithm which is the hardest part to validate needs extra coverage.

    Add a selftest which uses the intervals stored in the arrays and insert all
    the values except the last one. The next event computation must return the
    same value as the last element which was not inserted.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-9-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • After testing the per cpu interrupt circular event, make sure the per
    interrupt circular buffer usage is correct.

    Add tests to validate the interrupt circular buffer.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-8-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • Due to the complexity of the code and the difficulty to debug it, add some
    selftests to the framework in order to spot issues or regression at boot
    time when the runtime testing is enabled for this subsystem.

    This tests the circular buffer at the limits and validates:
    - the encoding / decoding of the values
    - the macro to browse the irq timings circular buffer
    - the function to push data in the circular buffer

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-7-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • For the next patches providing the selftest, it is required to insert
    interval values directly in the buffer in order to check the correctness of
    the code. Encapsulate the code doing that in an always-inline function in
    order to reuse it in the test code.

    No functional changes.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-6-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • For the next patches providing the selftest, it is required to artificially
    insert timings value in the circular buffer in order to check the
    correctness of the code. Encapsulate the common code between the future
    test code and the current code with an always-inline tag.

    No functional change.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-5-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • With a minimal period, if there is another period which is a multiple of it
    but less than the max period, then that multiple will be detected first and
    the minimal period will never be reached.

    1 2 1 2 1 2 1 2 1 2 1 2

    In that case, the minimum period is 2 and the maximum period is 5. That
    means all repeating patterns of 2 will be detected as repeating patterns of
    4, so it is pointless to go up to 2 when searching for the period as it will
    always fail.

    Remove one loop iteration by increasing the minimal period to 3.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-4-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • It appears the index beginning computation is not correct, the current
    code does:

    i = (irqts->count & IRQ_TIMINGS_MASK) - 1

    If irqts->count is equal to zero, we end up with an index equal to -1,
    but that does not happen because the function checks against zero
    before and returns in such case.

    However, if irqts->count is a multiple of IRQ_TIMINGS_SIZE, the
    resulting AND operation yields zero, which also leads to a -1 index.

    Re-introduce the iteration loop belonging to the previous variance
    code which was correct.

    Fixes: bbba0e7c5cda "genirq/timings: Add array suffix computation code"
    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-3-daniel.lezcano@linaro.org

    Daniel Lezcano
     
  • The current code happens to work with most of the interval sample tests,
    but it actually fails to correctly detect pattern repetition breaking at
    the end of the buffer.

    Narrowing down the bug has been a real pain because of the pointers, so the
    routine is rewritten using indexes instead.

    Fixes: bbba0e7c5cda "genirq/timings: Add array suffix computation code"
    Signed-off-by: Daniel Lezcano
    Signed-off-by: Thomas Gleixner
    Cc: andriy.shevchenko@linux.intel.com
    Link: https://lkml.kernel.org/r/20190527205521.12091-2-daniel.lezcano@linaro.org

    Daniel Lezcano
     

29 May, 2019

1 commit