18 Apr, 2019

1 commit

  • Interactive governor has lived in Android sources for a very long time
    and this commit is based on the code present in following branch:

    https://android.googlesource.com/kernel/common android-4.4

    The Interactive governor is designed for latency-sensitive workloads,
    such as interactive user interfaces on mobile phones and tablets.
    It aims to respond significantly faster, ramping the CPU up quickly
    when CPU-intensive activity begins.

    Existing governors sample CPU load at a particular rate, typically every
    X ms and then update the frequency from a work-handler. This can lead
    to under-powering UI threads for the period of time during which the
    user begins interacting with a previously-idle system until the next
    sample period happens.

    The 'interactive' governor uses a different approach.

    A real-time thread is used for scaling up, giving the remaining tasks
    the CPU performance benefit, unlike existing governors which are more
    likely to schedule ramp-up work to occur after your performance starved
    tasks have completed.

    The Android version of interactive governor also checks whether to scale
    the CPU frequency up soon after coming out of idle. When the CPU comes
    out of idle, the governor checks whether CPU sampling is overdue.
    If so, it starts sampling immediately. Otherwise, the utilization
    hooks from the scheduler handle the sampling later. If the CPU is very
    busy from exiting idle to when the evaluation happens, then it assumes
    that the CPU is under-powered and ramps it to MAX speed.

    If the CPU was not sufficiently busy to immediately ramp to MAX speed,
    then the governor evaluates the CPU load since the last speed
    adjustment, choosing the highest value between that longer-term load or
    the short-term load since idle exit to determine the CPU speed to ramp
    to.
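
    The decision described above can be sketched as a small model. This is
    an illustration only, not the governor's code: the function name, the
    tuple it returns and the go_max_threshold value are assumptions made
    for the example.

```python
def choose_target_load(short_term_load, long_term_load, go_max_threshold=99):
    """Model one evaluation after idle exit.

    short_term_load: load (percent) since the CPU left idle.
    long_term_load:  load (percent) since the last speed adjustment.
    go_max_threshold is a hypothetical cut-off for "very busy".
    """
    if short_term_load >= go_max_threshold:
        # Very busy since idle exit: assume the CPU is under-powered.
        return ("max", None)
    # Otherwise use the higher of the two loads to pick a speed.
    return ("scale", max(short_term_load, long_term_load))
```

    With these assumptions, a CPU that was saturated since idle exit goes
    straight to the maximum speed, while a moderately busy one is scaled
    according to whichever load figure is higher.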

    Idle notifiers will be handled later and are not included for now.

    The core of this code is written and maintained (in Android
    repositories) by Mike Chan and Todd Poynor over a long period of time.

    Vireshk has made changes to the governor to align it with the current
    practices followed with mainline governors, like using utilization hooks
    from the scheduler and handling kobject (for governor's sysfs directory)
    in a race free manner. And of course this included general cleanup of
    the governor as well.

    Signed-off-by: Mike Chan
    Signed-off-by: Todd Poynor
    Signed-off-by: Viresh Kumar
    Signed-off-by: Vipul Kumar

    Viresh Kumar
     

10 Jan, 2019

1 commit

  • commit 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 upstream.

    Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
    the system is deemed affected by L1TF vulnerability. Even though the limit
    is quite high for most deployments it seems to be too restrictive for
    deployments which are willing to live with the mitigation disabled.

    We have a customer deploying 8x 6.4TB PCIe/NVMe SSD swap devices, which
    is clearly over the limit.
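
    The arithmetic behind that statement, as a quick check. The ~16TB
    figure is the approximate limit quoted above; decimal units (1 TB =
    1000 GB) are an assumption made to keep the numbers integral.

```python
# Approximate limit from the text, in GB (assumes decimal units).
LIMIT_GB = 16 * 1000

# Eight swap devices of 6.4 TB each, expressed in GB.
DEVICES = 8
DEVICE_GB = 6400

total_gb = DEVICES * DEVICE_GB          # 51200 GB, i.e. 51.2 TB
exceeds_limit = total_gb > LIMIT_GB     # True: over three times the limit
```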

    Drop the swap restriction when l1tf=off is specified. It also doesn't make
    much sense to warn about too much memory for the l1tf mitigation when it is
    forcefully disabled by the administrator.

    [ tglx: Folded the documentation delta change ]

    Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
    Signed-off-by: Michal Hocko
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Andi Kleen
    Acked-by: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

06 Dec, 2018

4 commits

  • commit 55a974021ec952ee460dc31ca08722158639de72 upstream

    Provide the possibility to enable IBPB always in combination with 'prctl'
    and 'seccomp'.

    Add the extra command line options and rework the IBPB selection to
    evaluate the command instead of the mode selected by the STIBP switch case.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185006.144047038@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6b3e64c237c072797a9ec918654a60e3a46488e2 upstream

    If 'prctl' mode of user space protection from spectre v2 is selected
    on the kernel command-line, STIBP and IBPB are applied on tasks which
    restrict their indirect branch speculation via prctl.

    SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it
    makes sense to prevent spectre v2 user space to user space attacks as
    well.

    The Intel mitigation guide documents how STIBP works:

    Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
    prevents the predicted targets of indirect branches on any logical
    processor of that core from being controlled by software that executes
    (or executed previously) on another logical processor of the same core.

    Ergo setting STIBP protects the task itself from being attacked from a task
    running on a different hyper-thread and protects the tasks running on
    different hyper-threads from being attacked.

    While the document suggests that the branch predictors are shielded between
    the logical processors, the observed performance regressions suggest that
    STIBP simply disables the branch predictor more or less completely. Of
    course the document wording is vague, but the fact that there is also no
    requirement for issuing IBPB when STIBP is used points clearly in that
    direction. The kernel still issues IBPB even when STIBP is used until Intel
    clarifies the whole mechanism.

    IBPB is issued when the task switches out, so malicious sandbox code cannot
    mistrain the branch predictor for the next user space task on the same
    logical processor.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 7cc765a67d8e04ef7d772425ca5a2a1e2b894c15 upstream

    Now that all prerequisites are in place:

    - Add the prctl command line option

    - Default the 'auto' mode to 'prctl'

    - When SMT state changes, update the static key which controls the
    conditional STIBP evaluation on context switch.

    - At init update the static key which controls the conditional IBPB
    evaluation on context switch.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.958421388@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit fa1202ef224391b6f5b26cdd44cc50495e8fab54 upstream

    Add command line control for user space indirect branch speculation
    mitigations. The new option is: spectre_v2_user=

    The initial options are:

    - on: Unconditionally enabled
    - off: Unconditionally disabled
    - auto: Kernel selects mitigation (default off for now)

    When the spectre_v2= command line argument is either 'on' or 'off' this
    implies that the application to application control follows that state even
    if a contradicting spectre_v2_user= argument is supplied.
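
    The precedence rule can be modelled in a few lines. This is a
    simplified sketch, not the kernel's parser: the function name is
    hypothetical, and what 'auto' finally resolves to is deliberately left
    out.

```python
def effective_user_mode(spectre_v2, spectre_v2_user="auto"):
    """Return the setting that governs application-to-application control.

    Both arguments are the strings given on the command line
    ("on", "off" or "auto"). Hypothetical helper for illustration only.
    """
    if spectre_v2 in ("on", "off"):
        # spectre_v2=on/off wins, even over a contradicting
        # spectre_v2_user= argument.
        return spectre_v2
    return spectre_v2_user
```

    For example, spectre_v2=off together with spectre_v2_user=on yields
    "off" in this model, while spectre_v2=auto leaves the choice to
    spectre_v2_user=.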

    Originally-by: Tim Chen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 Dec, 2018

2 commits

  • commit 544b03da39e2d7b4961d3163976ed4bfb1fac509 upstream.

    At the request of the reporter, the Linux kernel security team offers to
    postpone the publishing of a fix for up to 5 business days from the date
    of a report.

    While it is generally undesirable to keep a fix private after it has
    been developed, this short window is intended to allow distributions to
    package the fix into their kernel builds and permits early inclusion of
    the security team in the case of a co-ordinated disclosure with other
    parties. Unfortunately, discussions with major Linux distributions and
    cloud providers have revealed that 5 business days is not sufficient to
    achieve either of these two goals.

    As an example, cloud providers need to roll out KVM security fixes to a
    global fleet of hosts with sufficient early ramp-up and monitoring. An
    end-to-end timeline of less than two weeks dramatically cuts into the
    amount of early validation and increases the chance of guest-visible
    regressions.

    The consequence of this timeline mismatch is that security issues are
    commonly fixed without the involvement of the Linux kernel security team
    and are instead analysed and addressed by an ad-hoc group of developers
    across companies contributing to Linux. In some cases, mainline (and
    therefore the official stable kernels) can be left to languish for
    extended periods of time. This undermines the Linux kernel security
    process and puts upstream developers in a difficult position should they
    find themselves involved with an undisclosed security problem that they
    are unable to report due to restrictions from their employer.

    To accommodate the needs of these users of the Linux kernel and
    encourage them to engage with the Linux security team when security
    issues are first uncovered, extend the maximum period for which fixes
    may be delayed to 7 calendar days, or 14 calendar days in exceptional
    cases, where the logistics of QA and large scale rollouts specifically
    need to be accommodated. This brings parity with the linux-distros@
    maximum embargo period of 14 calendar days.

    Cc: Paolo Bonzini
    Cc: David Woodhouse
    Cc: Amit Shah
    Cc: Laura Abbott
    Acked-by: Kees Cook
    Co-developed-by: Thomas Gleixner
    Co-developed-by: David Woodhouse
    Signed-off-by: Thomas Gleixner
    Signed-off-by: David Woodhouse
    Signed-off-by: Will Deacon
    Reviewed-by: Tyler Hicks
    Acked-by: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit 14fdc2c5318ae420e68496975f48dc1dbef52649 upstream.

    The Linux kernel security team has been accused of rejecting the idea of
    security embargoes. This is incorrect, and could dissuade people from
    reporting security issues to us under the false assumption that the
    issue would leak prematurely.

    Clarify the handling of embargoed information in our process
    documentation.

    Co-developed-by: Ingo Molnar
    Acked-by: Kees Cook
    Acked-by: Peter Zijlstra
    Acked-by: Laura Abbott
    Signed-off-by: Will Deacon
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

27 Nov, 2018

2 commits

  • commit 781f0766cc41a9dd2e5d118ef4b1d5d89430257b upstream.

    Devices connected under Terminus Technology Inc. Hub (1a40:0101) may
    fail to work after the system resumes from suspend:
    [ 206.063325] usb 3-2.4: reset full-speed USB device number 4 using xhci_hcd
    [ 206.143691] usb 3-2.4: device descriptor read/64, error -32
    [ 206.351671] usb 3-2.4: device descriptor read/64, error -32

    Info for this hub:
    T: Bus=03 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#= 2 Spd=480 MxCh= 4
    D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1
    P: Vendor=1a40 ProdID=0101 Rev=01.11
    S: Product=USB 2.0 Hub
    C: #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA
    I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub

    Some experiments indicate that the USB devices connected to the hub are
    innocent; it is the hub itself that is to blame. The hub needs extra
    delay time after it resets its port.

    Hence wait for extra delay, if the device is connected to this quirky
    hub.

    Signed-off-by: Kai-Heng Feng
    Cc: stable
    Acked-by: Alan Stern
    Signed-off-by: Greg Kroah-Hartman

    Kai-Heng Feng
     
  • [ Upstream commit d2266bbfa9e3e32e3b642965088ca461bd24a94f ]

    The "pciserial" earlyprintk variant helps much on many modern x86
    platforms, but unfortunately there are still some platforms with PCI
    UART devices which have the wrong PCI class code. In that case, the
    current class code check does not allow for them to be used for logging.

    Add a sub-option "force" which overrides the class code check and thus
    the use of such device can be enforced.

    [ bp: massage formulations. ]

    Suggested-by: Borislav Petkov
    Signed-off-by: Feng Tang
    Signed-off-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: "Stuart R . Anderson"
    Cc: Bjorn Helgaas
    Cc: David Rientjes
    Cc: Feng Tang
    Cc: Frederic Weisbecker
    Cc: Greg Kroah-Hartman
    Cc: H Peter Anvin
    Cc: Ingo Molnar
    Cc: Jiri Kosina
    Cc: Jonathan Corbet
    Cc: Kai-Heng Feng
    Cc: Kate Stewart
    Cc: Konrad Rzeszutek Wilk
    Cc: Peter Zijlstra
    Cc: Philippe Ombredanne
    Cc: Thomas Gleixner
    Cc: Thymo van Beers
    Cc: alan@linux.intel.com
    Cc: linux-doc@vger.kernel.org
    Link: http://lkml.kernel.org/r/20181002164921.25833-1-feng.tang@intel.com
    Signed-off-by: Sasha Levin

    Feng Tang
     

14 Sep, 2018

1 commit

  • Scrubbing pages on initial balloon down can take some time, especially
    in nested virtualization case (nested EPT is slow). When HVM/PVH guest is
    started with memory= significantly lower than maxmem=, all the extra
    pages will be scrubbed before returning to Xen. But since most of them
    weren't used at all at that point, Xen needs to populate them first
    (from populate-on-demand pool). In nested virt case (Xen inside KVM)
    this slows down the guest boot by 15-30s with just 1.5GB needed to be
    returned to Xen.

    Add runtime parameter to enable/disable it, to allow initially disabling
    scrubbing, then enable it back during boot (for example in initramfs).
    Such usage relies on assumption that a) most pages ballooned out during
    initial boot weren't used at all, and b) even if they were, very few
    secrets are in the guest at that time (before any serious userspace
    kicks in).
    Convert CONFIG_XEN_SCRUB_PAGES to CONFIG_XEN_SCRUB_PAGES_DEFAULT (also
    enabled by default), controlling default value for the new runtime
    switch.

    Signed-off-by: Marek Marczykowski-Górecki
    Reviewed-by: Juergen Gross
    Signed-off-by: Boris Ostrovsky

    Marek Marczykowski-Górecki
     

09 Sep, 2018

1 commit


02 Sep, 2018

1 commit

  • Instead of forcing a distro or other system builder to choose
    at build time whether the CPU is trusted for CRNG seeding via
    CONFIG_RANDOM_TRUST_CPU, provide a boot-time parameter for end users to
    control the choice. The CONFIG will set the default state instead.
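
    The relationship between the build-time default and the boot-time
    override can be pictured as follows. This is a sketch with
    hypothetical names; parsing of the actual parameter string is not
    modelled.

```python
def crng_trusts_cpu(config_default, boot_param=None):
    """Return whether the CPU is trusted for CRNG seeding.

    config_default: the value built in via CONFIG_RANDOM_TRUST_CPU.
    boot_param:     True/False if the user passed the boot-time
                    parameter, None if it was not given.
    """
    if boot_param is None:
        # No override on the command line: the CONFIG sets the default.
        return config_default
    return boot_param
```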

    Signed-off-by: Kees Cook
    Signed-off-by: Theodore Ts'o

    Kees Cook
     

25 Aug, 2018

1 commit

  • Pull IOMMU updates from Joerg Roedel:

    - PASID table handling updates for the Intel VT-d driver. It implements
    a global PASID space now so that applications using multiple devices
    will just have one PASID.

    - A new config option to make iommu passthrough mode the default.

    - New sysfs attribute for iommu groups to export the type of the
    default domain.

    - A debugfs interface (for debug only) usable by IOMMU drivers to
    export internals to user-space.

    - R-Car Gen3 SoCs support for the ipmmu-vmsa driver

    - The ARM-SMMU now aborts transactions from unknown devices and devices
    not attached to any domain.

    - Various cleanups and smaller fixes all over the place.

    * tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (42 commits)
    iommu/omap: Fix cache flushes on L2 table entries
    iommu: Remove the ->map_sg indirection
    iommu/arm-smmu-v3: Abort all transactions if SMMU is enabled in kdump kernel
    iommu/arm-smmu-v3: Prevent any devices access to memory without registration
    iommu/ipmmu-vmsa: Don't register as BUS IOMMU if machine doesn't have IPMMU-VMSA
    iommu/ipmmu-vmsa: Clarify supported platforms
    iommu/ipmmu-vmsa: Fix allocation in atomic context
    iommu: Add config option to set passthrough as default
    iommu: Add sysfs attribyte for domain type
    iommu/arm-smmu-v3: sync the OVACKFLG to PRIQ consumer register
    iommu/arm-smmu: Error out only if not enough context interrupts
    iommu/io-pgtable-arm-v7s: Abort allocation when table address overflows the PTE
    iommu/io-pgtable-arm: Fix pgtable allocation in selftest
    iommu/vt-d: Remove the obsolete per iommu pasid tables
    iommu/vt-d: Apply per pci device pasid table in SVA
    iommu/vt-d: Allocate and free pasid table
    iommu/vt-d: Per PCI device pasid table interfaces
    iommu/vt-d: Add for_each_device_domain() helper
    iommu/vt-d: Move device_domain_info to header
    iommu/vt-d: Apply global PASID in SVA
    ...

    Linus Torvalds
     

23 Aug, 2018

2 commits

  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this
    problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill
    all outstanding processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for cgroup v2 memory controller:
    memory.oom.group. The knob determines whether the cgroup should be
    treated as an indivisible workload by the OOM killer. If set, all tasks
    belonging to the cgroup or to its descendants (if the memory cgroup is not
    a leaf cgroup) are killed together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or root)
    and look for the highest-level cgroup with memory.oom.group set.
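
    The traversal can be sketched with a toy cgroup tree. The Cgroup class
    and find_oom_group() are hypothetical names used for illustration; the
    sketch models only the upward walk, not the killing of tasks.

```python
class Cgroup:
    """Toy cgroup node: a name, a parent link and the oom_group knob."""

    def __init__(self, name, parent=None, oom_group=False):
        self.name = name
        self.parent = parent
        self.oom_group = oom_group


def find_oom_group(victim_cgroup, oom_domain=None):
    """Walk from the victim's cgroup up to oom_domain (or the root).

    Returns the highest-level cgroup on that path with oom_group set,
    or None if no cgroup on the path has it set.
    """
    found = None
    cg = victim_cgroup
    while cg is not None:
        if cg.oom_group:
            found = cg          # a later (higher) match replaces this one
        if cg is oom_domain:
            break               # do not look above the OOMing cgroup
        cg = cg.parent
    return found


# Example tree: root -> workload (group) -> job -> worker (group)
root = Cgroup("root")
workload = Cgroup("workload", root, oom_group=True)
job = Cgroup("job", workload)
worker = Cgroup("worker", job, oom_group=True)
```

    If the OOM is system-wide (domain is the root), the whole "workload"
    subtree is selected; if the OOM is confined to "job", only "worker" is.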

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
    be enabled explicitly. This updates the documentation for that and adds a
    note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
    While here, change description of CONFIG_PAGE_POISONING_ZERO too, as it's
    not "random" data, but rather the fixed debugging value that would be used
    when not zeroing. Additionally removes a stray "bool" in the Kconfig.

    Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
    Signed-off-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Laura Abbott
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Aug, 2018

2 commits

  • Pull driver core updates from Greg KH:
    "Here are all of the driver core and related patches for 4.19-rc1.

    Nothing huge here, just a number of small cleanups and the ability to
    now stop the deferred probing after init happens.

    All of these have been in linux-next for a while with only a merge
    issue reported"

    * tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
    base: core: Remove WARN_ON from link dependencies check
    drivers/base: stop new probing during shutdown
    drivers: core: Remove glue dirs from sysfs earlier
    driver core: remove unnecessary function extern declare
    sysfs.h: fix non-kernel-doc comment
    PM / Domains: Stop deferring probe at the end of initcall
    iommu: Remove IOMMU_OF_DECLARE
    iommu: Stop deferring probe at end of initcalls
    pinctrl: Support stopping deferred probe after initcalls
    dt-bindings: pinctrl: add a 'pinctrl-use-default' property
    driver core: allow stopping deferred probe after init
    driver core: add a debugfs entry to show deferred devices
    sysfs: Fix internal_create_group() for named group updates
    base: fix order of OF initialization
    linux/device.h: fix kernel-doc notation warning
    Documentation: update firmware loader fallback reference
    kobject: Replace strncpy with memcpy
    drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
    kernfs: Replace strncpy with memcpy
    device: Add #define dev_fmt similar to #define pr_fmt
    ...

    Linus Torvalds
     
  • Pull tty/serial driver updates from Greg KH:
    "Here is the big tty and serial driver pull request for 4.19-rc1.

    It's not all that big, just a number of small serial driver updates
    and fixes, along with some better vt handling for unicode characters
    for those using braille terminals.

    All of these patches have been in linux-next for a long time with no
    reported issues"

    * tag 'tty-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (73 commits)
    tty: serial: 8250: Revert NXP SC16C2552 workaround
    serial: 8250_exar: Read INT0 from slave device, too
    tty: rocket: Fix possible buffer overwrite on register_PCI
    serial: 8250_dw: Add ACPI support for uart on Broadcom SoC
    serial: 8250_dw: always set baud rate in dw8250_set_termios
    dt-bindings: serial: Add binding for uartlite
    tty: serial: uartlite: Add support for suspend and resume
    tty: serial: uartlite: Add clock adaptation
    tty: serial: uartlite: Add structure for private data
    serial: sh-sci: Improve support for separate TEI and DRI interrupts
    serial: sh-sci: Remove SCIx_RZ_SCIFA_REGTYPE
    serial: sh-sci: Allow for compressed SCIF address
    serial: sh-sci: Improve interrupts description
    serial: 8250: Use cached port name directly in messages
    serial: 8250_exar: Drop unused variable in pci_xr17v35x_setup()
    vt: drop unused struct vt_struct
    vt: avoid a VLA in the unicode screen scroll function
    vt: add /dev/vcsu* to devices.txt
    vt: coherence validation code for the unicode screen buffer
    vt: selection: take screen contents from uniscr if available
    ...

    Linus Torvalds
     

18 Aug, 2018

4 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - a few Y2038 fixes

    - ntfs fixes

    - arch/sh tweaks

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (111 commits)
    mm/hmm.c: remove unused variables align_start and align_end
    fs/userfaultfd.c: remove redundant pointer uwq
    mm, vmacache: hash addresses based on pmd
    mm/list_lru: introduce list_lru_shrink_walk_irq()
    mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
    mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
    mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
    mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
    mm/sparse: delete old sparse_init and enable new one
    mm/sparse: add new sparse_init_nid() and sparse_init()
    mm/sparse: move buffer init/fini to the common place
    mm/sparse: use the new sparse buffer functions in non-vmemmap
    mm/sparse: abstract sparse buffer allocations
    mm/hugetlb.c: don't zero 1GiB bootmem pages
    mm, page_alloc: double zone's batchsize
    mm/oom_kill.c: document oom_lock
    mm/hugetlb: remove gigantic page support for HIGHMEM
    mm, oom: remove sleep from under oom_lock
    kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
    mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
    ...

    Linus Torvalds
     
  • Add a flag which causes page-types to use the kernel's idle page
    tracking to mark pages idle. As the tool already prints the idle flag
    if set, subsequent runs will show which pages have been accessed since
    last run.
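
    For orientation, the kernel's idle page tracking documentation
    describes the bitmap as an array of 8-byte words in which the page at
    PFN i maps to bit i % 64 of word i / 64. The helper below only
    computes where a PFN lives in that layout; its name is hypothetical
    and it is not taken from the page-types tool.

```python
def idle_bitmap_position(pfn):
    """Return (byte_offset, bit_mask) for a PFN in the idle bitmap.

    byte_offset is the offset of the 8-byte word holding the page's bit;
    bit_mask selects that bit within the word.
    """
    word_index, bit = divmod(pfn, 64)
    return word_index * 8, 1 << bit
```

    Marking a page idle then amounts to writing a word with that bit set
    at the computed offset, and a later read shows whether the bit is
    still set.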

    [akpm@linux-foundation.org: simplify mark_page_idle()]
    [chansen3@cisco.com: reorganize mark_page_idle() logic, add docs]
    Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com
    Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.com
    Signed-off-by: Christian Hansen
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Hansen
     
  • Add a new flag that will read kpagecount for each PFN and print out the
    number of times the page is mapped along with the flags in the listing
    view.

    This information is useful in understanding and optimizing memory usage.
    Identifying pages which are not shared allows us to focus on adjusting
    the memory layout or access patterns for the sole owning process.
    Knowing the number of processes that share a page tells us how many
    other times we must make the same adjustments or how many processes to
    potentially disable.
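
    As background, the kernel's pagemap documentation describes kpagecount
    as one 64-bit count per page, indexed by PFN. A minimal sketch of
    locating and decoding one entry follows; the helper names are
    hypothetical and native byte order is an assumption.

```python
import struct

ENTRY_SIZE = 8  # one 64-bit map count per PFN


def kpagecount_offset(pfn):
    """Byte offset of the entry for a given PFN."""
    return pfn * ENTRY_SIZE


def decode_kpagecount(raw):
    """Decode one 8-byte entry into the number of times the page is mapped."""
    return struct.unpack("=Q", raw)[0]
```

    A reader would seek to kpagecount_offset(pfn), read ENTRY_SIZE bytes
    and pass them to decode_kpagecount().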

    Truncated sample output:

    voffset map-cnt offset len flags
    561a3591e 1 15fe8 1 ___U_lA____Ma_b___________________________
    561a3591f 1 2b103 1 ___U_lA____Ma_b___________________________
    561a36ca4 1 2cc78 1 ___U_lA____Ma_b___________________________
    7f588bb4e 14 2273c 1 __RU_lA____M______________________________

    [akpm@linux-foundation.org: coding-style fixes]
    [chansen3@cisco.com: add documentation, tweak whitespace]
    Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com
    Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.com
    Signed-off-by: Christian Hansen
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Hansen
     
  • Pull powerpc updates from Michael Ellerman:
    "Notable changes:

    - A fix for a bug in our page table fragment allocator, where a page
    table page could be freed and reallocated for something else while
    still in use, leading to memory corruption etc. The fix reuses
    pt_mm in struct page (x86 only) for a powerpc only refcount.

    - Fixes to our pkey support. Several are user-visible changes, but
    bring us in to line with x86 behaviour and/or fix outright bugs.
    Thanks to Florian Weimer for reporting many of these.

    - A series to improve the hvc driver & related OPAL console code,
    which have been seen to cause hardlockups at times. The hvc driver
    changes in particular have been in linux-next for ~month.

    - Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.

    - Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
    use anywhere other than as a paper weight.

    - An optimised memcmp implementation using Power7-or-later VMX
    instructions

    - Support for barrier_nospec on some NXP CPUs.

    - Support for flushing the count cache on context switch on some IBM
    CPUs (controlled by firmware), as a Spectre v2 mitigation.

    - A series to enhance the information we print on unhandled signals
    to bring it into line with other arches, including showing the
    offending VMA and dumping the instructions around the fault.

    Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
    Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
    Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
    Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
    Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
    Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
    Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
    Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
    Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
    Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
    Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
    Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
    Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
    Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
    Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
    Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
    Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
    Rao, zhong jiang"

    * tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
    powerpc/mm/book3s/radix: Add mapping statistics
    powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
    powerpc/mm/hash: Remove unnecessary do { } while(0) loop
    powerpc/64s: move machine check SLB flushing to mm/slb.c
    powerpc/powernv/idle: Fix build error
    powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
    powerpc/mm: remove warning about ‘type’ being set
    powerpc/32: Include setup.h header file to fix warnings
    powerpc: Move `path` variable inside DEBUG_PROM
    powerpc/powermac: Make some functions static
    powerpc/powermac: Remove variable x that's never read
    cxl: remove a dead branch
    powerpc/powermac: Add missing include of header pmac.h
    powerpc/kexec: Use common error handling code in setup_new_fdt()
    powerpc/xmon: Add address lookup for percpu symbols
    powerpc/mm: remove huge_pte_offset_and_shift() prototype
    powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
    powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
    powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
    powerpc/fadump: handle crash memory ranges array index overflow
    ...

    Linus Torvalds
     

17 Aug, 2018

1 commit

  • Pull pci updates from Bjorn Helgaas:

    - Decode AER errors with names similar to "lspci" (Tyler Baicar)

    - Expose AER statistics in sysfs (Rajat Jain)

    - Clear AER status bits selectively based on the type of recovery (Oza
    Pawandeep)

    - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
    Gagniuc)

    - Don't clear AER status bits if we're using the "Firmware-First"
    strategy where firmware owns the registers (Alexandru Gagniuc)

    - Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
    Shevchenko)

    - Remove unnecessary includes of (Bjorn Helgaas)

    - Defer DPC event handling to work queue (Keith Busch)

    - Use threaded IRQ for DPC bottom half (Keith Busch)

    - Print AER status while handling DPC events (Keith Busch)

    - Work around IDT switch ACS Source Validation erratum (James
    Puthukattukaran)

    - Emit diagnostics for all cases of PCIe Link downtraining (Links
    operating slower than they're capable of) (Alexandru Gagniuc)

    - Skip VFs when configuring Max Payload Size (Myron Stowe)

    - Reduce Root Port Max Payload Size if necessary when hot-adding a
    device below it (Myron Stowe)

    - Simplify SHPC existence/permission checks (Bjorn Helgaas)

    - Remove hotplug sample skeleton driver (Lukas Wunner)

    - Convert pciehp to threaded IRQ handling (Lukas Wunner)

    - Improve pciehp tolerance of missed events and initially unstable
    links (Lukas Wunner)

    - Clear spurious pciehp events on resume (Lukas Wunner)

    - Add pciehp runtime PM support, including for Thunderbolt controllers
    (Lukas Wunner)

    - Support interrupts from pciehp bridges in D3hot (Lukas Wunner)

    - Mark fall-through switch cases before enabling -Wimplicit-fallthrough
    (Gustavo A. R. Silva)

    - Move DMA-debug PCI init from arch code to PCI core (Christoph
    Hellwig)

    - Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
    supplied (Heiner Kallweit)

    - Unify PCI and DMA direction #defines (Shunyong Yang)

    - Add PCI_DEVICE_DATA() macro (Andy Shevchenko)

    - Check for VPD completion before checking for timeout (Bert Kenward)

    - Limit Netronome NFP5000 config space size to work around erratum
    (Jakub Kicinski)

    - Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)

    - Document ACPI description of PCI host bridges (Bjorn Helgaas)

    - Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
    peer-to-peer DMA support (we don't have the peer-to-peer support yet;
    this is just one piece) (Logan Gunthorpe)

    - Clean up devm_of_pci_get_host_bridge_resources() resource allocation
    (Jan Kiszka)

    - Fixup resizable BARs after suspend/resume (Christian König)

    - Make "pci=earlydump" generic (Sinan Kaya)

    - Fix ROM BAR access routines to stay in bounds and check for signature
    correctly (Rex Zhu)

    - Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)

    - Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)

    - To avoid bus errors, enable PASID only if entire path supports
    End-End TLP prefixes (Sinan Kaya)

    - Unify slot and bus reset functions and remove hotplug knowledge from
    callers (Sinan Kaya)

    - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
    fix guest reboot issues (Alex Williamson)

    - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
    Controller (Bjorn Helgaas)

    - Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)

    - Remove Aardvark outbound window configuration (Evan Wang)

    - Fix Aardvark bridge window sizing issue (Zachary Zhang)

    - Convert Aardvark to use pci_host_probe() to reduce code duplication
    (Thomas Petazzoni)

    - Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)

    - Add Cadence support for optional generic PHYs (Alan Douglas)

    - Add Cadence power management ops (Alan Douglas)

    - Remove redundant variable from Cadence driver (Colin Ian King)

    - Add Kirin MSI support (Xiaowei Song)

    - Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
    armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
    Guo)

    - Move link notification settings from DesignWare core to individual
    drivers (Gustavo Pimentel)

    - Add endpoint library MSI-X interfaces (Gustavo Pimentel)

    - Correct signature of endpoint library IRQ interfaces (Gustavo
    Pimentel)

    - Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)

    - Add endpoint library MSI-X test support (Gustavo Pimentel)

    - Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
    (Jia-Ju Bai)

    - Add more devices to Broadcom PAXC quirk (Ray Jui)

    - Work around corrupted Broadcom PAXC config space to enable SMMU and
    GICv3 ITS (Ray Jui)

    - Disable MSI parsing to work around broken Broadcom PAXC logic in some
    devices (Ray Jui)

    - Hide unconfigured functions to work around a Broadcom PAXC defect
    (Ray Jui)

    - Lower iproc log level to reduce console output during boot (Ray Jui)

    - Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)

    - Fix mobiveil missing include file (Lorenzo Pieralisi)

    - Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)

    - Fix mvebu I/O space remapping issues (Thomas Petazzoni)

    - Use generic pci_host_bridge in mvebu instead of ARM-specific API
    (Thomas Petazzoni)

    - Whitelist VMD devices with fast interrupt handlers to avoid sharing
    vectors with slow handlers (Keith Busch)

    * tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
    PCI/AER: Don't clear AER bits if error handling is Firmware-First
    PCI: Limit config space size for Netronome NFP5000
    PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
    PCI/VPD: Check for VPD access completion before checking for timeout
    PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
    PCI: Match Root Port's MPS to endpoint's MPSS as necessary
    PCI: Skip MPS logic for Virtual Functions (VFs)
    PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
    PCI: Check for PCIe Link downtraining
    PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
    PCI: Add device-specific ACS Redirect disable infrastructure
    PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
    PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
    PCI: Allow specifying devices using a base bus and path of devfns
    PCI: Make specifying PCI devices in kernel parameters reusable
    PCI: Hide ACS quirk declarations inside PCI core
    PCI: Delay after FLR of Intel DC P3700 NVMe
    PCI: Disable Samsung SM961/PM961 NVMe before FLR
    PCI: Export pcie_has_flr()
    PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
    ...

    Linus Torvalds
     

16 Aug, 2018

2 commits

  • Pull random updates from Ted Ts'o:
    "Some changes to trust cpu-based hwrng (such as RDRAND) for
    initializing hashed pointers and (optionally, controlled by a config
    option) to initialize the CRNG to avoid boot hangs"

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
    random: Make crng state queryable
    random: remove preempt disabled region
    random: add a config option to trust the CPU's hwrng
    vsprintf: Add command line option debug_boot_weak_hash
    vsprintf: Use hw RNG for ptr_key
    random: Return nbytes filled from hw RNG
    random: Fix whitespace pre random-bytes work

    Linus Torvalds
     
  • - Clean up devm_of_pci_get_host_bridge_resources() resource allocation
    (Jan Kiszka)

    - Fixup resizable BARs after suspend/resume (Christian König)

    - Make "pci=earlydump" generic (Sinan Kaya)

    - Fix ROM BAR access routines to stay in bounds and check for signature
    correctly (Rex Zhu)

    * pci/resource:
    PCI: Make pci_get_rom_size() static
    PCI: Add check code for last image indicator not set
    PCI: Avoid accessing memory outside the ROM BAR
    PCI: Make early dump functionality generic
    PCI: Cleanup PCI_REBAR_CTRL_BAR_SHIFT handling
    PCI: Restore resized BAR state on resume
    PCI: Clean up resource allocation in devm_of_pci_get_host_bridge_resources()

    # Conflicts:
    # Documentation/admin-guide/kernel-parameters.txt

    Bjorn Helgaas
     

15 Aug, 2018

4 commits

  • Pull hardened usercopy updates from Kees Cook:
    "This cleans up a minor Kconfig issue and adds a kernel boot option for
    disabling hardened usercopy for distro users that may have corner-case
    performance issues (e.g. high bandwidth small-packet UDP traffic).

    Summary:

    - drop unneeded Kconfig "select BUG" (Kamal Mostafa)

    - add "hardened_usercopy=off" boot option for rare performance needs
    (Chris von Recklinghausen)"

    * tag 'hardened-usercopy-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    usercopy: Allow boot cmdline disabling of hardening
    usercopy: Do not select BUG with HARDENED_USERCOPY

    Linus Torvalds
     
  • Pull documentation update from Jonathan Corbet:
    "This was a moderately busy cycle for docs, with the usual collection
    of small fixes and updates.

    We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
    a new set of Italian translations (I don't know if it's worth it, but
    it doesn't hurt - let's hope for the best), and some extensive early
    memory-management documentation improvements from Mike Rapoport"

    * tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
    Documentation: corrections to console/console.txt
    Documentation: add ioctl number entry for v4l2-subdev.h
    Remove gendered language from management style documentation
    scripts/kernel-doc: Escape all literal braces in regexes
    docs/mm: add description of boot time memory management
    docs/mm: memblock: add overview documentation
    docs/mm: memblock: add kernel-doc description for memblock types
    docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
    docs/mm: memblock: update kernel-doc comments
    mm/memblock: add a name for memblock flags enumeration
    docs/mm: bootmem: add overview documentation
    docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
    docs/mm: bootmem: fix kernel-doc warnings
    docs/mm: nobootmem: fixup kernel-doc comments
    mm/bootmem: drop duplicated kernel-doc comments
    Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
    doc:it_IT: translation for kernel-hacking
    docs: Fix the reference labels in Locking.rst
    doc: tracing: Fix a typo of trace_stat
    mm: Introduce new type vm_fault_t
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     
  • Merge L1 Terminal Fault fixes from Thomas Gleixner:
    "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
    engineering trainwreck. It's a hardware vulnerability which allows
    unprivileged speculative access to data which is available in the
    Level 1 Data Cache when the page table entry controlling the virtual
    address, which is used for the access, has the Present bit cleared or
    other reserved bits set.

    If an instruction accesses a virtual address for which the relevant
    page table entry (PTE) has the Present bit cleared or other reserved
    bits set, then speculative execution ignores the invalid PTE and loads
    the referenced data if it is present in the Level 1 Data Cache, as if
    the page referenced by the address bits in the PTE was still present
    and accessible.

    While this is a purely speculative mechanism and the instruction will
    raise a page fault when it is retired eventually, the pure act of
    loading the data and making it available to other speculative
    instructions opens up the opportunity for side channel attacks to
    unprivileged malicious code, similar to the Meltdown attack.

    While Meltdown breaks the user space to kernel space protection, L1TF
    allows attacks on any physical memory address in the system, and the
    attack works across all protection domains. It allows attacks on SGX
    and also works from inside virtual machines because the speculation
    bypasses the extended page table (EPT) protection mechanism.

    The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

    The mitigations provided by this pull request include:

    - Host side protection by inverting the upper address bits of a non
    present page table entry so the entry points to uncacheable memory.

    - Hypervisor protection by flushing L1 Data Cache on VMENTER.

    - SMT (HyperThreading) control knobs, which allow SMT to be 'turned
    off' by offlining the sibling CPU threads. The knobs are available on
    the kernel command line and at runtime via sysfs.

    - Control knobs for the hypervisor mitigation, related to L1D flush
    and SMT control. The knobs are available on the kernel command line
    and at runtime via sysfs.

    - Extensive documentation about L1TF including various degrees of
    mitigations.

    Thanks to all people who have contributed to this in various ways -
    patches, review, testing, backporting - and the fruitful, sometimes
    heated, but at the end constructive discussions.

    There is work in progress to provide other forms of mitigations, which
    might be less horrible performance wise for a particular kind of
    workloads, but this is not yet ready for consumption due to their
    complexity and limitations"
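
    As an illustration of the host-side mitigation listed above (inverting
    the address bits of a non-present page table entry), here is a minimal
    sketch. It is not the kernel's implementation: the 52-bit physical
    address width, the 4 KiB page size, and the helper names make_pte() and
    pte_phys_addr() are assumptions chosen for the example.

```python
PAGE_SHIFT = 12                 # assumed 4 KiB pages
PHYS_BITS = 52                  # assumed maximum physical address width
PRESENT = 1 << 0                # Present bit of the entry

# Mask covering the physical-address bits of an entry (bits 12..51 here).
PFN_MASK = ((1 << PHYS_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)


def make_pte(phys_addr, flags):
    """Build an entry; invert the address bits when it is not present."""
    pte = (phys_addr & PFN_MASK) | flags
    if not flags & PRESENT:
        # A non-present entry now names a very high physical address,
        # which is unlikely to be backed by cacheable memory.
        pte ^= PFN_MASK
    return pte


def pte_phys_addr(pte):
    """Recover the original physical address, undoing the inversion."""
    addr = pte & PFN_MASK
    if not pte & PRESENT:
        addr ^= PFN_MASK
    return addr


if __name__ == "__main__":
    pte = make_pte(0x1234000, 0)
    print(hex(pte & PFN_MASK), hex(pte_phys_addr(pte)))
```

    Because the inversion is its own inverse, software can still read back
    the stored address, while a speculative load that ignores the Present
    bit sees only the inverted one.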

    * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    x86/microcode: Allow late microcode loading with SMT disabled
    tools headers: Synchronise x86 cpufeatures.h for L1TF additions
    x86/mm/kmmio: Make the tracer robust against L1TF
    x86/mm/pat: Make set_memory_np() L1TF safe
    x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
    x86/speculation/l1tf: Invert all not present mappings
    cpu/hotplug: Fix SMT supported evaluation
    KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
    x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
    x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
    Documentation/l1tf: Remove Yonah processors from not vulnerable list
    x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
    x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
    x86: Don't include linux/irq.h from asm/hardirq.h
    x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
    x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
    x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
    x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
    x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
    cpu/hotplug: detect SMT disabled by BIOS
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull x86 timer updates from Thomas Gleixner:
    "Early TSC based time stamping to allow better boot time analysis.

    This comes with a general cleanup of the TSC calibration code which
    grew warts and duct taping over the years and removes 250 lines of
    code. Initiated and mostly implemented by Pavel with help from various
    folks"

    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/kvmclock: Mark kvm_get_preset_lpj() as __init
    x86/tsc: Consolidate init code
    sched/clock: Disable interrupts when calling generic_sched_clock_init()
    timekeeping: Prevent false warning when persistent clock is not available
    sched/clock: Close a hole in sched_clock_init()
    x86/tsc: Make use of tsc_calibrate_cpu_early()
    x86/tsc: Split native_calibrate_cpu() into early and late parts
    sched/clock: Use static key for sched_clock_running
    sched/clock: Enable sched clock early
    sched/clock: Move sched clock initialization and merge with generic clock
    x86/tsc: Use TSC as sched clock early
    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    x86/tsc: Calibrate tsc only once
    ARM/time: Remove read_boot_clock64()
    s390/time: Remove read_boot_clock64()
    timekeeping: Default boot time offset to local_clock()
    timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
    s390/time: Add read_persistent_wall_and_boot_offset()
    x86/xen/time: Output xen sched_clock time from 0
    x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
    ...

    Linus Torvalds
     

10 Aug, 2018

3 commits

  • To support peer-to-peer traffic on a segment of the PCI hierarchy, we must
    disable the ACS redirect bits for select PCI bridges. The bridges must be
    selected before the devices are discovered by the kernel and the IOMMU
    groups created. Therefore, add a kernel command line parameter to specify
    devices which must have their ACS bits disabled.

    The new parameter takes a list of devices separated by semicolons. Each
    device specified will have its ACS redirect bits disabled. This is
    similar to the existing 'resource_alignment' parameter.

    The ACS P2P Request Redirect, P2P Completion Redirect and P2P Egress
    Control bits are disabled, which is sufficient to always allow passing
    P2P traffic uninterrupted. The bits are cleared after the kernel
    (optionally) enables the ACS bits itself. This is done regardless of
    whether the kernel or platform firmware sets the bits.

    If the user tries to disable the ACS redirect for a device without the ACS
    capability, print a warning to dmesg.
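
    The bit manipulation described above can be sketched as follows. This is
    an illustration only, not the patch itself: the bit values follow the
    ACS Control register layout as defined in the kernel's pci_regs.h, while
    split_device_list() and disable_acs_redirect() are hypothetical helpers
    named for this example.

```python
from typing import List

# ACS Control register bits (values as in the kernel's pci_regs.h).
PCI_ACS_SV = 0x01  # Source Validation
PCI_ACS_TB = 0x02  # Translation Blocking
PCI_ACS_RR = 0x04  # P2P Request Redirect
PCI_ACS_CR = 0x08  # P2P Completion Redirect
PCI_ACS_UF = 0x10  # Upstream Forwarding
PCI_ACS_EC = 0x20  # P2P Egress Control

# The three bits the commit message says are disabled.
REDIRECT_BITS = PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC


def split_device_list(param: str) -> List[str]:
    """Split a semicolon-separated parameter value into device specifiers."""
    return [item.strip() for item in param.split(";") if item.strip()]


def disable_acs_redirect(ctrl: int) -> int:
    """Return the 16-bit ACS control value with the redirect bits cleared."""
    return ctrl & ~REDIRECT_BITS & 0xFFFF


if __name__ == "__main__":
    ctrl = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF
    print(hex(ctrl), "->", hex(disable_acs_redirect(ctrl)))
```

    Only the three redirect-related bits are touched; Source Validation and
    Upstream Forwarding keep whatever value the kernel or firmware chose.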

    Signed-off-by: Logan Gunthorpe
    [bhelgaas: reorder to add the generic code first and move the
    device-specific quirk to subsequent patches]
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Stephen Bates
    Reviewed-by: Alex Williamson
    Acked-by: Christian König

    Logan Gunthorpe
     
  • When specifying PCI devices on the kernel command line using a
    bus/device/function address, bus numbers can change when adding or
    replacing a device, changing motherboard firmware, or applying kernel
    parameters like "pci=assign-buses". When bus numbers change, it's likely
    the command line tweak will be applied to the wrong device.

    Therefore, it is useful to be able to specify devices with a base bus
    number and the path of devfns needed to get to it, similar to the "device
    scope" structure in the Intel VT-d spec, Section 8.3.1.

    Thus, we add an option to specify devices in the following format:

    [<domain>:]<bus>:<device>.<func>[/<device>.<func>]*

    The path can be any segment within the PCI hierarchy of any length and
    determined through the use of 'lspci -t'. When specified this way, it is
    less likely that a renumbered bus will result in a valid device
    specification and the tweak won't be applied to the wrong device.
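
    A small parser makes the path notation concrete. This sketch assumes the
    format is [<domain>:]<bus>:<device>.<func>[/<device>.<func>]* with
    hexadecimal fields; the DevicePath type and parse_device_path() are
    hypothetical names for this example and are not the kernel's code.

```python
import re
from typing import List, NamedTuple, Tuple

# First element: optional domain, then bus, device and function.
_HEAD = re.compile(
    r"^(?:([0-9a-fA-F]{1,4}):)?([0-9a-fA-F]{1,2}):([0-9a-fA-F]{1,2})\.([0-7])$"
)
# Each further path element is just device.function.
_HOP = re.compile(r"^([0-9a-fA-F]{1,2})\.([0-7])$")


class DevicePath(NamedTuple):
    domain: int
    bus: int
    devfns: List[Tuple[int, int]]  # (device, function) pairs, root first


def _devfn(dev: str, func: str) -> Tuple[int, int]:
    device = int(dev, 16)
    if device > 0x1F:  # a PCI device number is 5 bits wide
        raise ValueError("device number out of range: %s" % dev)
    return device, int(func, 16)


def parse_device_path(spec: str) -> DevicePath:
    """Parse a base bus plus a path of devfns into its components."""
    head, *hops = spec.split("/")
    m = _HEAD.match(head)
    if m is None:
        raise ValueError("bad device specifier: %r" % spec)
    domain = int(m.group(1) or "0", 16)
    bus = int(m.group(2), 16)
    devfns = [_devfn(m.group(3), m.group(4))]
    for hop in hops:
        h = _HOP.match(hop)
        if h is None:
            raise ValueError("bad path element: %r" % hop)
        devfns.append(_devfn(h.group(1), h.group(2)))
    return DevicePath(domain, bus, devfns)


if __name__ == "__main__":
    print(parse_device_path("0000:00:1c.0/00.0"))
```

    Only the first element carries a bus number; every later element is
    resolved relative to the bridge before it, which is why renumbering a
    downstream bus does not change the specifier.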

    Signed-off-by: Logan Gunthorpe
    [bhelgaas: use "device" instead of "slot" in documentation since that's the
    usual language in the PCI specs]
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Stephen Bates
    Reviewed-by: Alex Williamson
    Acked-by: Christian König

    Logan Gunthorpe
     
  • Separate out the code to match a PCI device with a string (typically
    originating from a kernel parameter) from the
    pci_specified_resource_alignment() function into its own helper function.

    While we are at it, this change fixes the kernel style of the function
    (fixing a number of long lines and extra parentheses).

    Additionally, make the analogous change to the kernel parameter
    documentation: Separate the description of how to specify a PCI device
    into its own section at the head of the "pci=" parameter.

    This patch should have no functional alterations.

    Signed-off-by: Logan Gunthorpe
    [bhelgaas: use "device" instead of "slot" in documentation since that's the
    usual language in the PCI specs]
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Stephen Bates
    Reviewed-by: Alex Williamson
    Acked-by: Christian König

    Logan Gunthorpe
     

08 Aug, 2018

1 commit


07 Aug, 2018

1 commit


06 Aug, 2018

1 commit


05 Aug, 2018

3 commits


02 Aug, 2018

1 commit

  • Currently, avg_lat is calculated by accumulating the mean of every
    window in a long running cumulative average. As time goes on, the metric
    becomes less and less useful due to the accumulated history.

    This patch reuses the same calculation done in load averages to make the
    avg_lat metric more lively. Unlike load averages, the avg only advances
    when a window elapses (due to an io). Idle periods extend the most
    recent window. Bucketing is used to limit the history of avg_lat by
    binding it to the window size. So, the window range for 1/exp (decay
    rate) is [1 min, 2.5 min) when windows elapse immediately.

    The current sample window size is exposed in the debug info to enable
    calculation of the window range.
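
    The idea can be sketched with the fixed-point arithmetic used by the
    load-average code. The constants FSHIFT and EXP_1 below are the ones from
    the kernel's load-average calculation; the WindowedAvg class, its method
    names, and the use of a single decay factor are assumptions made for this
    example, and the patch's per-window-size factors are not reproduced here.

```python
FSHIFT = 11                # bits of fixed-point precision
FIXED_1 = 1 << FSHIFT      # 1.0 in fixed point
EXP_1 = 1884               # load-average decay factor, 1/exp(5s/1min)


def decay_avg(avg, sample, exp=EXP_1):
    """Blend a new sample into the running average with decay factor exp."""
    return (avg * exp + sample * (FIXED_1 - exp)) >> FSHIFT


class WindowedAvg:
    """Average that advances only when an I/O arrives after a full window."""

    def __init__(self, window_ns):
        self.window_ns = window_ns
        self.window_start = 0
        self.avg = 0

    def on_io(self, now_ns, window_mean):
        """Fold in the window's mean latency if the window has elapsed.

        Nothing runs during idle time, so an idle period simply stretches
        the current window until the next I/O shows up.
        """
        if now_ns - self.window_start < self.window_ns:
            return False
        self.avg = decay_avg(self.avg, window_mean)
        self.window_start = now_ns
        return True


if __name__ == "__main__":
    w = WindowedAvg(window_ns=100)
    for t in (100, 200, 300):
        w.on_io(t, 1000)
        print(t, w.avg)
```

    Unlike a cumulative mean, each update multiplies the old value by a
    factor below one, so old windows fade out instead of dominating.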

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)