29 Oct, 2018

1 commit

  • The Interactive governor has lived in the Android sources for a very long
    time, and this commit is based on the code present in the following branch:

    https://android.googlesource.com/kernel/common android-4.4

    The Interactive governor is designed for latency-sensitive workloads,
    such as the interactive user interfaces of mobile phones and tablets.
    It aims to be significantly more responsive than existing governors by
    ramping the CPU up quickly when CPU-intensive activity begins.

    Existing governors sample CPU load at a particular rate, typically every
    X ms and then update the frequency from a work-handler. This can lead
    to under-powering UI threads for the period of time during which the
    user begins interacting with a previously-idle system until the next
    sample period happens.

    The 'interactive' governor uses a different approach.

    A real-time thread is used for scaling up, giving the remaining tasks
    the CPU performance benefit, unlike existing governors which are more
    likely to schedule ramp-up work to occur after your performance-starved
    tasks have completed.

    The Android version of the interactive governor also checks whether to
    scale the CPU frequency up soon after coming out of idle. When the CPU
    comes out of idle, the governor checks whether CPU sampling is overdue.
    If it is, sampling starts immediately. Otherwise, the utilization hooks
    from the scheduler handle the sampling later. If the CPU is very busy
    from exiting idle to when the evaluation happens, the governor assumes
    that the CPU is under-powered and ramps it to MAX speed.

    If the CPU was not sufficiently busy to immediately ramp to MAX speed,
    then the governor evaluates the CPU load since the last speed
    adjustment and uses the higher of that longer-term load and the
    short-term load since idle exit to determine the CPU speed to ramp to.
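
    The ramp-up decision described above can be sketched as follows. This is
    a minimal illustration, not the governor's code: the function name, the
    85% "very busy" threshold and the linear load-to-frequency mapping are
    all assumptions made for the example.

```python
# Illustrative model of the ramp-up decision. The names and the threshold
# are assumptions for this sketch, not values from the Android sources.
GO_MAXSPEED_LOAD = 85  # hypothetical "very busy" threshold, in percent


def choose_target_freq(short_term_load, long_term_load, max_freq):
    """Return a target frequency in the same unit as max_freq.

    short_term_load: CPU load (%) since idle exit
    long_term_load:  CPU load (%) since the last speed adjustment
    """
    # Very busy since leaving idle: assume the CPU is under-powered.
    if short_term_load >= GO_MAXSPEED_LOAD:
        return max_freq
    # Otherwise scale to the higher of the two load figures.
    load = max(short_term_load, long_term_load)
    return max_freq * load // 100
```

    With these assumptions, a CPU that was 90% busy since idle exit jumps
    straight to max_freq, while one that was 20% busy since idle exit but
    50% busy since the last adjustment is scaled for the 50% figure.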

    Idle notifiers will be handled later and are not included for now.

    The core of this code was written and has been maintained (in Android
    repositories) by Mike Chan and Todd Poynor over a long period of time.

    Vireshk has made changes to the governor to align it with the current
    practices followed by mainline governors, like using utilization hooks
    from the scheduler and handling the kobject (for the governor's sysfs
    directory) in a race-free manner. And of course this included general
    cleanup of the governor as well.

    Signed-off-by: Mike Chan
    Signed-off-by: Todd Poynor
    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

16 Aug, 2018

9 commits

  • commit 5b76a3cff011df2dcb6186c965a2e4d809a05ad4 upstream

    When nested virtualization is in use, VMENTER operations from the nested
    hypervisor into the nested guest will always be processed by the bare metal
    hypervisor, and KVM's "conditional cache flushes" mode in particular does a
    flush on nested vmentry. Therefore, include the "skip L1D flush on
    vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.

    Add the relevant Documentation.

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 58331136136935c631c2b5f06daf4c3006416e91 upstream

    Dave reported that it's not confirmed that Yonah processors are
    unaffected. Remove them from the list.

    Reported-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 1949f9f49792d65dba2090edddbe36a5f02e3ba3 upstream

    Fix spelling and other typos

    Signed-off-by: Tony Luck
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit 3ec8ce5d866ec6a08a9cfab82b62acf4a830b35f upstream

    Add documentation for the L1TF vulnerability and the mitigation mechanisms:

    - Explain the problem and risks
    - Document the mitigation mechanisms
    - Document the command line controls
    - Document the sysfs files

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Josh Poimboeuf
    Acked-by: Linus Torvalds
    Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit d90a7a0ec83fb86622cd7dae23255d3c50a99ec8 upstream

    Introduce the 'l1tf=' kernel command line option to allow for boot-time
    switching of mitigation that is used on processors affected by L1TF.

    The possible values are:

    full
        Provides all available mitigations for the L1TF vulnerability.
        Disables SMT and enables all mitigations in the hypervisors. SMT
        control via /sys/devices/system/cpu/smt/control is still possible
        after boot. Hypervisors will issue a warning when the first VM is
        started in a potentially insecure configuration, i.e. SMT enabled
        or L1D flush disabled.

    full,force
        Same as 'full', but disables SMT control. Implies the 'nosmt=force'
        command line option. sysfs control of SMT and the hypervisor flush
        control is disabled.

    flush
        Leaves SMT enabled and enables the conditional hypervisor
        mitigation. Hypervisors will issue a warning when the first VM is
        started in a potentially insecure configuration, i.e. SMT enabled
        or L1D flush disabled.

    flush,nosmt
        Disables SMT and enables the conditional hypervisor mitigation. SMT
        control via /sys/devices/system/cpu/smt/control is still possible
        after boot. If SMT is reenabled or flushing disabled at runtime
        hypervisors will issue a warning.

    flush,nowarn
        Same as 'flush', but hypervisors will not warn when
        a VM is started in a potentially insecure configuration.

    off
        Disables hypervisor mitigations and doesn't emit any warnings.

    Default is 'flush'.

    Let KVM adhere to these semantics, which means:

    - 'l1tf=full,force'  : Perform L1D flushes. No runtime control
                           possible.

    - 'l1tf=full'
    - 'l1tf=flush'
    - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
                           SMT has been runtime enabled or L1D flushing
                           has been runtime disabled.

    - 'l1tf=flush,nowarn': Perform L1D flushes and no warnings are emitted.

    - 'l1tf=off'         : L1D flushes are not performed and no warnings
                           are emitted.

    KVM can always override the L1D flushing behavior using its
    'vmentry_l1d_flush' module parameter except when l1tf=full,force is set.
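
    As a reading aid, the option values above can be laid out as a lookup
    table. This is only a sketch: the type, the field names and the error
    handling are assumptions made for the example, and where the text is
    silent about runtime control (for instance for 'off') the table assumes
    it stays available.

```python
# Illustrative lookup table for the 'l1tf=' values listed above.
# Type and field names are assumptions made for this sketch.
from typing import NamedTuple, Optional


class L1tfPolicy(NamedTuple):
    smt_disabled: bool      # SMT is turned off at boot
    runtime_control: bool   # SMT / flush settings can still be changed later
    hypervisor: str         # "all", "conditional" or "disabled"
    warn: bool              # hypervisors warn about insecure configurations


POLICIES = {
    "full":         L1tfPolicy(True,  True,  "all",         True),
    "full,force":   L1tfPolicy(True,  False, "all",         True),
    "flush":        L1tfPolicy(False, True,  "conditional", True),
    "flush,nosmt":  L1tfPolicy(True,  True,  "conditional", True),
    "flush,nowarn": L1tfPolicy(False, True,  "conditional", False),
    "off":          L1tfPolicy(False, True,  "disabled",    False),
}


def parse_l1tf(value: Optional[str] = None) -> L1tfPolicy:
    """Map an 'l1tf=' value to its policy; no value means the default."""
    if value is None:
        value = "flush"  # the commit message states: Default is 'flush'
    try:
        return POLICIES[value]
    except KeyError:
        # Rejecting unknown values is a choice made for this sketch only.
        raise ValueError(f"unknown l1tf value: {value!r}") from None
```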

    This makes KVM's private 'nosmt' option redundant, and as it is a bit
    non-systematic anyway (this is something to control globally, not on
    hypervisor level), remove that option.

    Add the missing Documentation entry for the l1tf vulnerability sysfs file
    while at it.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Tested-by: Jiri Kosina
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Josh Poimboeuf
    Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     
  • commit a399477e52c17e148746d3ce9a483f681c2aa9a0 upstream

    Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
    L1 terminal fault. The valid arguments are:

    - "always" L1D cache flush on every VMENTER.
    - "cond" Conditional L1D cache flush, explained below
    - "never" Disable the L1D cache flush mitigation

    "cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
    between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
    interesting information into L1D which might be exploited.
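
    The three modes reduce to a small decision at each VMENTER. In the
    sketch below the function name and the ran_unsafe_code flag are
    hypothetical; the flag stands for "code that is not considered safe ran
    since the last VMEXIT", however the hypervisor tracks that.

```python
def should_flush_l1d(mode, ran_unsafe_code):
    """Decide whether to flush the L1D cache before a VMENTER.

    mode:            "always", "cond" or "never"
    ran_unsafe_code: hypothetical flag, True if code not considered safe
                     ran between the last VMEXIT and this VMENTER
    """
    if mode == "always":
        return True
    if mode == "cond":
        # Skip the flush only when nothing interesting can be in L1D.
        return ran_unsafe_code
    if mode == "never":
        return False
    raise ValueError(f"unknown vmentry_l1d_flush mode: {mode!r}")
```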

    [ tglx: Split out from a larger patch ]

    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Konrad Rzeszutek Wilk
     
  • commit 26acfb666a473d960f0fd971fe68f3e3ad16c70b upstream

    If the L1TF CPU bug is present we allow the KVM module to be loaded as the
    majority of users that use Linux and KVM have trusted guests and do not
    want a broken setup.

    Cloud vendors are the ones that are uncomfortable with CVE-2018-3620 and as
    such they are the ones that should set nosmt to one.

    Setting 'nosmt' means that the system administrator also needs to disable
    SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
    parameter, or via the /sys/devices/system/cpu/smt/control. See commit
    05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT").

    Other mitigations are to use task affinity, cpu sets, interrupt binding,
    etc - anything to make sure that _only_ the same guest's vCPUs are running
    on sibling threads.

    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Konrad Rzeszutek Wilk
     
  • commit 506a66f374891ff08e064a058c446b336c5ac760 upstream

    Dave Hansen reported that it's outright dangerous to keep SMT siblings
    disabled completely so they are stuck in the BIOS and wait for SIPI.

    The reason is that Machine Check Exceptions are broadcast to siblings and
    the soft disabled sibling has CR4.MCE = 0. If an MCE is delivered to a
    logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
    reboots the machine. The MCE chapter in the SDM contains the following
    blurb:

    Because the logical processors within a physical package are tightly
    coupled with respect to shared hardware resources, both logical
    processors are notified of machine check errors that occur within a
    given physical processor. If machine-check exceptions are enabled when
    a fatal error is reported, all the logical processors within a physical
    package are dispatched to the machine-check exception handler. If
    machine-check exceptions are disabled, the logical processors enter the
    shutdown state and assert the IERR# signal. When enabling machine-check
    exceptions, the MCE flag in control register CR4 should be set for each
    logical processor.

    Reverting the commit which ignores siblings at enumeration time solves only
    half of the problem. The core cpuhotplug logic needs to be adjusted as
    well.

    This thoughtfully engineered mechanism also turns the boot process on all
    Intel HT enabled systems into an MCE lottery. MCE is enabled on the boot CPU
    before the secondary CPUs are brought up. Depending on the number of
    physical cores the window in which this situation can happen is smaller or
    larger. On a HSW-EX it's about 750ms:

    MCE is enabled on the boot CPU:

    [ 0.244017] mce: CPU supports 22 MCE banks

    The corresponding sibling #72 boots:

    [ 1.008005] .... node #0, CPUs: #72

    That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
    between these two points the machine is going to shut down. At least it's a
    known safe state.
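
    The size of that window follows directly from the two log timestamps
    quoted above. A small sketch of the arithmetic (the helper name is an
    assumption for the example):

```python
import re


def dmesg_timestamp(line):
    """Extract the seconds-since-boot timestamp from a kernel log line."""
    match = re.match(r"\[\s*(\d+\.\d+)\]", line)
    if match is None:
        raise ValueError(f"no timestamp in: {line!r}")
    return float(match.group(1))


# The two log lines quoted in the commit message.
mce_enabled = "[ 0.244017] mce: CPU supports 22 MCE banks"
sibling_up = "[ 1.008005] .... node #0, CPUs: #72"

# Time during which the boot CPU has MCE enabled but its sibling does not.
window_ms = (dmesg_timestamp(sibling_up) - dmesg_timestamp(mce_enabled)) * 1000
print(f"{window_ms:.0f} ms")
```

    This gives roughly 764 ms, in line with the "about 750ms" stated above.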

    It's obvious that the early boot can be hit by an MCE as well and then runs
    into the same situation because MCEs are not yet enabled on the boot CPU.
    But after enabling them on the boot CPU, it does not make any sense to
    prevent the kernel from recovering.

    Adjust the nosmt kernel parameter documentation as well.

    Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force")
    Reported-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Tested-by: Tony Luck
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57 upstream

    Provide a command line and a sysfs knob to control SMT.

    The command line options are:

    'nosmt': Enumerate secondary threads, but do not online them

    'nosmt=force': Ignore secondary threads completely during enumeration
    via MP table and ACPI/MADT.

    The sysfs control file has the following states (read/write):

    'on':           SMT is enabled. Secondary threads can be freely onlined
    'off':          SMT is disabled. Secondary threads, even if enumerated,
                    cannot be onlined
    'forceoff':     SMT is permanently disabled. Writes to the control
                    file are rejected.
    'notsupported': SMT is not supported by the CPU

    The command line option 'nosmt' sets the sysfs control to 'off'. This
    can be changed to 'on' to reenable SMT during runtime.

    The command line option 'nosmt=force' sets the sysfs control to
    'forceoff'. This cannot be changed during runtime.

    When SMT is 'on' and the control file is changed to 'off' then all online
    secondary threads are offlined and attempts to online a secondary thread
    later on are rejected.

    When SMT is 'off' and the control file is changed to 'on' then secondary
    threads can be onlined again. The 'off' -> 'on' transition does not
    automatically online the secondary threads.

    When the control file is set to 'forceoff', the behaviour is the same as
    setting it to 'off', but the operation is irreversible and later writes to
    the control file are rejected.

    When the control status is 'notsupported' then writes to the control file
    are rejected.
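
    The rules above form a small state machine, which the following toy
    model tries to capture. It is a sketch only: the class and method names
    are made up, the exception types stand in for whatever error the kernel
    returns, and it assumes all secondary threads are online when SMT is
    'on' at boot.

```python
class SmtControl:
    """Toy model of the SMT control file semantics described above."""

    def __init__(self, secondary_threads, cmdline=None, supported=True):
        self.secondary = set(secondary_threads)
        if not supported:
            self.state = "notsupported"
        elif cmdline == "nosmt=force":
            self.state = "forceoff"
        elif cmdline == "nosmt":
            self.state = "off"
        else:
            self.state = "on"
        # Assumption: with SMT 'on' at boot, secondary threads are online.
        self.online = set(self.secondary) if self.state == "on" else set()

    def write(self, value):
        """Model a write to the control file."""
        if self.state in ("forceoff", "notsupported"):
            raise PermissionError(f"control is '{self.state}', write rejected")
        if value not in ("on", "off", "forceoff"):
            raise ValueError(f"invalid value: {value!r}")
        if value in ("off", "forceoff"):
            # All online secondary threads are offlined.
            self.online.clear()
        # Note: 'off' -> 'on' does not online any thread automatically.
        self.state = value

    def online_thread(self, cpu):
        """Model an attempt to online one secondary thread."""
        if cpu not in self.secondary:
            raise ValueError(f"CPU {cpu} is not a secondary thread")
        if self.state != "on":
            raise PermissionError(f"SMT is '{self.state}', cannot online CPU {cpu}")
        self.online.add(cpu)
```

    For example, after SmtControl([1, 3]).write("off") both threads are
    offline, and a later write("on") leaves them offline until
    online_thread() is called for each of them.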

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

22 Jul, 2018

1 commit

  • commit a43ae4dfe56a01f5b98ba0cb2f784b6a43bafcc6 upstream.

    On a system where the firmware implements ARCH_WORKAROUND_2,
    it may be useful to either permanently enable or disable the
    workaround for cases where the user decides that they'd rather
    not get a trap overhead, and keep the mitigation permanently
    on or off instead of switching it on exception entry/exit.

    In any case, default to the mitigation being enabled.

    Reviewed-by: Julien Grall
    Reviewed-by: Mark Rutland
    Acked-by: Will Deacon
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

23 May, 2018

3 commits

  • commit f21b53b20c754021935ea43364dbf53778eeba32 upstream

    Unless explicitly opted out of, anything running under seccomp will have
    SSB mitigations enabled. Choosing the "prctl" mode will disable this.

    [ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit a73ec77ee17ec556fe7f165d00314cb7c047b1ac upstream

    Add prctl based control for Speculative Store Bypass mitigation and make it
    the default mitigation for Intel and AMD.

    Andi Kleen provided the following rationale (slightly redacted):

    There are multiple levels of impact of Speculative Store Bypass:

    1) JITed sandbox.
    It cannot invoke system calls, but can do PRIME+PROBE and may have call
    interfaces to other code

    2) Native code process.
    No protection inside the process at this level.

    3) Kernel.

    4) Between processes.

    The prctl tries to protect against case (1) doing attacks.

    If the untrusted code can do random system calls then control is already
    lost in a much worse way. So there needs to be system call protection in
    some way (using a JIT not allowing them or seccomp). Or rather if the
    process can subvert its environment somehow to do the prctl it can already
    execute arbitrary code, which is much worse than SSB.

    To put it differently, the point of the prctl is to not allow JITed code
    to read data it shouldn't read from its JITed sandbox. If it already has
    escaped its sandbox then it can already read everything it wants in its
    address space, and do much worse.

    The ability to control Speculative Store Bypass makes it possible to
    enable the protection selectively without affecting overall system
    performance.
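
    From user space, the per-task control looks roughly like the sketch
    below. The commit message does not list the constants; the values used
    here are the ones found in mainline's include/uapi/linux/prctl.h and
    should be checked against the headers of the kernel actually in use.
    The helper names are assumptions for the example, and the calls only
    work on Linux kernels that carry this support.

```python
import ctypes
import os

# Values as in mainline include/uapi/linux/prctl.h (verify for your kernel).
PR_GET_SPECULATION_CTRL = 52
PR_SET_SPECULATION_CTRL = 53
PR_SPEC_STORE_BYPASS = 0
PR_SPEC_PRCTL = 1 << 0
PR_SPEC_ENABLE = 1 << 1
PR_SPEC_DISABLE = 1 << 2
PR_SPEC_FORCE_DISABLE = 1 << 3


def describe(status):
    """Decode the bitmask returned by PR_GET_SPECULATION_CTRL."""
    if status == 0:
        return ["not affected"]
    flags = []
    if status & PR_SPEC_PRCTL:
        flags.append("per-task control available")
    if status & PR_SPEC_ENABLE:
        flags.append("speculation enabled (mitigation off)")
    if status & PR_SPEC_DISABLE:
        flags.append("speculation disabled (mitigation on)")
    if status & PR_SPEC_FORCE_DISABLE:
        flags.append("speculation force-disabled")
    return flags


def _prctl(option, arg2, arg3):
    """Call prctl(2) through libc and raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    libc.prctl.restype = ctypes.c_int
    rc = libc.prctl(option, arg2, arg3, 0, 0)
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return rc


def get_ssb_status():
    """Return the current task's Speculative Store Bypass state."""
    return _prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0)


def disable_ssb_for_this_task():
    """Ask the kernel to mitigate SSB for the calling task."""
    _prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE)
```

    A sandboxing runtime would typically call disable_ssb_for_this_task()
    before running JITed code, and could log describe(get_ssb_status()) to
    confirm the result.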

    Based on an initial patch from Tim Chen. Completely rewritten.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 24f7fc83b9204d20f878c57cb77d261ae825e033 upstream

    Contemporary high performance processors use a common industry-wide
    optimization known as "Speculative Store Bypass" in which loads from
    addresses to which a recent store has occurred may (speculatively) see an
    older value. Intel refers to this feature as "Memory Disambiguation" which
    is part of their "Smart Memory Access" capability.

    Memory Disambiguation can expose a cache side-channel attack against such
    speculatively read values. An attacker can create exploit code that allows
    them to read memory outside of a sandbox environment (for example,
    malicious JavaScript in a web page), or to perform more complex attacks
    against code running within the same privilege level, e.g. via the stack.

    As a first step to mitigate against such attacks, provide two boot command
    line control knobs:

    nospec_store_bypass_disable
    spec_store_bypass_disable=[off,auto,on]

    By default affected x86 processors will power on with Speculative
    Store Bypass enabled. Hence the provided kernel parameters are written
    from the point of view of whether to enable a mitigation or not.
    The parameters are as follows:

    - auto - Kernel detects whether your CPU model contains an implementation
    of Speculative Store Bypass and picks the most appropriate
    mitigation.

    - on - disable Speculative Store Bypass
    - off - enable Speculative Store Bypass
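
    How the two knobs combine can be sketched as a small parser. The
    function name is hypothetical, and three behaviours are assumptions of
    the sketch rather than statements from this commit message: 'auto' is
    used when neither knob is given, 'nospec_store_bypass_disable' wins over
    the other knob, and unrecognized values are ignored.

```python
def ssb_mitigation_mode(cmdline, default="auto"):
    """Return "off", "auto" or "on" for a kernel command line string."""
    mode = default  # assumption: 'auto' when nothing is specified
    for token in cmdline.split():
        if token == "nospec_store_bypass_disable":
            # Treated as equivalent to spec_store_bypass_disable=off.
            return "off"
        if token.startswith("spec_store_bypass_disable="):
            value = token.split("=", 1)[1]
            if value in ("off", "auto", "on"):
                mode = value
    return mode
```

    Remember the inverted sense of the knob: "on" turns the mitigation on,
    which means Speculative Store Bypass itself is disabled.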

    [ tglx: Reordered the checks so that the whole evaluation is not done
    when the CPU does not support RDS ]

    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Konrad Rzeszutek Wilk
     

29 Apr, 2018

1 commit

  • [ Upstream commit 686140a1a9c41d85a4212a1c26d671139b76404b ]

    Implement CPU alternatives, which make it possible to optionally patch
    newer instructions at runtime, based on CPU facilities availability.

    A new kernel boot parameter "noaltinstr" disables patching.

    The current implementation is derived from x86 alternatives. However,
    ideal instruction padding (when altinstr is longer than oldinstr)
    is added at compile time, so no oldinstr nops optimization has to be
    done at runtime. A couple of compile time sanity checks are also done:
    1. oldinstr and altinstr must be
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     

22 Feb, 2018

1 commit

  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

08 Feb, 2018

1 commit

  • commit 12c69f1e94c89d40696e83804dd2f0965b5250cd

    The 'noreplace-paravirt' option disables paravirt patching, leaving the
    original pv indirect calls in place.

    That's highly incompatible with retpolines, unless we want to uglify
    paravirt even further and convert the paravirt calls to retpolines.

    As far as I can tell, the option doesn't seem to be useful for much
    other than introducing surprising corner cases and making the kernel
    vulnerable to Spectre v2. It was probably a debug option from the early
    paravirt days. So just remove it.

    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Juergen Gross
    Cc: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Cc: Ashok Raj
    Cc: Greg KH
    Cc: Jun Nakajima
    Cc: Tim Chen
    Cc: Rusty Russell
    Cc: Dave Hansen
    Cc: Asit Mallick
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jason Baron
    Cc: Paolo Bonzini
    Cc: Alok Kataria
    Cc: Arjan Van De Ven
    Cc: David Woodhouse
    Cc: Dan Williams
    Link: https://lkml.kernel.org/r/20180131041333.2x6blhxirc2kclrq@treble
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

17 Jan, 2018

2 commits

  • commit da285121560e769cc31797bba6422eea71d473e0 upstream.

    Add a spectre_v2= option to select the mitigation used for the indirect
    branch speculation vulnerability.

    Currently, the only option available is retpoline, in its various forms.
    This will be expanded to cover the new IBRS/IBPB microcode features.

    The RETPOLINE_AMD feature relies on a serializing LFENCE for speculation
    control. For AMD hardware, only set RETPOLINE_AMD if LFENCE is a
    serializing instruction, which is indicated by the LFENCE_RDTSC feature.
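
    The AMD rule above amounts to a single check. In this sketch the
    function name and the plain "RETPOLINE" fallback label are assumptions;
    only RETPOLINE_AMD and LFENCE_RDTSC are named in the commit message.

```python
def select_retpoline(vendor, lfence_rdtsc):
    """Pick a retpoline flavour.

    vendor:       CPU vendor string, e.g. "AMD" or "Intel"
    lfence_rdtsc: True if the LFENCE_RDTSC feature is set, i.e. LFENCE
                  is a serializing instruction on this CPU
    """
    if vendor == "AMD" and lfence_rdtsc:
        return "RETPOLINE_AMD"
    # Hypothetical label for the non-LFENCE variant.
    return "RETPOLINE"
```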

    [ tglx: Folded back the LFENCE/AMD fixes and reworked it so IBRS
    integration becomes simple ]

    Signed-off-by: David Woodhouse
    Signed-off-by: Thomas Gleixner
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Josh Poimboeuf
    Cc: thomas.lendacky@amd.com
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Tim Chen
    Cc: Greg Kroah-Hartman
    Cc: Paul Turner
    Link: https://lkml.kernel.org/r/1515707194-20531-5-git-send-email-dwmw@amazon.co.uk
    Signed-off-by: Greg Kroah-Hartman

    David Woodhouse
     
  • commit 01c9b17bf673b05bb401b76ec763e9730ccf1376 upstream.

    Add some details about how PTI works, what some of the downsides
    are, and how to debug it when things go wrong.

    Also document the kernel parameter: 'pti/nopti'.

    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Randy Dunlap
    Reviewed-by: Kees Cook
    Cc: Moritz Lipp
    Cc: Daniel Gruss
    Cc: Michael Schwarz
    Cc: Richard Fellner
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Hugh Dickins
    Cc: Andi Lutomirsky
    Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
     

03 Jan, 2018

2 commits

  • commit 41f4c20b57a4890ea7f56ff8717cc83fefb8d537 upstream.

    Keep the "nopti" optional for traditional reasons.

    [ tglx: Don't allow force on when running on XEN PV and made 'on'
    printout conditional ]

    Requested-by: Linus Torvalds
    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Andy Lutomirsky
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Link: https://lkml.kernel.org/r/20171212133952.10177-1-bp@alien8.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Borislav Petkov
     
  • commit aa8c6248f8c75acfd610fe15d8cae23cf70d9d09 upstream.

    Add the initial files for kernel page table isolation, with a minimal init
    function and the boot time detection for this misfeature.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

13 Sep, 2017

1 commit

  • Pull selinux updates from Paul Moore:
    "A relatively quiet period for SELinux, 11 patches with only two/three
    having any substantive changes.

    These noteworthy changes include another tweak to the NNP/nosuid
    handling, per-file labeling for cgroups, and an object class fix for
    AF_UNIX/SOCK_RAW sockets; the rest of the changes are minor tweaks or
    administrative updates (Stephen's email update explains the file
    explosion in the diffstat).

    Everything passes the selinux-testsuite"

    [ Also a couple of small patches from the security tree from Tetsuo
    Handa for Tomoyo and LSM cleanup. The separation of security policy
    updates wasn't all that clean - Linus ]

    * tag 'selinux-pr-20170831' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    selinux: constify nf_hook_ops
    selinux: allow per-file labeling for cgroupfs
    lsm_audit: update my email address
    selinux: update my email address
    MAINTAINERS: update the NetLabel and Labeled Networking information
    selinux: use GFP_NOWAIT in the AVC kmem_caches
    selinux: Generalize support for NNP/nosuid SELinux domain transitions
    selinux: genheaders should fail if too many permissions are defined
    selinux: update the selinux info in MAINTAINERS
    credits: update Paul Moore's info
    selinux: Assign proper class to PF_UNIX/SOCK_RAW sockets
    tomoyo: Update URLs in Documentation/admin-guide/LSM/tomoyo.rst
    LSM: Remove security_task_create() hook.

    Linus Torvalds
     

09 Sep, 2017

1 commit

  • Pull ARC updates from Vineet Gupta:

    - Support for HSDK board hosting a Quad core HS38x4 based SoC running
    @1GHz (and some prerequisite changes such as ability to scoot the
    kernel code/data from start of memory map etc)

    - Quite a few updates for EZChip (Mellanox) platform

    - Fixes to fault/exception printing

    * tag 'arc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (26 commits)
    ARC: Re-enable MMU upon Machine Check exception
    ARC: Show fault information passed to show_kernel_fault_diag()
    ARC: [plat-hsdk] initial port for HSDK board
    ARC: mm: Decouple RAM base address from kernel link address
    ARCv2: IOC: Tighten up the contraints (specifically base / size alignment)
    ARC: [plat-axs103] refactor the DT fudging code
    ARC: [plat-axs103] use clk driver #2: Add core pll node to DT to manage cpu clk
    ARC: [plat-axs103] use clk driver #1: Get rid of platform specific cpu clk setting
    ARCv2: SLC: provide a line based flush routine for debugging
    ARC: Hardcode ARCH_DMA_MINALIGN to max line length we may have
    ARC: [plat-eznps] handle extra aux regs #2: kernel/entry exit
    ARC: [plat-eznps] handle extra aux regs #1: save/restore on context switch
    ARC: [plat-eznps] avoid toggling of DPC register
    ARC: [plat-eznps] Update the init sequence of aux regs per cpu.
    ARC: [plat-eznps] new command line argument for HW scheduler at MTM
    ARC: set boot print log level to PR_INFO
    ARC: [plat-eznps] Handle user memory error same in simulation and silicon
    ARC: [plat-eznps] use schd.wft instruction instead of sleep at idle task
    ARC: create cpu specific version of arch_cpu_idle()
    ARC: [plat-eznps] spinlock aware for MTM
    ...

    Linus Torvalds
     

07 Sep, 2017

1 commit

  • Patch series "cleanup zonelists initialization", v1.

    This is aimed at cleaning up the zonelists initialization code we have
    but the primary motivation was bug report [2] which got resolved but the
    usage of stop_machine is just too ugly to live. Most patches are
    straightforward but 3 of them need a special consideration.

    Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
    because this is a user visible change. As I argue in the patch
    description I do not think we have a strong usecase for it these days.
    I have kept sysctl in place and warn into the log if somebody tries to
    configure zone lists ordering. If somebody has a real usecase for it we
    can revert this patch but I do not expect anybody will actually notice
    runtime differences. This patch is not strictly needed for the rest but
    it made patch 6 easier to implement.

    Patch 7 removes stop_machine from build_all_zonelists without adding any
    special synchronization between iterators and updater which I _believe_
    is acceptable as explained in the changelog. I hope I am not missing
    anything.

    Patch 8 then removes zonelists_mutex which is kind of ugly as well and
    not really needed AFAICS but care should be taken when double checking
    my thinking.

    This patch (of 9):

    Supporting zone ordered zonelists costs us just a lot of code while the
    usefulness is arguable if existent at all. Mel has already made node
    ordering default on 64b systems. 32b systems are still using
    ZONELIST_ORDER_ZONE because it is considered better to fallback to a
    different NUMA node rather than consume precious lowmem zones.

    This argument is, however, weakened by the fact that the memory reclaim
    has been reworked to be node rather than zone oriented. This means that
    lowmem requests have to skip over all highmem pages on LRUs already and
    so zone ordering doesn't save the reclaim time much. So the only
    advantage of the zone ordering is under a light memory pressure when
    highmem requests do not ever hit into lowmem zones and the lowmem
    pressure doesn't need to reclaim.

    Considering that 32b NUMA systems are rather suboptimal already and it
    is generally advisable to use 64b kernel on such a HW I believe we
    should rather care about the code maintainability and just get rid of
    ZONELIST_ORDER_ZONE altogether. Keep sysctl in place and warn if
    somebody tries to set zone ordering either from kernel command line or
    the sysctl.

    [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
    Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Abdul Haleem
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Sep, 2017

3 commits

  • Pull power management updates from Rafael Wysocki:
    "This time (again) cpufreq gets the majority of changes which mostly
    are driver updates (including a major consolidation of intel_pstate),
    some schedutil governor modifications and core cleanups.

    There also are some changes in the system suspend area, mostly related
    to diagnostics and debug messages plus some renames of things related
    to suspend-to-idle. One major change here is that suspend-to-idle is
    now going to be preferred over S3 on systems where the ACPI tables
    indicate to do so and provide requisite support (the Low Power Idle S0
    _DSM in particular). The system sleep documentation and the tools
    related to it are updated too.

    The rest is a few cpuidle changes (nothing major), devfreq updates,
    generic power domains (genpd) framework updates and a few assorted
    modifications elsewhere.

    Specifics:

    - Drop the P-state selection algorithm based on a PID controller from
    intel_pstate and make it use the same P-state selection method
    (based on the CPU load) for all types of systems in the active mode
    (Rafael Wysocki, Srinivas Pandruvada).

    - Rework the cpufreq core and governors to make it possible to take
    cross-CPU utilization updates into account and modify the schedutil
    governor to actually do so (Viresh Kumar).

    - Clean up the handling of transition latency information in the
    cpufreq core and untangle it from the information on which drivers
    cannot do dynamic frequency switching (Viresh Kumar).

    - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
    cpufreq driver and update its DT bindings (Sean Wang).

    - Modify the cpufreq dt-platdev driver to automatically create
    cpufreq devices for the new (v2) Operating Performance Points (OPP)
    DT bindings and update its whitelist of supported systems (Viresh
    Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
    Xiao).

    - Add support for Ux500 to the cpufreq-dt driver and drop the
    obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).

    - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
    Nguyen).

    - Fix and clean up assorted issues in the cpufreq drivers and core
    (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
    Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).

    - Update the IO-wait boost handling in the schedutil governor to make
    it less aggressive (Joel Fernandes).

    - Rework system suspend diagnostics to make it print fewer messages
    to the kernel log by default, add a sysfs knob to allow more
    suspend-related messages to be printed and add Low Power S0 Idle
    constraints checks to the ACPI suspend-to-idle code (Rafael
    Wysocki, Srinivas Pandruvada).

    - Prefer suspend-to-idle over S3 on ACPI-based systems with the
    ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
    interface present in the ACPI tables (Rafael Wysocki).

    - Update documentation related to system sleep and rename a number of
    items in the code to make it clear that they are related to
    suspend-to-idle (Rafael Wysocki).

    - Export a variable allowing device drivers to check the target
    system sleep state from the core system suspend code (Florian
    Fainelli).

    - Clean up the cpuidle subsystem to handle the polling state on x86
    in a more straightforward way and to use %pOF instead of full_name
    (Rafael Wysocki, Rob Herring).

    - Update the devfreq framework to fix and clean up a few minor issues
    (Chanwoo Choi, Rob Herring).

    - Extend diagnostics in the generic power domains (genpd) framework
    and clean it up slightly (Thara Gopinath, Rob Herring).

    - Fix and clean up a couple of issues in the operating performance
    points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).

    - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
    (AVS) driver (David Wu).

    - Fix the usage of notifiers in CPU power management on some
    platforms (Alex Shi).

    - Update the pm-graph system suspend/hibernation and boot profiling
    utility (Todd Brandt).

    - Make it possible to run the cpupower utility without CPU0 (Prarit
    Bhargava)"
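
The suspend-to-idle preference described in the list above can be read as a small decision rule. The following is only a sketch under stated assumptions, not the kernel's code: `default_sleep_state()`, `enum sleep_state`, the `s3_supported` fallback and the example flag bit are all hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the ACPI_FADT_LOW_POWER_S0 flag; the bit
 * position used here is an assumption made for this sketch. */
#define FADT_LOW_POWER_S0_EXAMPLE (1u << 21)

enum sleep_state { SLEEP_S2IDLE, SLEEP_S3 };

/*
 * Hypothetical helper: pick the default system sleep state.
 * Suspend-to-idle is preferred when the FADT flag is set and the
 * Low Power Idle S0 _DSM interface is present. Otherwise S3 is used
 * if the platform supports it (this fallback is an assumption).
 */
static enum sleep_state default_sleep_state(uint32_t fadt_flags,
                                            bool lps0_dsm_present,
                                            bool s3_supported)
{
    if ((fadt_flags & FADT_LOW_POWER_S0_EXAMPLE) && lps0_dsm_present)
        return SLEEP_S2IDLE;

    return s3_supported ? SLEEP_S3 : SLEEP_S2IDLE;
}
```

Both conditions have to hold for suspend-to-idle to win: the flag alone, or the _DSM alone, leaves S3 as the default in this sketch.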

    * tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
    cpuidle: Make drivers initialize polling state
    cpuidle: Move polling state initialization code to separate file
    cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
    cpufreq: imx6q: Fix imx6sx low frequency support
    cpufreq: speedstep-lib: make several arrays static, makes code smaller
    PM: docs: Delete the obsolete states.txt document
    PM: docs: Describe high-level PM strategies and sleep states
    PM / devfreq: Fix memory leak when fail to register device
    PM / devfreq: Add dependency on PM_OPP
    PM / devfreq: Move private devfreq_update_stats() into devfreq
    PM / devfreq: Convert to using %pOF instead of full_name
    PM / AVS: rockchip-io: add io selectors and supplies for RV1108
    cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
    cpufreq: dt-platdev: Drop few entries from whitelist
    cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
    ARM: ux500: don't select CPUFREQ_DT
    cpuidle: Convert to using %pOF instead of full_name
    cpufreq: Convert to using %pOF instead of full_name
    PM / Domains: Convert to using %pOF instead of full_name
    cpufreq: Cap the default transition delay value to 10 ms
    ...

    Linus Torvalds
     
  • Pull char/misc driver updates from Greg KH:
    "Here is the big char/misc driver update for 4.14-rc1.

    Lots of different stuff in here, it's been an active development cycle
    for some reason. Highlights are:

    - updated binder driver, this brings binder up to date with what
    shipped in the Android O release, plus some more changes that
    happened since then that are in the Android development trees.

    - coresight updates and fixes

    - mux driver file renames to be a bit "nicer"

    - intel_th driver updates

    - normal set of hyper-v updates and changes

    - small fpga subsystem and driver updates

    - lots of const code changes all over the driver trees

    - extcon driver updates

    - fmc driver subsystem updates

    - w1 subsystem minor reworks and new features and drivers added

    - spmi driver updates

    Plus a smattering of other minor driver updates and fixes.

    All of these have been in linux-next with no reported issues for a
    while"

    * tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
    ANDROID: binder: don't queue async transactions to thread.
    ANDROID: binder: don't enqueue death notifications to thread todo.
    ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
    ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
    ANDROID: binder: push new transactions to waiting threads.
    ANDROID: binder: remove proc waitqueue
    android: binder: Add page usage in binder stats
    android: binder: fixup crash introduced by moving buffer hdr
    drivers: w1: add hwmon temp support for w1_therm
    drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
    drivers: w1: add hwmon support structures
    eeprom: idt_89hpesx: Support both ACPI and OF probing
    mcb: Fix an error handling path in 'chameleon_parse_cells()'
    MCB: add support for SC31 to mcb-lpc
    mux: make device_type const
    char: virtio: constify attribute_group structures.
    Documentation/ABI: document the nvmem sysfs files
    lkdtm: fix spelling mistake: "incremeted" -> "incremented"
    perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
    nvmem: include linux/err.h from header
    ...

    Linus Torvalds
     
  • Pull s390 updates from Martin Schwidefsky:
    "The first part of the s390 updates for 4.14:

    - Add machine type 0x3906 for IBM z14

    - Add IBM z14 TLB flushing improvements for KVM guests

    - Exploit the TOD clock epoch extension to provide a continuous TOD
    clock after 2042/09/17

    - Add NIAI spinlock hints for IBM z14

    - Rework the vmcp driver and use CMA for the response buffer of z/VM
    CP commands

    - Drop some s390 specific asm headers and use the generic version

    - Add block discard for DASD-FBA devices under z/VM

    - Add average request times to DASD statistics

    - A few of those constify patches which seem to be in vogue right now

    - Cleanup and bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (50 commits)
    s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs
    s390/dasd: Add discard support for FBA devices
    s390/zcrypt: make CPRBX const
    s390/uaccess: avoid mvcos jump label
    s390/mm: use generic mm_hooks
    s390/facilities: fix typo
    s390/vmcp: simplify vmcp_response_free()
    s390/topology: Remove the unused parent_node() macro
    s390/dasd: Change unsigned long long to unsigned long
    s390/smp: convert cpuhp_setup_state() return code to zero on success
    s390: fix 'novx' early parameter handling
    s390/dasd: add average request times to dasd statistics
    s390/scm: use common completion path
    s390/pci: log changes to uid checking
    s390/vmcp: simplify vmcp_ioctl()
    s390/vmcp: return -ENOTTY for unknown ioctl commands
    s390/vmcp: split vmcp header file and move to uapi
    s390/vmcp: make use of contiguous memory allocator
    s390/cpcmd,vmcp: avoid GFP_DMA allocations
    s390/vmcp: fix uaccess check and avoid undefined behavior
    ...

    Linus Torvalds
     

05 Sep, 2017

2 commits

  • Pull x86 cache quality monitoring update from Thomas Gleixner:
    "This update provides a complete rewrite of the Cache Quality
    Monitoring (CQM) facility.

    The existing CQM support was duct taped into perf with a lot of issues
    and the attempts to fix those turned out to be incomplete and
    horrible.

    After lengthy discussions it was decided to integrate the CQM support
    into the Resource Director Technology (RDT) facility, which is the
    obvious choice as in hardware CQM is part of RDT. This allowed adding
    Memory Bandwidth Monitoring support on top.

    As a result the mechanisms for allocating cache/memory bandwidth and
    the corresponding monitoring mechanisms are integrated into a single
    management facility with a consistent user interface"

    * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/intel_rdt: Turn off most RDT features on Skylake
    x86/intel_rdt: Add command line options for resource director technology
    x86/intel_rdt: Move special case code for Haswell to a quirk function
    x86/intel_rdt: Remove redundant ternary operator on return
    x86/intel_rdt/cqm: Improve limbo list processing
    x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
    x86/intel_rdt: Modify the intel_pqr_state for better performance
    x86/intel_rdt/cqm: Clear the default RMID during hotcpu
    x86/intel_rdt: Show bitmask of shareable resource with other executing units
    x86/intel_rdt/mbm: Handle counter overflow
    x86/intel_rdt/mbm: Add mbm counter initialization
    x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
    x86/intel_rdt/cqm: Add CPU hotplug support
    x86/intel_rdt/cqm: Add sched_in support
    x86/intel_rdt: Introduce rdt_enable_key for scheduling
    x86/intel_rdt/cqm: Add mount,umount support
    x86/intel_rdt/cqm: Add rmdir support
    x86/intel_rdt: Separate the ctrl bits from rmdir
    x86/intel_rdt/cqm: Add mon_data
    x86/intel_rdt: Prepare for RDT monitor data support
    ...

    Linus Torvalds
     
  • Pull x86 mm changes from Ingo Molnar:
    "PCID support, 5-level paging support, Secure Memory Encryption support

    The main changes in this cycle are support for three new, complex
    hardware features of x86 CPUs:

    - Add 5-level paging support, which is a new hardware feature on
    upcoming Intel CPUs allowing up to 128 PB of virtual address space
    and 4 PB of physical RAM space - a 512-fold increase over the old
    limits. (Supercomputers of the future forecasting hurricanes on an
    ever warming planet can certainly make good use of more RAM.)

    Many of the necessary changes went upstream in previous cycles,
    v4.14 is the first kernel that can enable 5-level paging.

    This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
    default.

    (By Kirill A. Shutemov)

    - Add 'encrypted memory' support, which is a new hardware feature on
    upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
    RAM to be encrypted and decrypted (mostly) transparently by the
    CPU, with a little help from the kernel to transition to/from
    encrypted RAM. Such RAM should be more secure against various
    attacks like RAM access via the memory bus and should make the
    radio signature of memory bus traffic harder to intercept (and
    decrypt) as well.

    This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
    by default.

    (By Tom Lendacky)

    - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
    hardware feature that attaches an address space tag to TLB entries
    and thus allows skipping TLB flushing in many cases, even if we
    switch mm's.

    (By Andy Lutomirski)

    All three of these features were in the works for a long time, and
    it's a coincidence of the three independent development paths that they
    are all enabled in v4.14 at once"
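
The 128 PB figure and the 512-fold increase quoted for 5-level paging follow from the page-table geometry. A minimal sketch of the arithmetic, assuming the usual x86-64 layout of 4 KiB pages (12 offset bits) and 9 index bits per table level; the helper names `va_bits()` and `va_space_tib()` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Assumed x86-64 layout: 12 page-offset bits, 9 index bits per level. */
enum { PAGE_OFFSET_BITS = 12, INDEX_BITS_PER_LEVEL = 9 };

/* Virtual address bits translated by a page-table tree of the given depth. */
static unsigned int va_bits(unsigned int levels)
{
    return PAGE_OFFSET_BITS + levels * INDEX_BITS_PER_LEVEL;
}

/* Virtual address space size in TiB (units of 2^40 bytes).
 * Only meaningful when va_bits(levels) is at least 40. */
static uint64_t va_space_tib(unsigned int levels)
{
    return (uint64_t)1 << (va_bits(levels) - 40);
}
```

With these assumptions, four levels give 48 address bits (256 TiB) and five levels give 57 bits (131072 TiB, i.e. 128 PiB). The extra level contributes 9 bits, which is where the factor of 2^9 = 512 comes from.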

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
    x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
    x86/mm: Use pr_cont() in dump_pagetable()
    x86/mm: Fix SME encryption stack ptr handling
    kvm/x86: Avoid clearing the C-bit in rsvd_bits()
    x86/CPU: Align CR3 defines
    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    acpi, x86/mm: Remove encryption mask from ACPI page protection type
    x86/mm, kexec: Fix memory corruption with SME on successive kexecs
    x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
    x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
    x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
    x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
    x86/mm: Allow userspace have mappings above 47-bit
    x86/mm: Prepare to expose larger address space to userspace
    x86/mpx: Do not allow MPX if we have mappings above 47-bit
    x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
    x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
    x86/mm/dump_pagetables: Fix printout of p4d level
    x86/mm/dump_pagetables: Generalize address normalization
    x86/boot: Fix memremap() related build failure
    ...

    Linus Torvalds
     

04 Sep, 2017

3 commits

  • * pm-docs:
    PM: docs: Delete the obsolete states.txt document
    PM: docs: Describe high-level PM strategies and sleep states

    Rafael J. Wysocki
     
  • * intel_pstate:
    cpufreq: intel_pstate: Shorten a couple of long names
    cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()
    cpufreq: intel_pstate: Improve IO performance with per-core P-states
    cpufreq: intel_pstate: Drop INTEL_PSTATE_HWP_SAMPLING_INTERVAL
    cpufreq: intel_pstate: Drop ->update_util from pstate_funcs
    cpufreq: intel_pstate: Do not use PID-based P-state selection

    Rafael J. Wysocki
     
  • * pm-cpufreq: (33 commits)
    cpufreq: imx6q: Fix imx6sx low frequency support
    cpufreq: speedstep-lib: make several arrays static, makes code smaller
    cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
    cpufreq: dt-platdev: Drop few entries from whitelist
    cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
    ARM: ux500: don't select CPUFREQ_DT
    cpufreq: Convert to using %pOF instead of full_name
    cpufreq: Cap the default transition delay value to 10 ms
    cpufreq: dbx500: Delete obsolete driver
    mfd: db8500-prcmu: Get rid of cpufreq dependency
    cpufreq: enable the DT cpufreq driver on the Ux500
    cpufreq: Loongson2: constify platform_device_id
    cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver
    cpufreq: remove setting of policy->cpu in policy->cpus during init
    cpufreq: mediatek: add support of cpufreq to MT7622 SoC
    cpufreq: mediatek: add cleanups with the more generic naming
    cpufreq: rcar: Add support for R8A7795 SoC
    cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver
    cpufreq: s5pv210: add missing of_node_put()
    cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency
    ...

    Rafael J. Wysocki
     

29 Aug, 2017

2 commits

  • Add the ability for all cores on the NPS SoC to control the number of
    cycles a HW thread can execute before it is replaced with another
    eligible HW thread within the same core. The replacement is done by the
    HW scheduler.

    Signed-off-by: Noam Camus
    Signed-off-by: Vineet Gupta
    [vgupta: simplified handling of out of range argument value]

    Noam Camus
     
  • Reorganize the power management part of admin-guide by adding a
    description of major power management strategies supported by the
    kernel (system-wide and working-state power management) to it and
    dividing the rest of the material into the system-wide PM and
    working-state PM chapters.

    On top of that, add a description of system sleep states to the
    system-wide PM chapter.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Lukas Wunner

    Rafael J. Wysocki
     

26 Aug, 2017

2 commits


21 Aug, 2017

2 commits


15 Aug, 2017

1 commit


09 Aug, 2017

1 commit

  • If memory is fragmented, it is unlikely that large order memory
    allocations succeed. This has been an issue with the vmcp device
    driver for a long time, since it requires large physically contiguous
    memory areas for large responses.

    To hopefully resolve this issue, make use of the contiguous memory
    allocator (cma). This patch adds a vmcp-specific cma area with a
    default size of 4MB. The size can be changed either via the
    VMCP_CMA_SIZE config option at compile time or with the "vmcp_cma"
    kernel parameter (e.g. "vmcp_cma=16m").

    For any vmcp response buffer larger than 16k, memory is allocated
    from the cma area. If such an allocation fails, there is a fallback to
    the buddy allocator.
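
The allocation policy described above (cma for buffers larger than 16k, buddy allocator as the fallback) can be sketched in plain userspace C. This is an illustration under stated assumptions, not the driver's code: `response_alloc()`, `fake_cma_alloc()`, `fake_buddy_alloc()` and the `cma_available` switch are hypothetical stand-ins, and `malloc()` replaces both kernel allocators.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* From the text: buffers larger than 16k are taken from the cma area. */
#define CMA_THRESHOLD (16 * 1024)

enum buf_source { BUF_NONE, BUF_CMA, BUF_BUDDY };

struct response_buf {
    void *mem;
    enum buf_source source;
};

/* Hypothetical stand-in for the cma allocator; can be made to fail. */
static void *fake_cma_alloc(size_t size, bool cma_available)
{
    return cma_available ? malloc(size) : NULL;
}

/* Hypothetical stand-in for the buddy allocator. */
static void *fake_buddy_alloc(size_t size)
{
    return malloc(size);
}

/* Try cma first for large buffers, then fall back to the buddy allocator. */
static struct response_buf response_alloc(size_t size, bool cma_available)
{
    struct response_buf buf = { NULL, BUF_NONE };

    if (size > CMA_THRESHOLD) {
        buf.mem = fake_cma_alloc(size, cma_available);
        if (buf.mem) {
            buf.source = BUF_CMA;
            return buf;
        }
    }

    buf.mem = fake_buddy_alloc(size);
    if (buf.mem)
        buf.source = BUF_BUDDY;
    return buf;
}

/* Release a buffer obtained from response_alloc(). */
static void response_free(struct response_buf *buf)
{
    free(buf->mem);
    buf->mem = NULL;
    buf->source = BUF_NONE;
}
```

Recording where the memory came from matters because the free path has to return it to the matching allocator; the sketch keeps that as a `source` field even though both stand-ins use `malloc()`.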

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens