25 Dec, 2016

1 commit


14 Dec, 2016

3 commits

  • Pull workqueue updates from Tejun Heo:
    "Mostly patches to initialize workqueue subsystem earlier and get rid
    of keventd_up().

    The patches were headed for the last merge cycle but got delayed due
    to a bug found late minute, which is fixed now.

    Also, to help debugging, destroy_workqueue() is more chatty now on a
    sanity check failure."

    * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: move wq_numa_init() to workqueue_init()
    workqueue: remove keventd_up()
    debugobj, workqueue: remove keventd_up() usage
    slab, workqueue: remove keventd_up() usage
    power, workqueue: remove keventd_up() usage
    tty, workqueue: remove keventd_up() usage
    mce, workqueue: remove keventd_up() usage
    workqueue: make workqueue available early during boot
    workqueue: dump workqueue state on sanity check failures in destroy_workqueue()

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "Again, cpufreq gets more changes than the other parts this time (one
    new driver, one old driver less, a bunch of enhancements of the
    existing code, new CPU IDs, fixes, cleanups)

    There also are some changes in cpuidle (idle injection rework, a
    couple of new CPU IDs, online/offline rework in intel_idle, fixes and
    cleanups), in the generic power domains framework (mostly related to
    supporting power domains containing CPUs), and in the Operating
    Performance Points (OPP) library (mostly related to supporting devices
    with multiple voltage regulators)

    In addition to that, the system sleep state selection interface is
    modified to make it easier for distributions with unchanged user space
    to support suspend-to-idle as the default system suspend method, some
    issues are fixed in the PM core, the latency tolerance PM QoS
    framework is improved a bit, the Intel RAPL power capping driver is
    cleaned up and there are some fixes and cleanups in the devfreq
    subsystem

    Specifics:

    - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer)

    - Support for ARM Integrator/AP and Integrator/CP in the generic DT
    cpufreq driver and elimination of the old Integrator cpufreq driver
    (Linus Walleij)

    - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

    - cpufreq core fix to eliminate races that may lead to using inactive
    policy objects and related cleanups (Rafael Wysocki)

    - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
    threads (instead of regular workqueues) for doing delayed work (to
    reduce the response latency in some cases) and related cleanups
    (Viresh Kumar)

    - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

    - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar)

    - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki)

    - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
    Performance Preference/Energy Performance Bias) knobs in the
    intel_pstate driver (Srinivas Pandruvada)

    - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

    - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile in
    the ACPI tables set to "mobile" (Srinivas Pandruvada)

    - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada)

    - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanus (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov)

    - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior)

    - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash)

    - Idle injection rework (to make it use the regular idle path instead
    of a home-grown custom one) and related powerclamp thermal driver
    updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
    Siewior)

    - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc)

    - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior)

    - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla)

    - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki)

    - Preliminary support for power domains including CPUs in the generic
    power domains (genpd) framework and related DT bindings (Lina Iyer)

    - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

    - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

    - System sleep state selection interface rework to make it easier to
    support suspend-to-idle as the default system suspend method
    (Rafael Wysocki)

    - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren)

    - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski)

    - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc)

    - Intel RAPL power capping driver fixes, cleanups and switch over to
    using the new CPU offline/online state machine (Jacob Pan, Thomas
    Gleixner, Sebastian Andrzej Siewior)

    - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

    - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf)

    - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu)

    - Wakeup sources debugging enhancement (Xing Wei)

    - rockchip-io AVS driver cleanup (Shawn Lin)"

    * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
    devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
    devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
    devfreq: exynos: Don't use OPP structures outside of RCU locks
    Documentation: intel_pstate: Document HWP energy/performance hints
    cpufreq: intel_pstate: Support for energy performance hints with HWP
    cpufreq: intel_pstate: Add locking around HWP requests
    PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
    PM / core: Fix bug in the error handling of async suspend
    PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
    PM / Domains: Fix compatible for domain idle state
    PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
    PM / OPP: Allow platform specific custom set_opp() callbacks
    PM / OPP: Separate out _generic_set_opp()
    PM / OPP: Add infrastructure to manage multiple regulators
    PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
    PM / OPP: Manage supply's voltage/current in a separate structure
    PM / OPP: Don't use OPP structure outside of rcu protected section
    PM / OPP: Reword binding supporting multiple regulators per device
    PM / OPP: Fix incorrect cpu-supply property in binding
    cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
    ..

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request is:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support from SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Supoort for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • * pm-sleep:
    PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
    x86/suspend: fix false positive KASAN warning on suspend/resume
    PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag
    PM / sleep: System sleep state selection interface rework
    PM / hibernate: Verify the consistent of e820 memory map by md5 digest

    * powercap:
    powercap / RAPL: Add Knights Mill CPUID
    powercap/intel_rapl: fix and tidy up error handling
    powercap/intel_rapl: Track active CPUs internally
    powercap/intel_rapl: Cleanup duplicated init code
    powercap/intel rapl: Convert to hotplug state machine
    powercap/intel_rapl: Propagate error code when registration fails
    powercap/intel_rapl: Add missing domain data update on hotplug

    Rafael J. Wysocki
     

22 Nov, 2016

2 commits

  • Modify the ACPI system sleep support setup code to select
    suspend-to-idle as the default system sleep state if the
    ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and the
    default sleep state was not selected from the kernel command
    line.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Mario Limonciello

    Rafael J. Wysocki
     
  • There are systems in which the platform doesn't support any special
    sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only
    available system sleep state. However, some user space frameworks
    only use the "mem" and (sometimes) "standby" sleep state labels, so
    the users of those systems need to modify user space in order to be
    able to use system suspend at all and that may be a pain in practice.

    Commit 0399d4db3edf (PM / sleep: Introduce command line argument for
    sleep state enumeration) attempted to address this problem by adding
    a command line argument to change the meaning of the "mem" string in
    /sys/power/state to make it trigger suspend-to-idle (instead of
    suspend-to-RAM).

    However, there also are systems in which the platform does support
    special sleep states, but suspend-to-idle is the preferred one anyway
    (it even may save more energy than the platform-provided sleep states
    in some cases) and the above commit doesn't help in those cases.

    For this reason, rework the system sleep state selection interface
    again (but preserve backwards compatibiliby). Namely, add a new
    sysfs file, /sys/power/mem_sleep, that will control the system
    suspend mode triggered by writing "mem" to /sys/power/state (in
    analogy with what /sys/power/disk does for hibernation). Make it
    select suspend-to-RAM ("deep" sleep) by default (if supported) and
    fall back to suspend-to-idle ("s2idle") otherwise and add a new
    command line argument, mem_sleep_default, allowing that default to
    be overridden if need be.

    At the same time, drop the relative_sleep_states command line
    argument that doesn't make sense any more.

    Signed-off-by: Rafael J. Wysocki
    Tested-by: Mario Limonciello

    Rafael J. Wysocki
     

02 Nov, 2016

1 commit


01 Nov, 2016

1 commit


24 Oct, 2016

1 commit

  • Commit 4bcc595ccd80 (printk: reinstate KERN_CONT for printing
    continuation lines) exposed a missing KERN_CONT from one of the
    messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
    after syncing the filesystems no longer appears as a continuation but
    a new message with its own timestamp.

    [ 9.259566] PM: Syncing filesystems ... [ 9.264119] done.

    Fix this by adding the KERN_CONT log level for the 'done.' part of the
    message seen after syncing filesystems. While we are at it, convert
    these suspend printks to pr_info and pr_cont, respectively.

    Signed-off-by: Jon Hunter
    Signed-off-by: Rafael J. Wysocki

    Jon Hunter
     

20 Oct, 2016

1 commit


08 Oct, 2016

1 commit

  • Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") has workaround an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have stable mm
    available and hide the oom_reaped mm by MMF_OOM_SKIP flag. So let's
    remove exit_oom_victim and the race described in the above commit
    doesn't exist anymore if.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on a way to exit for an
    unbounded amount of time). OOM killer can cope with that by checking mm
    flags and move on to another victim but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time the oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of timeout for back off so we can reuse the same value
    there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Sep, 2016

1 commit


13 Sep, 2016

3 commits

  • PAGE_POISONING_ZERO disables zeroing new pages on alloc, they are
    poisoned (zeroed) as they become available.
    In the hibernate use case, free pages will appear in the system without
    being cleared, left there by the loading kernel.

    This patch will make sure free pages are cleared on resume when
    PAGE_POISONING_ZERO is enabled. We free the pages just after resume
    because we can't do it later: going through any device resume code might
    allocate some memory and invalidate the free pages bitmap.

    Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
    enabled.

    Signed-off-by: Anisse Astier
    Reviewed-by: Kees Cook
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Anisse Astier
     
  • Suspend-to-idle (aka the "freeze" sleep state) is a system sleep state
    in which all of the processors enter deepest possible idle state and
    wait for interrupts right after suspending all the devices.

    There is no hard requirement for a platform to support and register
    platform specific suspend_ops to enter suspend-to-idle/freeze state.
    Only deeper system sleep states like PM_SUSPEND_STANDBY and
    PM_SUSPEND_MEM rely on such low level support/implementation.

    suspend-to-idle can be entered as along as all the devices can be
    suspended. This patch enables the support for suspend-to-idle even on
    systems that don't have any low level support for deeper system sleep
    states and/or don't register any platform specific suspend_ops.

    Signed-off-by: Sudeep Holla
    Tested-by: Andy Gross
    Signed-off-by: Rafael J. Wysocki

    Sudeep Holla
     
  • Recently we have a new report that, the harddisk can not
    resume on time due to firmware issues, and got a kernel
    panic because of DPM watchdog timeout. So adjust the
    default timeout from 60 to 120 to survive on this platform,
    and make DPM_WATCHDOG depending on EXPERT.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=117971
    Suggested-by: Pavel Machek
    Suggested-by: Rafael J. Wysocki
    Reported-by: Higuita
    Signed-off-by: Chen Yu
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Chen Yu
     

05 Sep, 2016

1 commit

  • of_clk_init() ends up calling into pm_qos_update_request() very early
    during boot where irq is expected to stay disabled.
    pm_qos_update_request() uses cancel_delayed_work_sync() which
    correctly assumes that irq is enabled on invocation and
    unconditionally disables and re-enables it.

    Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
    enabling irq unexpectedly during early boot.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Qiao Zhou
    Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
    Signed-off-by: Rafael J. Wysocki

    Tejun Heo
     

18 Aug, 2016

1 commit


16 Aug, 2016

1 commit

  • rtree_next_node() walks the linked list of leaf nodes to find the next
    block of pages in the struct memory_bitmap. If it walks off the end of
    the list of nodes, it walks the list of memory zones to find the next
    region of memory. If it walks off the end of the list of zones, it
    returns false.

    This leaves the struct bm_position's node and zone pointers pointing
    at their respective struct list_heads in struct mem_zone_bm_rtree.

    memory_bm_find_bit() uses struct bm_position's node and zone pointers
    to avoid walking lists and trees if the next bit appears in the same
    node/zone. It handles these values being stale.

    Swap rtree_next_node()s 'step then test' to 'test-next then step',
    this means if we reach the end of memory we return false and leave
    the node and zone pointers as they were.

    This fixes a panic on resume using AMD Seattle with 64K pages:
    [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
    [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
    [ 6.896453] PM: Using 3 thread(s) for decompression.
    [ 6.896453] PM: Loading and decompressing image data (5339 pages)...
    [ 7.318890] PM: Image loading progress: 0%
    [ 7.323395] Unable to handle kernel paging request at virtual address 00800040
    [ 7.330611] pgd = ffff000008df0000
    [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
    [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
    [ 7.349825] Modules linked in:
    [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
    [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
    [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
    [ 7.375020] PC is at set_bit+0x18/0x30
    [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
    [ 7.383362] pc : [] lr : [] pstate: 60000045
    [ 7.390743] sp : ffff8003c0283b00
    [ 7.473551]
    [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
    [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
    [ 7.800075] Call trace:
    [ 7.887097] [] set_bit+0x18/0x30
    [ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70
    [ 7.899172] [] snapshot_write_next+0x22c/0x47c
    [ 7.905166] [] load_image_lzo+0x754/0xa88
    [ 7.910725] [] swsusp_read+0x144/0x230
    [ 7.916025] [] load_image_and_restore+0x58/0x90
    [ 7.922105] [] software_resume+0x2f0/0x338
    [ 7.927752] [] do_one_initcall+0x38/0x11c
    [ 7.933314] [] kernel_init_freeable+0x14c/0x1ec
    [ 7.939395] [] kernel_init+0x10/0xfc
    [ 7.944520] [] ret_from_fork+0x10/0x40
    [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
    [ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
    [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
    struct rtree_node's addr as the node/zone pointers are corrupt after
    we walked off the end of the lists during mark_unsafe_pages().

    This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
    Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
    duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
    off the end of the memory bitmap.

    Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
    Signed-off-by: James Morse
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki

    James Morse
     

13 Aug, 2016

2 commits

  • * pm-sleep:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly

    * pm-cpufreq:
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Rafael J. Wysocki
     
  • Restore the processor state before calling any other functions to
    ensure per-CPU variables can be used with KASLR memory randomization.

    Tracing functions use per-CPU variables (GS based on x86) and one was
    called just before restoring the processor state fully. It resulted
    in a double fault when both the tracing & the exception handler
    functions tried to use a per-CPU variable.

    Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume)
    Reported-and-tested-by: Borislav Petkov
    Reported-by: Jiri Kosina
    Tested-by: Rafael J. Wysocki
    Tested-by: Jiri Kosina
    Signed-off-by: Thomas Garnier
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Thomas Garnier
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Pull power management updates from Rafael Wysocki:
    "Again, the majority of changes go into the cpufreq subsystem, but
    there are no big features this time. The cpufreq changes that stand
    out somewhat are the governor interface rework and improvements
    related to the handling of frequency tables. Apart from those, there
    are fixes and new device/CPU IDs in drivers, cleanups and an
    improvement of the new schedutil governor.

    Next, there are some changes in the hibernation core, including a fix
    for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
    during resume from hibernation, a few core improvements related to
    memory management during resume, a couple of additional debug features
    and cleanups.

    Finally, we have some fixes and cleanups in the devfreq subsystem,
    generic power domains framework improvements related to system
    suspend/resume, support for some new chips in intel_idle and in the
    power capping RAPL driver, a new version of the AnalyzeSuspend utility
    and some assorted fixes and cleanups.

    Specifics:

    - Rework the cpufreq governor interface to make it more
    straightforward and modify the conservative governor to avoid using
    transition notifications (Rafael Wysocki).

    - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).

    - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).

    - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).

    - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).

    - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).

    - Make intel_pstate update the cpu_frequency tracepoint even if the
    frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).

    - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).

    - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).

    - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).

    - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).

    - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan, Jan
    Beulich, Paul Gortmaker).

    - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handing of
    MSRs that may not be present (Jacob Pan, Xiaolong Wang).

    - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and a
    page fault leading to a kernel crash from it (Rafael Wysocki).

    - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).

    - Add debug features that should help to detect problems related to
    hibernation and resume from it (Rafael Wysocki, Chen Yu).

    - Clean up hibernation core somewhat (Rafael Wysocki).

    - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).

    - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).

    - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).

    - Update the AnalyzeSuspend script in the kernel sources to version
    4.2 (Todd Brandt).

    - Modify the generic power domains framework to make it handle system
    suspend/resume better (Ulf Hansson).

    - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).

    - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
    exynos-bus) and in the core, make some devfreq code explicitly
    non-modular and change some of it into tristate (Bartlomiej
    Zolnierkiewicz, Peter Chen, Paul Gortmaker).

    - Add DT support to the generic PM clocks management code and make it
    export some more symbols (Jon Hunter, Paul Gortmaker).

    - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).

    - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
    Shevchenko)"

    * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    PM / hibernate: Introduce test_resume mode for hibernation
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    PCI / PM: check all fields in pci_set_platform_pm()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    PM / tools: scripts: AnalyzeSuspend v4.2
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

22 Jul, 2016

1 commit

  • test_resume mode is to verify if the snapshot data
    written to swap device can be successfully restored
    to memory. It is useful to ease the debugging process
    on hibernation, since this mode can not only bypass
    the BIOSes/bootloader, but also the system re-initialization.

    To avoid the risk to break the filesystm on persistent storage,
    this patch resumes the image with tasks frozen.

    For example:
    echo test_resume > /sys/power/disk
    echo disk > /sys/power/state

    [ 187.306470] PM: Image saving progress: 70%
    [ 187.395298] PM: Image saving progress: 80%
    [ 187.476697] PM: Image saving progress: 90%
    [ 187.554641] PM: Image saving done.
    [ 187.558896] PM: Wrote 594600 kbytes in 0.90 seconds (660.66 MB/s)
    [ 187.566000] PM: S|
    [ 187.589742] PM: Basic memory bitmaps freed
    [ 187.594694] PM: Checking hibernation image
    [ 187.599865] PM: Image signature found, resuming
    [ 187.605209] PM: Loading hibernation image.
    [ 187.665753] PM: Basic memory bitmaps created
    [ 187.691397] PM: Using 3 thread(s) for decompression.
    [ 187.691397] PM: Loading and decompressing image data (148650 pages)...
    [ 187.889719] PM: Image loading progress: 0%
    [ 188.100452] PM: Image loading progress: 10%
    [ 188.244781] PM: Image loading progress: 20%
    [ 189.057305] PM: Image loading done.
    [ 189.068793] PM: Image successfully loaded

    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Chen Yu
    Signed-off-by: Rafael J. Wysocki

    Chen Yu
     

16 Jul, 2016

1 commit

  • On Intel hardware, native_play_dead() uses mwait_play_dead() by
    default and only falls back to the other methods if that fails.
    That also happens during resume from hibernation, when the restore
    (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
    except for the boot one offline.

    However, that is problematic, because the address passed to
    __monitor() in mwait_play_dead() is likely to be written to in the
    last phase of hibernate image restoration and that causes the "dead"
    CPU to start executing instructions again. Unfortunately, the page
    containing the address in that CPU's instruction pointer may not be
    valid any more at that point.

    First, that page may have been overwritten with image kernel memory
    contents already, so the instructions the CPU attempts to execute may
    simply be invalid. Second, the page tables previously used by that
    CPU may have been overwritten by image kernel memory contents, so the
    address in its instruction pointer is impossible to resolve then.

    A report from Varun Koyyalagunta and investigation carried out by
    Chen Yu show that the latter sometimes happens in practice.

    To prevent it from happening, temporarily change the smp_ops.play_dead
    pointer during resume from hibernation so that it points to a special
    "play dead" routine which uses hlt_play_dead() and avoids the
    inadvertent "revivals" of "dead" CPUs this way.

    A slightly unpleasant consequence of this change is that if the
    system is hibernated with one or more CPUs offline, it will generally
    draw more power after resume than it did before hibernation, because
    the physical state entered by CPUs via hlt_play_dead() is higher-power
    than the mwait_play_dead() one in the majority of cases. It is
    possible to work around this, but it is unclear how much of a problem
    that's going to be in practice, so the workaround will be implemented
    later if it turns out to be necessary.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
    Reported-by: Varun Koyyalagunta
    Original-by: Chen Yu
    Tested-by: Chen Yu
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Ingo Molnar

    Rafael J. Wysocki
     

10 Jul, 2016

5 commits


09 Jul, 2016

1 commit


08 Jul, 2016

1 commit


02 Jul, 2016

4 commits

  • One of the memory bitmaps used by the hibernation image restoration
    code is freed after the image has been loaded.

    That is not quite efficient, though, because the memory pages used
    for building that bitmap are known to be safe (ie. they were not
    used by the image kernel before hibernation) and the arch-specific
    code finalizing the image restoration may need them. In that case
    it needs to allocate those pages again via the memory management
    subsystem, check if they are really safe again by consulting the
    other bitmaps and so on.

    To avoid that, recycle those pages by putting them into the global
    list of known safe pages so that they can be given to the arch code
    right away when necessary.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Rework mark_unsafe_pages() to use a simpler method of clearing
    all bits in free_pages_map and to set the bits for the "unsafe"
    pages (ie. pages that were used by the image kernel before
    hibernation) with the help of duplicate_memory_bitmap().

    For this purpose, move the pfn_valid() check from mark_unsafe_pages()
    to unpack_orig_pfns() where the "unsafe" pages are discovered.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • The core image restoration code preallocates some safe pages
    (ie. pages that weren't used by the image kernel before hibernation)
    for future use before allocating the bulk of memory for loading the
    image data. Those safe pages are then freed so they can be allocated
    again (with the memory management subsystem's help). That's done to
    ensure that there will be enough safe pages for temporary data
    structures needed during image restoration.

    However, it is not really necessary to free those pages after they
    have been allocated. They can be added to the (global) list of
    safe pages right away and then picked up from there when needed
    without freeing.

    That reduces the overhead related to using safe pages, especially
    in the arch-specific code, so modify the code accordingly.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • If freezable workqueue aborts suspend flow, show
    workqueue state for debug purpose.

    Signed-off-by: Roger Lu
    Acked-by: Tejun Heo
    Signed-off-by: Rafael J. Wysocki

    Roger Lu
     

28 Jun, 2016

1 commit


26 Jun, 2016

1 commit

  • With the following fix:

    70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")

    ... there is no longer a problem with hibernation resuming a
    KASLR-booted kernel image, so remove the restriction.

    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jonathan Corbet
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Linux PM list
    Cc: Logan Gunthorpe
    Cc: Pavel Machek
    Cc: Peter Zijlstra
    Cc: Stephen Smalley
    Cc: Thomas Gleixner
    Cc: Yinghai Lu
    Cc: linux-doc@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160613221002.GA29719@www.outflux.net
    Signed-off-by: Ingo Molnar

    Kees Cook
     

25 Jun, 2016

1 commit

  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kenrel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
    (7) oom_killer_disable() returns while P1 not yet finished
    (8) P1 perform IO/interfere with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable after
    all the freezable kernel threads are frozen because the oom victim might
    depend on some of those kthreads to make a forward progress to exit so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    to not call exit_oom_victim() because then we would lose a guarantee of
    the OOM killer and oom_killer_disable forward progress because
    exit_mm->mmput might block and never call exit_oom_victim.

    It seems the easiest way forward is to workaround this race by calling
    try_to_freeze_tasks again after oom_killer_disable. This will make sure
    that all the tasks are frozen or it bails out.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko