02 Nov, 2016

1 commit


24 Oct, 2016

1 commit

  • Commit 4bcc595ccd80 (printk: reinstate KERN_CONT for printing
    continuation lines) exposed a missing KERN_CONT from one of the
    messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
    after syncing the filesystems no longer appears as a continuation but
    a new message with its own timestamp.

    [ 9.259566] PM: Syncing filesystems ... [ 9.264119] done.

    Fix this by adding the KERN_CONT log level for the 'done.' part of the
    message seen after syncing filesystems. While we are at it, convert
    these suspend printks to pr_info and pr_cont, respectively.

    Signed-off-by: Jon Hunter
    Signed-off-by: Rafael J. Wysocki

    Jon Hunter
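
The continuation behaviour described in the commit above can be illustrated with a small user-space model. This is only a sketch, not kernel code: `LogBuffer`, `emit` and `sync_message` are hypothetical names, and the `cont` flag merely stands in for the role that KERN_CONT / pr_cont play in the message.

```python
class LogBuffer:
    """Toy model of a kernel-style log: each record has its own timestamp."""

    def __init__(self):
        self.records = []  # list of (timestamp, text)

    def emit(self, timestamp, text, cont=False):
        # A continuation is appended to the previous record; anything else
        # starts a new record carrying its own timestamp.
        if cont and self.records:
            prev_ts, prev_text = self.records[-1]
            self.records[-1] = (prev_ts, prev_text + text)
        else:
            self.records.append((timestamp, text))

    def lines(self):
        return [f"[{ts:.6f}] {text}" for ts, text in self.records]


def sync_message(mark_continuation):
    """Emit the two parts of the 'Syncing filesystems' message."""
    log = LogBuffer()
    log.emit(9.259566, "PM: Syncing filesystems ... ")
    log.emit(9.264119, "done.", cont=mark_continuation)
    return log.lines()
```

With `mark_continuation=False` the model produces two timestamped records, which mirrors the regression; with `True` it produces a single line.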
     

08 Oct, 2016

1 commit

  • Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have a stable mm
    available and hide the oom_reaped mm with the MMF_OOM_SKIP flag. So let's
    remove that call; the race described in the above commit then no
    longer exists.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    use case because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on its way to exit for an
    unbounded amount of time). The OOM killer can cope with that by checking
    mm flags and moving on to another victim, but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of a timeout for backing off, so we can reuse the same
    value there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
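
The idea of capping how long oom_killer_disable waits for victims can be sketched in user space as a bounded wait on a condition. This is a model under assumptions, not the kernel implementation: `OomState`, `victim_exited` and `disable` are hypothetical names, and re-enabling the killer when the wait times out is a choice made for this sketch rather than something stated in the commit message.

```python
import threading


class OomState:
    """Toy stand-in for the bookkeeping of outstanding OOM victims."""

    def __init__(self, victims=0):
        self._cond = threading.Condition()
        self.victims = victims
        self.disabled = False

    def victim_exited(self):
        # Called when a victim finally finishes exiting.
        with self._cond:
            self.victims -= 1
            self._cond.notify_all()

    def disable(self, timeout):
        """Disable the killer, then wait at most `timeout` seconds.

        Returns True if no victims remain. On timeout the killer is
        re-enabled (sketch assumption) and False is returned so the
        caller can back off instead of waiting forever.
        """
        with self._cond:
            self.disabled = True
            done = self._cond.wait_for(lambda: self.victims == 0, timeout)
            if not done:
                self.disabled = False
            return done
```

A caller modelling the suspend path would pass its existing back-off timeout as `timeout` and abort the suspend attempt when `disable` returns False.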
     

13 Sep, 2016

3 commits

  • PAGE_POISONING_ZERO disables zeroing of new pages on allocation;
    instead, pages are poisoned (zeroed) as they become available.
    In the hibernate use case, free pages will appear in the system without
    being cleared, left there by the loading kernel.

    This patch will make sure free pages are cleared on resume when
    PAGE_POISONING_ZERO is enabled. We free the pages just after resume
    because we can't do it later: going through any device resume code might
    allocate some memory and invalidate the free pages bitmap.

    Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
    enabled.

    Signed-off-by: Anisse Astier
    Reviewed-by: Kees Cook
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Anisse Astier
     
  • Suspend-to-idle (aka the "freeze" sleep state) is a system sleep state
    in which all of the processors enter the deepest possible idle state
    and wait for interrupts right after suspending all the devices.

    There is no hard requirement for a platform to support and register
    platform specific suspend_ops to enter suspend-to-idle/freeze state.
    Only deeper system sleep states like PM_SUSPEND_STANDBY and
    PM_SUSPEND_MEM rely on such low level support/implementation.

    Suspend-to-idle can be entered as long as all the devices can be
    suspended. This patch enables the support for suspend-to-idle even on
    systems that don't have any low level support for deeper system sleep
    states and/or don't register any platform specific suspend_ops.

    Signed-off-by: Sudeep Holla
    Tested-by: Andy Gross
    Signed-off-by: Rafael J. Wysocki

    Sudeep Holla
     
  • A recent report shows that a hard disk can fail to resume in time
    due to firmware issues, which triggers a kernel panic from the DPM
    watchdog timeout. Raise the default timeout from 60 to 120 seconds
    so that this platform survives, and make DPM_WATCHDOG depend on
    EXPERT.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=117971
    Suggested-by: Pavel Machek
    Suggested-by: Rafael J. Wysocki
    Reported-by: Higuita
    Signed-off-by: Chen Yu
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Chen Yu
     

05 Sep, 2016

1 commit

  • of_clk_init() ends up calling into pm_qos_update_request() very early
    during boot where irq is expected to stay disabled.
    pm_qos_update_request() uses cancel_delayed_work_sync() which
    correctly assumes that irq is enabled on invocation and
    unconditionally disables and re-enables it.

    Gate the cancel_delayed_work_sync() invocation with keventd_up() to
    avoid enabling irq unexpectedly during early boot.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Qiao Zhou
    Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
    Signed-off-by: Rafael J. Wysocki

    Tejun Heo
     

18 Aug, 2016

1 commit


16 Aug, 2016

1 commit

  • rtree_next_node() walks the linked list of leaf nodes to find the next
    block of pages in the struct memory_bitmap. If it walks off the end of
    the list of nodes, it walks the list of memory zones to find the next
    region of memory. If it walks off the end of the list of zones, it
    returns false.

    This leaves the struct bm_position's node and zone pointers pointing
    at their respective struct list_heads in struct mem_zone_bm_rtree.

    memory_bm_find_bit() uses struct bm_position's node and zone pointers
    to avoid walking lists and trees if the next bit appears in the same
    node/zone. It handles these values being stale.

    Swap rtree_next_node()'s 'step then test' to 'test-next then step'.
    This means that if we reach the end of memory we return false and
    leave the node and zone pointers as they were.

    This fixes a panic on resume using AMD Seattle with 64K pages:
    [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
    [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
    [ 6.896453] PM: Using 3 thread(s) for decompression.
    [ 6.896453] PM: Loading and decompressing image data (5339 pages)...
    [ 7.318890] PM: Image loading progress: 0%
    [ 7.323395] Unable to handle kernel paging request at virtual address 00800040
    [ 7.330611] pgd = ffff000008df0000
    [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
    [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
    [ 7.349825] Modules linked in:
    [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
    [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
    [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
    [ 7.375020] PC is at set_bit+0x18/0x30
    [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
    [ 7.383362] pc : [] lr : [] pstate: 60000045
    [ 7.390743] sp : ffff8003c0283b00
    [ 7.473551]
    [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
    [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
    [ 7.800075] Call trace:
    [ 7.887097] [] set_bit+0x18/0x30
    [ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70
    [ 7.899172] [] snapshot_write_next+0x22c/0x47c
    [ 7.905166] [] load_image_lzo+0x754/0xa88
    [ 7.910725] [] swsusp_read+0x144/0x230
    [ 7.916025] [] load_image_and_restore+0x58/0x90
    [ 7.922105] [] software_resume+0x2f0/0x338
    [ 7.927752] [] do_one_initcall+0x38/0x11c
    [ 7.933314] [] kernel_init_freeable+0x14c/0x1ec
    [ 7.939395] [] kernel_init+0x10/0xfc
    [ 7.944520] [] ret_from_fork+0x10/0x40
    [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
    [ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
    [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
    struct rtree_node's addr as the node/zone pointers are corrupt after
    we walked off the end of the lists during mark_unsafe_pages().

    This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
    Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
    duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
    off the end of the memory bitmap.

    Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
    Signed-off-by: James Morse
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki

    James Morse
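
The 'test-next then step' ordering from the rtree_next_node() fix above can be illustrated with a small cursor over nested lists. This is a user-space sketch only: `Cursor`, `next_node` and `walk` are hypothetical names, plain Python lists stand in for the kernel's linked lists of zones and leaf nodes, and every zone is assumed to hold at least one node.

```python
class Cursor:
    """Position inside a list of zones, each holding a list of leaf nodes."""

    def __init__(self, zones):
        self.zones = zones  # e.g. [["z0n0", "z0n1"], ["z1n0"]]
        self.zone = 0
        self.node = 0

    def current(self):
        return self.zones[self.zone][self.node]

    def next_node(self):
        """Advance to the next leaf; return False at the end of memory.

        The bounds are tested before stepping, so a False return leaves
        zone/node untouched and current() remains valid afterwards.
        """
        if self.node + 1 < len(self.zones[self.zone]):
            self.node += 1
            return True
        if self.zone + 1 < len(self.zones):
            self.zone += 1
            self.node = 0
            return True
        return False


def walk(zones):
    """Visit every leaf, then report where the cursor was left."""
    cur = Cursor(zones)
    seen = [cur.current()]
    while cur.next_node():
        seen.append(cur.current())
    return seen, cur.current()
```

After the walk finishes, the cursor still refers to the last real leaf, so a later lookup that reuses the cached position does not start from an invalid location.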
     

13 Aug, 2016

2 commits

  • * pm-sleep:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly

    * pm-cpufreq:
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Rafael J. Wysocki
     
  • Restore the processor state before calling any other functions to
    ensure per-CPU variables can be used with KASLR memory randomization.

    Tracing functions use per-CPU variables (GS based on x86) and one was
    called just before restoring the processor state fully. It resulted
    in a double fault when both the tracing & the exception handler
    functions tried to use a per-CPU variable.

    Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume)
    Reported-and-tested-by: Borislav Petkov
    Reported-by: Jiri Kosina
    Tested-by: Rafael J. Wysocki
    Tested-by: Jiri Kosina
    Signed-off-by: Thomas Garnier
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Thomas Garnier
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate. The potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Pull power management updates from Rafael Wysocki:
    "Again, the majority of changes go into the cpufreq subsystem, but
    there are no big features this time. The cpufreq changes that stand
    out somewhat are the governor interface rework and improvements
    related to the handling of frequency tables. Apart from those, there
    are fixes and new device/CPU IDs in drivers, cleanups and an
    improvement of the new schedutil governor.

    Next, there are some changes in the hibernation core, including a fix
    for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
    during resume from hibernation, a few core improvements related to
    memory management during resume, a couple of additional debug features
    and cleanups.

    Finally, we have some fixes and cleanups in the devfreq subsystem,
    generic power domains framework improvements related to system
    suspend/resume, support for some new chips in intel_idle and in the
    power capping RAPL driver, a new version of the AnalyzeSuspend utility
    and some assorted fixes and cleanups.

    Specifics:

    - Rework the cpufreq governor interface to make it more
    straightforward and modify the conservative governor to avoid using
    transition notifications (Rafael Wysocki).

    - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).

    - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).

    - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).

    - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).

    - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).

    - Make intel_pstate update the cpu_frequency tracepoint even if the
    frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).

    - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).

    - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).

    - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).

    - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).

    - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan, Jan
    Beulich, Paul Gortmaker).

    - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handling of
    MSRs that may not be present (Jacob Pan, Xiaolong Wang).

    - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and a
    page fault leading to a kernel crash from it (Rafael Wysocki).

    - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).

    - Add debug features that should help to detect problems related to
    hibernation and resume from it (Rafael Wysocki, Chen Yu).

    - Clean up hibernation core somewhat (Rafael Wysocki).

    - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).

    - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).

    - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).

    - Update the AnalyzeSuspend script in the kernel sources to version
    4.2 (Todd Brandt).

    - Modify the generic power domains framework to make it handle system
    suspend/resume better (Ulf Hansson).

    - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).

    - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
    exynos-bus) and in the core, make some devfreq code explicitly
    non-modular and change some of it into tristate (Bartlomiej
    Zolnierkiewicz, Peter Chen, Paul Gortmaker).

    - Add DT support to the generic PM clocks management code and make it
    export some more symbols (Jon Hunter, Paul Gortmaker).

    - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).

    - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
    Shevchenko)"

    * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    PM / hibernate: Introduce test_resume mode for hibernation
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    PCI / PM: check all fields in pci_set_platform_pm()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    PM / tools: scripts: AnalyzeSuspend v4.2
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

22 Jul, 2016

1 commit

  • The test_resume mode verifies that the snapshot data written to
    the swap device can be successfully restored to memory. It eases
    the debugging of hibernation, since this mode bypasses not only
    the BIOS/bootloader but also the system re-initialization.

    To avoid the risk of breaking the filesystem on persistent storage,
    this patch resumes the image with tasks frozen.

    For example:
    echo test_resume > /sys/power/disk
    echo disk > /sys/power/state

    [ 187.306470] PM: Image saving progress: 70%
    [ 187.395298] PM: Image saving progress: 80%
    [ 187.476697] PM: Image saving progress: 90%
    [ 187.554641] PM: Image saving done.
    [ 187.558896] PM: Wrote 594600 kbytes in 0.90 seconds (660.66 MB/s)
    [ 187.566000] PM: S|
    [ 187.589742] PM: Basic memory bitmaps freed
    [ 187.594694] PM: Checking hibernation image
    [ 187.599865] PM: Image signature found, resuming
    [ 187.605209] PM: Loading hibernation image.
    [ 187.665753] PM: Basic memory bitmaps created
    [ 187.691397] PM: Using 3 thread(s) for decompression.
    [ 187.691397] PM: Loading and decompressing image data (148650 pages)...
    [ 187.889719] PM: Image loading progress: 0%
    [ 188.100452] PM: Image loading progress: 10%
    [ 188.244781] PM: Image loading progress: 20%
    [ 189.057305] PM: Image loading done.
    [ 189.068793] PM: Image successfully loaded

    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Chen Yu
    Signed-off-by: Rafael J. Wysocki

    Chen Yu
     

16 Jul, 2016

1 commit

  • On Intel hardware, native_play_dead() uses mwait_play_dead() by
    default and only falls back to the other methods if that fails.
    That also happens during resume from hibernation, when the restore
    (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
    except for the boot one offline.

    However, that is problematic, because the address passed to
    __monitor() in mwait_play_dead() is likely to be written to in the
    last phase of hibernate image restoration and that causes the "dead"
    CPU to start executing instructions again. Unfortunately, the page
    containing the address in that CPU's instruction pointer may not be
    valid any more at that point.

    First, that page may have been overwritten with image kernel memory
    contents already, so the instructions the CPU attempts to execute may
    simply be invalid. Second, the page tables previously used by that
    CPU may have been overwritten by image kernel memory contents, so the
    address in its instruction pointer is impossible to resolve then.

    A report from Varun Koyyalagunta and investigation carried out by
    Chen Yu show that the latter sometimes happens in practice.

    To prevent it from happening, temporarily change the smp_ops.play_dead
    pointer during resume from hibernation so that it points to a special
    "play dead" routine which uses hlt_play_dead() and avoids the
    inadvertent "revivals" of "dead" CPUs this way.

    A slightly unpleasant consequence of this change is that if the
    system is hibernated with one or more CPUs offline, it will generally
    draw more power after resume than it did before hibernation, because
    the physical state entered by CPUs via hlt_play_dead() is higher-power
    than the mwait_play_dead() one in the majority of cases. It is
    possible to work around this, but it is unclear how much of a problem
    that's going to be in practice, so the workaround will be implemented
    later if it turns out to be necessary.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
    Reported-by: Varun Koyyalagunta
    Original-by: Chen Yu
    Tested-by: Chen Yu
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Ingo Molnar

    Rafael J. Wysocki
     

10 Jul, 2016

5 commits


09 Jul, 2016

1 commit


08 Jul, 2016

1 commit


02 Jul, 2016

4 commits

  • One of the memory bitmaps used by the hibernation image restoration
    code is freed after the image has been loaded.

    That is not quite efficient, though, because the memory pages used
    for building that bitmap are known to be safe (ie. they were not
    used by the image kernel before hibernation) and the arch-specific
    code finalizing the image restoration may need them. In that case
    it needs to allocate those pages again via the memory management
    subsystem, check if they are really safe again by consulting the
    other bitmaps and so on.

    To avoid that, recycle those pages by putting them into the global
    list of known safe pages so that they can be given to the arch code
    right away when necessary.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
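
The recycling of bitmap pages into the list of known safe pages, as described in the commit above, can be modelled in a few lines. This is a sketch under assumptions: `SafePagePool`, `recycle` and `get_safe_page` are hypothetical names, integers stand in for pages, and the `allocate_and_check` callback represents the slower path of allocating a page and re-checking it against the other bitmaps.

```python
class SafePagePool:
    """Toy model of a global list of pages known not to clash with the image."""

    def __init__(self):
        self.safe_pages = []      # pages that can be handed out immediately
        self.allocator_calls = 0  # how often the slow path was needed

    def recycle(self, pages):
        # Instead of freeing pages that backed a no-longer-needed bitmap,
        # keep them: they were already verified as safe.
        self.safe_pages.extend(pages)

    def get_safe_page(self, allocate_and_check):
        if self.safe_pages:
            return self.safe_pages.pop()
        # Slow path: allocate a fresh page and verify that it is safe.
        self.allocator_calls += 1
        return allocate_and_check()
```

In this model the arch-specific code would call `get_safe_page` and only reach the allocator once the recycled pages have been used up.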
     
  • Rework mark_unsafe_pages() to use a simpler method of clearing
    all bits in free_pages_map and to set the bits for the "unsafe"
    pages (ie. pages that were used by the image kernel before
    hibernation) with the help of duplicate_memory_bitmap().

    For this purpose, move the pfn_valid() check from mark_unsafe_pages()
    to unpack_orig_pfns() where the "unsafe" pages are discovered.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
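
The reworked flow from the commit above (validate page frame numbers while unpacking, then clear the map and copy the "unsafe" set into it) can be sketched as follows. This is a model, not the kernel code: Python sets stand in for the memory bitmaps, the function names are reused purely for readability, and raising ValueError on an invalid PFN is an assumption of this sketch.

```python
def unpack_orig_pfns(pfns, pfn_valid):
    """Collect the image kernel's PFNs, rejecting invalid ones up front."""
    unsafe = set()
    for pfn in pfns:
        if not pfn_valid(pfn):
            # Sketch assumption: an invalid PFN aborts the unpacking.
            raise ValueError(f"invalid pfn {pfn}")
        unsafe.add(pfn)
    return unsafe


def mark_unsafe_pages(free_pages_map, unsafe_pfns):
    """Reset free_pages_map, then duplicate the unsafe set into it."""
    free_pages_map.clear()              # clear every bit in one step
    free_pages_map.update(unsafe_pfns)  # copy the source "bitmap"
    return free_pages_map
```

Because validation happens while the PFNs are unpacked, the marking step no longer needs a per-page validity check and reduces to a clear followed by a copy.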
     
  • The core image restoration code preallocates some safe pages
    (ie. pages that weren't used by the image kernel before hibernation)
    for future use before allocating the bulk of memory for loading the
    image data. Those safe pages are then freed so they can be allocated
    again (with the memory management subsystem's help). That's done to
    ensure that there will be enough safe pages for temporary data
    structures needed during image restoration.

    However, it is not really necessary to free those pages after they
    have been allocated. They can be added to the (global) list of
    safe pages right away and then picked up from there when needed
    without freeing.

    That reduces the overhead related to using safe pages, especially
    in the arch-specific code, so modify the code accordingly.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • If a freezable workqueue aborts the suspend flow, show the
    workqueue state for debugging purposes.

    Signed-off-by: Roger Lu
    Acked-by: Tejun Heo
    Signed-off-by: Rafael J. Wysocki

    Roger Lu
     

28 Jun, 2016

1 commit


26 Jun, 2016

1 commit

  • With the following fix:

    70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")

    ... there is no longer a problem with hibernation resuming a
    KASLR-booted kernel image, so remove the restriction.

    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jonathan Corbet
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Linux PM list
    Cc: Logan Gunthorpe
    Cc: Pavel Machek
    Cc: Peter Zijlstra
    Cc: Stephen Smalley
    Cc: Thomas Gleixner
    Cc: Yinghai Lu
    Cc: linux-doc@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160613221002.GA29719@www.outflux.net
    Signed-off-by: Ingo Molnar

    Kees Cook
     

25 Jun, 2016

1 commit

  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kernel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim(P1).
    (7) oom_killer_disable() returns while P1 has not yet finished.
    (8) P1 performs IO or interferes with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable after
    all the freezable kernel threads are frozen because the oom victim might
    depend on some of those kthreads to make forward progress to exit, so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    to not call exit_oom_victim() because then we would lose a guarantee of
    the OOM killer and oom_killer_disable forward progress because
    exit_mm->mmput might block and never call exit_oom_victim.

    It seems the easiest way forward is to work around this race by calling
    try_to_freeze_tasks again after oom_killer_disable. This will make sure
    that all the tasks are frozen or it bails out.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Jun, 2016

1 commit


14 Jun, 2016

1 commit

  • Kasan causes the compiler to instrument C code and is used at runtime to
    detect accesses to memory that has been freed, or not yet allocated.

    The code in snapshot.c saves and restores memory when hibernating. This will
    access whole pages in the slab cache that have both free and allocated
    areas, resulting in a large number of false positives from Kasan.

    Disable instrumentation of this file.

    Signed-off-by: James Morse
    Acked-by: Catalin Marinas
    Signed-off-by: Rafael J. Wysocki

    James Morse
     

08 Jun, 2016

2 commits

  • Separate the op from the rq_flag_bits and have the pm code
    set/get the bio using bio_set_op_attrs/bio_op.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This makes callers of submit_bio/submit_bio_wait set bio->bi_rw
    themselves instead of passing it in as an argument. That matches
    how generic_make_request is used and how the other bio fields are set.

    Signed-off-by: Mike Christie

    Fixed up fs/ext4/crypto.c

    Signed-off-by: Jens Axboe

    Mike Christie
     

28 Apr, 2016

1 commit

  • Some architectures require code that was written to memory as if it
    were data to be 'cleaned' from any data caches before the processor
    can fetch it as new instructions.

    During resume from hibernate, the snapshot code copies some pages directly,
    meaning these architectures do not get a chance to perform their cache
    maintenance. Modify the read and decompress code to call
    flush_icache_range() on all pages that are restored, so that the restored
    in-place pages are guaranteed to be executable on these architectures.

    Signed-off-by: James Morse
    Acked-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Acked-by: Catalin Marinas
    [will: make clean_pages_on_* static and remove initialisers]
    Signed-off-by: Will Deacon

    James Morse
     

25 Mar, 2016

2 commits

  • Pull more power management and ACPI updates from Rafael Wysocki:
    "The second batch of power management and ACPI updates for v4.6.

    Included are fixups on top of the previous PM/ACPI pull request and
    other material that didn't make it into it but still should go into 4.6.

    Among other things, there's a fix for an intel_pstate driver issue
    uncovered by recent cpufreq changes, a workaround for a boot hang on
    Skylake-H related to the handling of deep C-states by the platform and
    a PCI/ACPI fix for the handling of IO port resources on non-x86
    architectures plus some new device IDs and similar.

    Specifics:

    - Fix for an intel_pstate driver issue related to the handling of MSR
    updates uncovered by the recent cpufreq rework (Rafael Wysocki).

    - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking fix
    for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

    - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).

    - intel_idle driver update preventing some Skylake-H systems from
    hanging during initialization by disabling deep C-states mishandled
    by the platform in the problematic configurations (Len Brown).

    - Intel Xeon Phi Processor x200 support for intel_idle
    (Dasaratharaman Chandramouli).

    - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the fallback
    C-state on x86 when they are set below its exit latency) and to
    restore the previous behavior of falling back to C1 if the next
    timer event is far enough in the future (that behavior was changed
    in 4.4, which led to an energy consumption regression) (Rik van
    Riel, Rafael Wysocki).

    - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).

    - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).

    - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

    - Fix for the ACPI backend of the generic device properties API to
    make it parse non-device (data node only) children of an ACPI
    device correctly (Irina Tirdea).

    - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).

    - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).

    - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven)"

    * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399
    intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
    intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate
    cpufreq: governor: Always schedule work on the CPU running update
    cpufreq: Always update current frequency before startig governor
    cpufreq: Introduce cpufreq_update_current_freq()
    cpufreq: Introduce cpufreq_start_governor()
    cpufreq: powernv: Add sysfs attributes to show throttle stats
    cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
    PCI: ACPI: IA64: fix IO port generic range check
    ACPI / util: cast data to u64 before shifting to fix sign extension
    cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
    cpuidle: menu: Fall back to polling if next timer event is near
    cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
    intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
    cpufreq: Make cpufreq_quick_get() safe to call
    ACPI / property: fix data node parsing in acpi_get_next_subnode()
    ACPI / APD: Add device HID for future AMD UART controller
    ...

    Linus Torvalds
     
  • * pm-avs:
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399

    * pm-clk:
    PM / clk: Add support for obtaining clocks from device-tree

    * pm-devfreq:
    PM / devfreq: Spelling s/frequnecy/frequency/

    * pm-sleep:
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate

    Rafael J. Wysocki
     

23 Mar, 2016

2 commits

  • When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Cc: 4.4+ # 4.4+
    Signed-off-by: Rafael J. Wysocki

    Lukas Wunner
     
  • Use the more common logging method with the eventual goal of removing
    pr_warning altogether.

    Miscellanea:

    - Realign arguments
    - Coalesce formats
    - Add missing space between a few coalesced formats

    Signed-off-by: Joe Perches
    Acked-by: Rafael J. Wysocki [kernel/power/suspend.c]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

17 Mar, 2016

1 commit

  • Pull power management and ACPI updates from Rafael Wysocki:
    "This time the majority of changes go into cpufreq and they are
    significant.

    First off, the way CPU frequency updates are triggered is different
    now. Instead of having to set up and manage a deferrable timer for
    each CPU in the system to evaluate and possibly change its frequency
    periodically, cpufreq governors set up callbacks to be invoked by the
    scheduler on a regular basis (basically on utilization updates). The
    "old" governors, "ondemand" and "conservative", still do all of their
    work in process context (although that is triggered by the scheduler
    now), but intel_pstate does it all in the callback invoked by the
    scheduler with no need for any additional asynchronous processing.

    Of course, this eliminates the overhead related to the management of
    all those timers, but also it allows the cpufreq governor code to be
    simplified quite a bit. On top of that, the common code and data
    structures used by the "ondemand" and "conservative" governors are
    cleaned up and made more straightforward and some long-standing and
    quite annoying problems are addressed. In particular, the handling of
    governor sysfs attributes is modified and the related locking becomes
    more fine grained which allows some concurrency problems to be avoided
    (particularly deadlocks with the core cpufreq code).

    In principle, the new mechanism for triggering frequency updates
    allows utilization information to be passed from the scheduler to
    cpufreq. Although the current code doesn't make use of it, in the
    works is a new cpufreq governor that will make decisions based on the
    scheduler's utilization data. That should allow the scheduler and
    cpufreq to work more closely together in the long run.

    In addition to the core and governor changes, cpufreq drivers are
    updated too. Fixes and optimizations go into intel_pstate, the
    cpufreq-dt driver is updated on top of some modification in the
    Operating Performance Points (OPP) framework and there are fixes and
    other updates in the powernv cpufreq driver.

    Apart from the cpufreq updates there is some new ACPICA material,
    including a fix for a problem introduced by previous ACPICA updates,
    and some less significant changes in the ACPI code, like CPPC code
    optimizations, ACPI processor driver cleanups and support for loading
    ACPI tables from initrd.

    Also updated are the generic power domains framework, the Intel RAPL
    power capping driver and the turbostat utility and we have a bunch of
    traditional assorted fixes and cleanups.

    Specifics:

    - Redesign of cpufreq governors and the intel_pstate driver to make
    them use callbacks invoked by the scheduler to trigger CPU
    frequency evaluation instead of using per-CPU deferrable timers for
    that purpose (Rafael Wysocki).

    - Reorganization and cleanup of cpufreq governor code to make it more
    straightforward and fix some concurrency problems in it (Rafael
    Wysocki, Viresh Kumar).

    - Cleanup and improvements of locking in the cpufreq core (Viresh
    Kumar).

    - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
    Kumar, Eric Biggers).

    - intel_pstate driver updates including fixes, optimizations and a
    modification to make it enable hardware-coordinated P-state
    selection (HWP) by default if supported by the processor (Philippe
    Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
    Franciosi).

    - Operating Performance Points (OPP) framework updates to improve its
    handling of voltage regulators and device clocks and updates of the
    cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

    - Updates of the powernv cpufreq driver to fix initialization and
    cleanup problems in it and correct its worker thread handling with
    respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
    Bhat).

    - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

    - ACPICA updates including one fix for a regression introduced by
    previous changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
    Colin Ian King).

    - Support for installing ACPI tables from initrd (Lv Zheng).

    - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
    Chaugule).

    - Support for _HID(ACPI0010) devices (ACPI processor containers) and
    ACPI processor driver cleanups (Sudeep Holla).

    - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
    Aleksey Makarov).

    - Modification of the ACPI PCI IRQ management code to make it treat
    255 in the Interrupt Line register as "not connected" on x86 (as
    per the specification) and avoid attempts to use that value as a
    valid interrupt vector (Chen Fan).

    - ACPI APEI fixes related to resource leaks (Josh Hunt).

    - Removal of modularity from a few ACPI drivers (BGRT, GHES,
    intel_pmic_crc) that cannot be built as modules in practice (Paul
    Gortmaker).

    - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
    as a valid resource type (Harb Abdulhamid).

    - New device ID (future AMD I2C controller) in the ACPI driver for
    AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

    - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

    - cpuidle menu governor optimization to avoid a square root
    computation in it (Rasmus Villemoes).

    - Fix for potential use-after-free in the generic device properties
    framework (Heikki Krogerus).

    - Updates of the generic power domains (genpd) framework including
    support for multiple power states of a domain, fixes and debugfs
    output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
    Geert Uytterhoeven).

    - Intel RAPL power capping driver updates to reduce IPI overhead in
    it (Jacob Pan).

    - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
    Sengar).

    - Year 2038 fix for the process freezer (Abhilash Jindal).

    - turbostat utility updates including new features (decoding of more
    registers and CPUID fields, sub-second intervals support, GFX MHz
    and RC6 printout, --out command line option), fixes (syscall jitter
    detection and workaround, reduction of the number of syscalls
    made, fixes related to Xeon x200 processors, compiler warning
    fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

    * tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
    tools/power turbostat: bugfix: TDP MSRs print bits fixing
    tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
    tools/power turbostat: call __cpuid() instead of __get_cpuid()
    tools/power turbostat: indicate SMX and SGX support
    tools/power turbostat: detect and work around syscall jitter
    tools/power turbostat: show GFX%rc6
    tools/power turbostat: show GFXMHz
    tools/power turbostat: show IRQs per CPU
    tools/power turbostat: make fewer systems calls
    tools/power turbostat: fix compiler warnings
    tools/power turbostat: add --out option for saving output in a file
    tools/power turbostat: re-name "%Busy" field to "Busy%"
    tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
    tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
    tools/power turbostat: allow sub-sec intervals
    ACPI / APEI: ERST: Fixed leaked resources in erst_init
    ACPI / APEI: Fix leaked resources
    intel_pstate: Do not skip samples partially
    intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
    intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
    ...

    Linus Torvalds