Eric Lee / smarc-fsl-linux-kernel

25 Dec, 2016

1 commit

7c0f6ba68 Replace <asm/uaccess.h> with <linux/uaccess.h> globally ... Browse Code »

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
sed -i -e "s!$PATT!#include !" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-12-25 03:46:01 +0800

14 Dec, 2016

3 commits

c11a6cfb0 Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq ... Browse Code »

Pull workqueue updates from Tejun Heo:
"Mostly patches to initialize workqueue subsystem earlier and get rid
of keventd_up().

The patches were headed for the last merge cycle but got delayed due
to a bug found late minute, which is fixed now.

Also, to help debugging, destroy_workqueue() is more chatty now on a
sanity check failure."

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: move wq_numa_init() to workqueue_init()
workqueue: remove keventd_up()
debugobj, workqueue: remove keventd_up() usage
slab, workqueue: remove keventd_up() usage
power, workqueue: remove keventd_up() usage
tty, workqueue: remove keventd_up() usage
mce, workqueue: remove keventd_up() usage
workqueue: make workqueue available early during boot
workqueue: dump workqueue state on sanity check failures in destroy_workqueue()

Linus Torvalds
2016-12-14 04:59:57 +0800
7b9dc3f75 Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull power management updates from Rafael Wysocki:
"Again, cpufreq gets more changes than the other parts this time (one
new driver, one old driver less, a bunch of enhancements of the
existing code, new CPU IDs, fixes, cleanups)

There also are some changes in cpuidle (idle injection rework, a
couple of new CPU IDs, online/offline rework in intel_idle, fixes and
cleanups), in the generic power domains framework (mostly related to
supporting power domains containing CPUs), and in the Operating
Performance Points (OPP) library (mostly related to supporting devices
with multiple voltage regulators)

In addition to that, the system sleep state selection interface is
modified to make it easier for distributions with unchanged user space
to support suspend-to-idle as the default system suspend method, some
issues are fixed in the PM core, the latency tolerance PM QoS
framework is improved a bit, the Intel RAPL power capping driver is
cleaned up and there are some fixes and cleanups in the devfreq
subsystem

Specifics:

- New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
for it (Markus Mayer)

- Support for ARM Integrator/AP and Integrator/CP in the generic DT
cpufreq driver and elimination of the old Integrator cpufreq driver
(Linus Walleij)

- Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie,
Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

- cpufreq core fix to eliminate races that may lead to using inactive
policy objects and related cleanups (Rafael Wysocki)

- cpufreq schedutil governor update to make it use SCHED_FIFO kernel
threads (instead of regular workqueues) for doing delayed work (to
reduce the response latency in some cases) and related cleanups
(Viresh Kumar)

- New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

- cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
Viresh Kumar)

- Support for using generic cpufreq governors in the intel_pstate
driver (Rafael Wysocki)

- Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
Performance Preference/Energy Performance Bias) knobs in the
intel_pstate driver (Srinivas Pandruvada)

- New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

- intel_pstate driver modification to use the P-state selection
algorithm based on CPU load on platforms with the system profile in
the ACPI tables set to "mobile" (Srinivas Pandruvada)

- intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
Srinivas Pandruvada)

- cpufreq powernv driver updates including fast switching support
(for the schedutil governor), fixes and cleanus (Akshay Adiga,
Andrew Donnellan, Denis Kirjanov)

- acpi-cpufreq driver rework to switch it over to the new CPU
offline/online state machine (Sebastian Andrzej Siewior)

- Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
Prakash)

- Idle injection rework (to make it use the regular idle path instead
of a home-grown custom one) and related powerclamp thermal driver
updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
Siewior)

- New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
Shevchenko, Piotr Luc)

- intel_idle driver cleanups and switch over to using the new CPU
offline/online state machine (Anna-Maria Gleixner, Sebastian
Andrzej Siewior)

- cpuidle DT driver update to support suspend-to-idle properly
(Sudeep Holla)

- cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
Rafael Wysocki)

- Preliminary support for power domains including CPUs in the generic
power domains (genpd) framework and related DT bindings (Lina Iyer)

- Assorted fixes and cleanups in the generic power domains (genpd)
framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

- Preliminary support for devices with multiple voltage regulators
and related fixes and cleanups in the Operating Performance Points
(OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

- System sleep state selection interface rework to make it easier to
support suspend-to-idle as the default system suspend method
(Rafael Wysocki)

- PM core fixes and cleanups, mostly related to the interactions
between the system suspend and runtime PM frameworks (Ulf Hansson,
Sahitya Tummala, Tony Lindgren)

- Latency tolerance PM QoS framework imorovements (Andrew Lutomirski)

- New Knights Mill CPU ID for the Intel RAPL power capping driver
(Piotr Luc)

- Intel RAPL power capping driver fixes, cleanups and switch over to
using the new CPU offline/online state machine (Jacob Pan, Thomas
Gleixner, Sebastian Andrzej Siewior)

- Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

- Fix for false-positive KASAN warnings during resume from ACPI S3
(suspend-to-RAM) on x86 (Josh Poimboeuf)

- Memory map verification during resume from hibernation on x86 to
ensure a consistent address space layout (Chen Yu)

- Wakeup sources debugging enhancement (Xing Wei)

- rockchip-io AVS driver cleanup (Shawn Lin)"

* tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
devfreq: exynos: Don't use OPP structures outside of RCU locks
Documentation: intel_pstate: Document HWP energy/performance hints
cpufreq: intel_pstate: Support for energy performance hints with HWP
cpufreq: intel_pstate: Add locking around HWP requests
PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
PM / core: Fix bug in the error handling of async suspend
PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
PM / Domains: Fix compatible for domain idle state
PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
PM / OPP: Allow platform specific custom set_opp() callbacks
PM / OPP: Separate out _generic_set_opp()
PM / OPP: Add infrastructure to manage multiple regulators
PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
PM / OPP: Manage supply's voltage/current in a separate structure
PM / OPP: Don't use OPP structure outside of rcu protected section
PM / OPP: Reword binding supporting multiple regulators per device
PM / OPP: Fix incorrect cpu-supply property in binding
cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
..

Linus Torvalds
2016-12-14 02:41:53 +0800
36869cb93 Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.

The major parts of this pull request is:

- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.

- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.

- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.

- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.

- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.

- Multi-connection support for nbd. From Josef.

- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.

- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.

- Support for WRITE_ZEROES from Chaitanya.

- Lightnvm updates from Javier/Matias.

- Supoort for FC for the nvme-over-fabrics code. From James Smart.

- A bunch of fixes from a whole slew of people, too many to name
here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...

Linus Torvalds
2016-12-14 02:19:16 +0800

13 Dec, 2016

1 commit

631ddaba5 Merge branches 'pm-sleep' and 'powercap' ... Browse Code »

* pm-sleep:
PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
x86/suspend: fix false positive KASAN warning on suspend/resume
PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag
PM / sleep: System sleep state selection interface rework
PM / hibernate: Verify the consistent of e820 memory map by md5 digest

* powercap:
powercap / RAPL: Add Knights Mill CPUID
powercap/intel_rapl: fix and tidy up error handling
powercap/intel_rapl: Track active CPUs internally
powercap/intel_rapl: Cleanup duplicated init code
powercap/intel rapl: Convert to hotplug state machine
powercap/intel_rapl: Propagate error code when registration fails
powercap/intel_rapl: Add missing domain data update on hotplug

Rafael J. Wysocki
2016-12-13 03:46:35 +0800

22 Nov, 2016

2 commits

08b98d329 PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag ... Browse Code »

Modify the ACPI system sleep support setup code to select
suspend-to-idle as the default system sleep state if the
ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and the
default sleep state was not selected from the kernel command
line.

Signed-off-by: Rafael J. Wysocki
Tested-by: Mario Limonciello

Rafael J. Wysocki
2016-11-22 05:48:10 +0800
406e79385 PM / sleep: System sleep state selection interface rework ... Browse Code »

There are systems in which the platform doesn't support any special
sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only
available system sleep state. However, some user space frameworks
only use the "mem" and (sometimes) "standby" sleep state labels, so
the users of those systems need to modify user space in order to be
able to use system suspend at all and that may be a pain in practice.

Commit 0399d4db3edf (PM / sleep: Introduce command line argument for
sleep state enumeration) attempted to address this problem by adding
a command line argument to change the meaning of the "mem" string in
/sys/power/state to make it trigger suspend-to-idle (instead of
suspend-to-RAM).

However, there also are systems in which the platform does support
special sleep states, but suspend-to-idle is the preferred one anyway
(it even may save more energy than the platform-provided sleep states
in some cases) and the above commit doesn't help in those cases.

For this reason, rework the system sleep state selection interface
again (but preserve backwards compatibiliby). Namely, add a new
sysfs file, /sys/power/mem_sleep, that will control the system
suspend mode triggered by writing "mem" to /sys/power/state (in
analogy with what /sys/power/disk does for hibernation). Make it
select suspend-to-RAM ("deep" sleep) by default (if supported) and
fall back to suspend-to-idle ("s2idle") otherwise and add a new
command line argument, mem_sleep_default, allowing that default to
be overridden if need be.

At the same time, drop the relative_sleep_states command line
argument that doesn't make sense any more.

Signed-off-by: Rafael J. Wysocki
Tested-by: Mario Limonciello

Rafael J. Wysocki
2016-11-22 05:45:40 +0800

02 Nov, 2016

1 commit

ceb75787b PM / sleep: fix device reference leak in test_suspend ... Browse Code »

Make sure to drop the reference taken by class_find_device() after
opening the RTC device.

Fixes: 77437fd4e61f (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold
Signed-off-by: Rafael J. Wysocki

Johan Hovold
2016-11-02 12:10:04 +0800

01 Nov, 2016

1 commit

70fd76140 block,fs: use REQ_* flags directly ... Browse Code »

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-11-01 23:43:26 +0800

24 Oct, 2016

1 commit

1adb469b9 PM / suspend: Fix missing KERN_CONT for suspend message ... Browse Code »

Commit 4bcc595ccd80 (printk: reinstate KERN_CONT for printing
continuation lines) exposed a missing KERN_CONT from one of the
messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
after syncing the filesystems no longer appears as a continuation but
a new message with its own timestamp.

[ 9.259566] PM: Syncing filesystems ... [ 9.264119] done.

Fix this by adding the KERN_CONT log level for the 'done.' part of the
message seen after syncing filesystems. While we are at it, convert
these suspend printks to pr_info and pr_cont, respectively.

Signed-off-by: Jon Hunter
Signed-off-by: Rafael J. Wysocki

Jon Hunter
2016-10-24 20:38:02 +0800

20 Oct, 2016

1 commit

8bc4a0445 Merge branch 'for-4.9' into for-4.10 Browse Code »

Tejun Heo
2016-10-20 00:12:40 +0800

08 Oct, 2016

1 commit

7d2e7a22c oom, suspend: fix oom_killer_disable vs. pm suspend properly ... Browse Code »

Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race") has workaround an existing race between
oom_killer_disable and oom_reaper by adding another round of
try_to_freeze_tasks after the oom killer was disabled. This was the
easiest thing to do for a late 4.7 fix. Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer have to
call exit_oom_victim from the oom reaper because we have stable mm
available and hide the oom_reaped mm by MMF_OOM_SKIP flag. So let's
remove exit_oom_victim and the race described in the above commit
doesn't exist anymore if.

Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on a way to exit for an
unbounded amount of time). OOM killer can cope with that by checking mm
flags and move on to another victim but we cannot do the same for
oom_killer_disable as we would lose the guarantee of no further
interference of the victim with the rest of the system. What we can do
instead is to cap the maximum time the oom_killer_disable waits for
victims. The only current user of this function (pm suspend) already
has a concept of timeout for back off so we can reuse the same value
there.

Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.

Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Tetsuo Handa
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-10-08 09:46:27 +0800

18 Sep, 2016

1 commit

a81f80f3e power, workqueue: remove keventd_up() usage ... Browse Code »

Now that workqueue can handle work item queueing/cancelling from very
early during boot, there is no need to gate cancel_delayed_work_sync()
while !keventd_up(). Remove it.

Signed-off-by: Tejun Heo
Acked-by: Rafael J. Wysocki
Cc: Qiao Zhou

Tejun Heo
2016-09-18 01:18:21 +0800

13 Sep, 2016

3 commits

1ad1410f6 PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO ... Browse Code »

PAGE_POISONING_ZERO disables zeroing new pages on alloc, they are
poisoned (zeroed) as they become available.
In the hibernate use case, free pages will appear in the system without
being cleared, left there by the loading kernel.

This patch will make sure free pages are cleared on resume when
PAGE_POISONING_ZERO is enabled. We free the pages just after resume
because we can't do it later: going through any device resume code might
allocate some memory and invalidate the free pages bitmap.

Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
enabled.

Signed-off-by: Anisse Astier
Reviewed-by: Kees Cook
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki

Anisse Astier
2016-09-13 08:35:27 +0800
fa7fd6fa3 PM / sleep: enable suspend-to-idle even without registered suspend_ops ... Browse Code »

Suspend-to-idle (aka the "freeze" sleep state) is a system sleep state
in which all of the processors enter deepest possible idle state and
wait for interrupts right after suspending all the devices.

There is no hard requirement for a platform to support and register
platform specific suspend_ops to enter suspend-to-idle/freeze state.
Only deeper system sleep states like PM_SUSPEND_STANDBY and
PM_SUSPEND_MEM rely on such low level support/implementation.

suspend-to-idle can be entered as along as all the devices can be
suspended. This patch enables the support for suspend-to-idle even on
systems that don't have any low level support for deeper system sleep
states and/or don't register any platform specific suspend_ops.

Signed-off-by: Sudeep Holla
Tested-by: Andy Gross
Signed-off-by: Rafael J. Wysocki

Sudeep Holla
2016-09-13 08:17:19 +0800
5b3f249c9 PM / sleep: Increase default DPM watchdog timeout to 120 ... Browse Code »

Recently we have a new report that, the harddisk can not
resume on time due to firmware issues, and got a kernel
panic because of DPM watchdog timeout. So adjust the
default timeout from 60 to 120 to survive on this platform,
and make DPM_WATCHDOG depending on EXPERT.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=117971
Suggested-by: Pavel Machek
Suggested-by: Rafael J. Wysocki
Reported-by: Higuita
Signed-off-by: Chen Yu
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki

Chen Yu
2016-09-13 08:15:58 +0800

05 Sep, 2016

1 commit

c86d06ba2 PM / QoS: avoid calling cancel_delayed_work_sync() during early boot ... Browse Code »

of_clk_init() ends up calling into pm_qos_update_request() very early
during boot where irq is expected to stay disabled.
pm_qos_update_request() uses cancel_delayed_work_sync() which
correctly assumes that irq is enabled on invocation and
unconditionally disables and re-enables it.

Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
enabling irq unexpectedly during early boot.

Signed-off-by: Tejun Heo
Reported-and-tested-by: Qiao Zhou
Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
Signed-off-by: Rafael J. Wysocki

Tejun Heo
2016-09-05 21:07:53 +0800

18 Aug, 2016

1 commit

6c16f42a4 Merge branch 'pm-sleep' ... Browse Code »

* pm-sleep:
PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
x86/power/64: Use __pa() for physical address computation
PM / sleep: Update some system sleep documentation

Rafael J. Wysocki
2016-08-18 09:27:08 +0800

16 Aug, 2016

1 commit

924d86967 PM / hibernate: Fix rtree_next_node() to avoid walking off list ends ... Browse Code »

rtree_next_node() walks the linked list of leaf nodes to find the next
block of pages in the struct memory_bitmap. If it walks off the end of
the list of nodes, it walks the list of memory zones to find the next
region of memory. If it walks off the end of the list of zones, it
returns false.

This leaves the struct bm_position's node and zone pointers pointing
at their respective struct list_heads in struct mem_zone_bm_rtree.

memory_bm_find_bit() uses struct bm_position's node and zone pointers
to avoid walking lists and trees if the next bit appears in the same
node/zone. It handles these values being stale.

Swap rtree_next_node()s 'step then test' to 'test-next then step',
this means if we reach the end of memory we return false and leave
the node and zone pointers as they were.

This fixes a panic on resume using AMD Seattle with 64K pages:
[ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
[ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
[ 6.896453] PM: Using 3 thread(s) for decompression.
[ 6.896453] PM: Loading and decompressing image data (5339 pages)...
[ 7.318890] PM: Image loading progress: 0%
[ 7.323395] Unable to handle kernel paging request at virtual address 00800040
[ 7.330611] pgd = ffff000008df0000
[ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
[ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 7.349825] Modules linked in:
[ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
[ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
[ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
[ 7.375020] PC is at set_bit+0x18/0x30
[ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
[ 7.383362] pc : [] lr : [] pstate: 60000045
[ 7.390743] sp : ffff8003c0283b00
[ 7.473551]
[ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
[ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
[ 7.800075] Call trace:
[ 7.887097] [] set_bit+0x18/0x30
[ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70
[ 7.899172] [] snapshot_write_next+0x22c/0x47c
[ 7.905166] [] load_image_lzo+0x754/0xa88
[ 7.910725] [] swsusp_read+0x144/0x230
[ 7.916025] [] load_image_and_restore+0x58/0x90
[ 7.922105] [] software_resume+0x2f0/0x338
[ 7.927752] [] do_one_initcall+0x38/0x11c
[ 7.933314] [] kernel_init_freeable+0x14c/0x1ec
[ 7.939395] [] kernel_init+0x10/0xfc
[ 7.944520] [] ret_from_fork+0x10/0x40
[ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
[ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
[ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
struct rtree_node's addr as the node/zone pointers are corrupt after
we walked off the end of the lists during mark_unsafe_pages().

This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
off the end of the memory bitmap.

Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
Signed-off-by: James Morse
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki

James Morse
2016-08-16 19:16:36 +0800

13 Aug, 2016

2 commits

0aeeb3e73 Merge branches 'pm-sleep' and 'pm-cpufreq' ... Browse Code »

* pm-sleep:
PM / hibernate: Restore processor state before using per-CPU variables
x86/power/64: Always create temporary identity mapping correctly

* pm-cpufreq:
cpufreq: powernv: Fix crash in gpstate_timer_handler()

Rafael J. Wysocki
2016-08-13 04:53:58 +0800
62822e2ec PM / hibernate: Restore processor state before using per-CPU variables ... Browse Code »

Restore the processor state before calling any other functions to
ensure per-CPU variables can be used with KASLR memory randomization.

Tracing functions use per-CPU variables (GS based on x86) and one was
called just before restoring the processor state fully. It resulted
in a double fault when both the tracing & the exception handler
functions tried to use a per-CPU variable.

Fixes: bb3632c6101b (PM / sleep: trace events for suspend/resume)
Reported-and-tested-by: Borislav Petkov
Reported-by: Jiri Kosina
Tested-by: Rafael J. Wysocki
Tested-by: Jiri Kosina
Signed-off-by: Thomas Garnier
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki

Thomas Garnier
2016-08-13 04:50:42 +0800

29 Jul, 2016

1 commit

599d0c954 mm, vmscan: move LRU lists to node ... Browse Code »

This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate with the potential corner case is that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hillf Danton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2016-07-29 07:07:41 +0800

27 Jul, 2016

2 commits

6453dbdda Merge tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull power management updates from Rafael Wysocki:
"Again, the majority of changes go into the cpufreq subsystem, but
there are no big features this time. The cpufreq changes that stand
out somewhat are the governor interface rework and improvements
related to the handling of frequency tables. Apart from those, there
are fixes and new device/CPU IDs in drivers, cleanups and an
improvement of the new schedutil governor.

Next, there are some changes in the hibernation core, including a fix
for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
during resume from hibernation, a few core improvements related to
memory management during resume, a couple of additional debug features
and cleanups.

Finally, we have some fixes and cleanups in the devfreq subsystem,
generic power domains framework improvements related to system
suspend/resume, support for some new chips in intel_idle and in the
power capping RAPL driver, a new version of the AnalyzeSuspend utility
and some assorted fixes and cleanups.

Specifics:

- Rework the cpufreq governor interface to make it more
straightforward and modify the conservative governor to avoid using
transition notifications (Rafael Wysocki).

- Rework the handling of frequency tables by the cpufreq core to make
it more efficient (Viresh Kumar).

- Modify the schedutil governor to reduce the number of wakeups it
causes to occur in cases when the CPU frequency doesn't need to be
changed (Steve Muckle, Viresh Kumar).

- Fix some minor issues and clean up code in the cpufreq core and
governors (Rafael Wysocki, Viresh Kumar).

- Add Intel Broxton support to the intel_pstate driver (Srinivas
Pandruvada).

- Fix problems related to the config TDP feature and to the validity
of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
Srinivas Pandruvada).

- Make intel_pstate update the cpu_frequency tracepoint even if the
frequency doesn't change to avoid confusing powertop (Rafael
Wysocki).

- Clean up the usage of __init/__initdata in intel_pstate, mark some
of its internal variables as __read_mostly and drop an unused
structure element from it (Jisheng Zhang, Carsten Emde).

- Clean up the usage of some duplicate MSR symbols in intel_pstate
and turbostat (Srinivas Pandruvada).

- Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
Adiga, Viresh Kumar, Ben Dooks).

- Fix a regression (introduced during the 4.5 cycle) in the
pcc-cpufreq driver by reverting the problematic commit (Andreas
Herrmann).

- Add support for Intel Denverton to intel_idle, clean up Broxton
support in it and make it explicitly non-modular (Jacob Pan, Jan
Beulich, Paul Gortmaker).

- Add support for Denverton and Ivy Bridge server to the Intel RAPL
power capping driver and make it more careful about the handing of
MSRs that may not be present (Jacob Pan, Xiaolong Wang).

- Fix resume from hibernation on x86-64 by making the CPU offline
during resume avoid using MONITOR/MWAIT in the "play dead" loop
which may lead to an inadvertent "revival" of a "dead" CPU and a
page fault leading to a kernel crash from it (Rafael Wysocki).

- Make memory management during resume from hibernation more
straightforward (Rafael Wysocki).

- Add debug features that should help to detect problems related to
hibernation and resume from it (Rafael Wysocki, Chen Yu).

- Clean up hibernation core somewhat (Rafael Wysocki).

- Prevent KASAN from instrumenting the hibernation core which leads
to large numbers of false-positives from it (James Morse).

- Prevent PM (hibernate and suspend) notifiers from being called
during the cleanup phase if they have not been called during the
corresponding preparation phase which is possible if one of the
other notifiers returns an error at that time (Lianwei Wang).

- Improve suspend-related debug printout in the tasks freezer and
clean up suspend-related console handling (Roger Lu, Borislav
Petkov).

- Update the AnalyzeSuspend script in the kernel sources to version
4.2 (Todd Brandt).

- Modify the generic power domains framework to make it handle system
suspend/resume better (Ulf Hansson).

- Make the runtime PM framework avoid resuming devices synchronously
when user space changes the runtime PM settings for them and
improve its error reporting (Rafael Wysocki, Linus Walleij).

- Fix error paths in devfreq drivers (exynos, exynos-ppmu,
exynos-bus) and in the core, make some devfreq code explicitly
non-modular and change some of it into tristate (Bartlomiej
Zolnierkiewicz, Peter Chen, Paul Gortmaker).

- Add DT support to the generic PM clocks management code and make it
export some more symbols (Jon Hunter, Paul Gortmaker).

- Make the PCI PM core code slightly more robust against possible
driver errors (Andy Shevchenko).

- Make it possible to change DESTDIR and PREFIX in turbostat (Andy
Shevchenko)"

* tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
PM / hibernate: Introduce test_resume mode for hibernation
cpufreq: export cpufreq_driver_resolve_freq()
cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
PCI / PM: check all fields in pci_set_platform_pm()
cpufreq: acpi-cpufreq: use cached frequency mapping when possible
cpufreq: schedutil: map raw required frequency to driver frequency
cpufreq: add cpufreq_driver_resolve_freq()
cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
intel_pstate: Update cpu_frequency tracepoint every time
cpufreq: intel_pstate: clean remnant struct element
PM / tools: scripts: AnalyzeSuspend v4.2
x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
cpufreq: powernv: Replacing pstate_id with frequency table index
intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
PM / hibernate: Image data protection during restoration
PM / hibernate: Add missing braces in __register_nosave_region()
PM / hibernate: Clean up comments in snapshot.c
PM / hibernate: Clean up function headers in snapshot.c
PM / hibernate: Add missing braces in hibernate_setup()
...

Linus Torvalds
2016-07-27 08:29:07 +0800
d05d7f407 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block ... Browse Code »

Pull core block updates from Jens Axboe:

- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts

- regression fix for the above for btrfs, from Vincent

- following up to the above, better packing of struct request from
Christoph

- a 2038 fix for blktrace from Arnd

- a few trivial/spelling fixes from Bart Van Assche

- a front merge check fix from Damien, which could cause issues on
SMR drives

- Atari partition fix from Gabriel

- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff

- CFQ priority boost fix idle classes, from me

- cleanup series from Ming, improving our bio/bvec iteration

- a direct issue fix for blk-mq from Omar

- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin

- expose DAX type internally and through sysfs. From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...

Linus Torvalds
2016-07-27 06:03:07 +0800

22 Jul, 2016

1 commit

fe12c00d2 PM / hibernate: Introduce test_resume mode for hibernation ... Browse Code »

test_resume mode is to verify if the snapshot data
written to swap device can be successfully restored
to memory. It is useful to ease the debugging process
on hibernation, since this mode can not only bypass
the BIOSes/bootloader, but also the system re-initialization.

To avoid the risk to break the filesystm on persistent storage,
this patch resumes the image with tasks frozen.

For example:
echo test_resume > /sys/power/disk
echo disk > /sys/power/state

[ 187.306470] PM: Image saving progress: 70%
[ 187.395298] PM: Image saving progress: 80%
[ 187.476697] PM: Image saving progress: 90%
[ 187.554641] PM: Image saving done.
[ 187.558896] PM: Wrote 594600 kbytes in 0.90 seconds (660.66 MB/s)
[ 187.566000] PM: S|
[ 187.589742] PM: Basic memory bitmaps freed
[ 187.594694] PM: Checking hibernation image
[ 187.599865] PM: Image signature found, resuming
[ 187.605209] PM: Loading hibernation image.
[ 187.665753] PM: Basic memory bitmaps created
[ 187.691397] PM: Using 3 thread(s) for decompression.
[ 187.691397] PM: Loading and decompressing image data (148650 pages)...
[ 187.889719] PM: Image loading progress: 0%
[ 188.100452] PM: Image loading progress: 10%
[ 188.244781] PM: Image loading progress: 20%
[ 189.057305] PM: Image loading done.
[ 189.068793] PM: Image successfully loaded

Suggested-by: Rafael J. Wysocki
Signed-off-by: Chen Yu
Signed-off-by: Rafael J. Wysocki

Chen Yu
2016-07-22 19:57:23 +0800

16 Jul, 2016

1 commit

406f992e4 x86 / hibernate: Use hlt_play_dead() when resuming from hibernation ... Browse Code »

On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.

However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again. Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.

First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid. Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.

A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.

To prevent it from happening, temporarily change the smp_ops.play_dead
pointer during resume from hibernation so that it points to a special
"play dead" routine which uses hlt_play_dead() and avoids the
inadvertent "revivals" of "dead" CPUs this way.

A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases. It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta
Original-by: Chen Yu
Tested-by: Chen Yu
Signed-off-by: Rafael J. Wysocki
Acked-by: Ingo Molnar

Rafael J. Wysocki
2016-07-16 04:42:48 +0800

10 Jul, 2016

5 commits

4c0b6c10f PM / hibernate: Image data protection during restoration ... Browse Code »

Make it possible to protect all pages holding image data during
hibernate image restoration by setting them read-only (so as to
catch attempts to write to those pages after image data have been
stored in them).

This adds overhead to image restoration code (it may cause large
page mappings to be split as a result of page flags changes) and
the errors it protects against should never happen in theory, so
the feature is only active after passing hibernate=protect_image
to the command line of the restore kernel.

Also it only is built if CONFIG_DEBUG_RODATA is set.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-10 08:12:10 +0800
d5f32af31 PM / hibernate: Add missing braces in __register_nosave_region() ... Browse Code »

One branch of an if/else statement in __register_nosave_region() is
formatted against the kernel coding style which causes the code to
look slightly odd. To fix that, add missing braces to it.

No functional changes.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-10 07:37:35 +0800
ef96f639e PM / hibernate: Clean up comments in snapshot.c ... Browse Code »

Many comments in kernel/power/snapshot.c do not follow the general
comment formatting rules. They look odd, some of them are outdated
too, some are hard to parse and generally difficult to understand.

Clean them up to make them easier to comprehend.

No functional changes.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-10 07:37:26 +0800
efd5a8524 PM / hibernate: Clean up function headers in snapshot.c ... Browse Code »

The formatting of some function headers in kernel/power/snapshot.c
is not consistent with the general kernel coding style and with the
formatting of some other function headers in the same file.

Make all of them follow the same formatting convention.

No functional changes.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-10 07:37:20 +0800
2f88e41a2 PM / hibernate: Add missing braces in hibernate_setup() ... Browse Code »

Make hibernate_setup() follow the coding style more closely by adding
some missing braces to the if () statement in it.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-10 07:37:13 +0800

09 Jul, 2016

1 commit

63f9ccb89 Merge back earlier suspend/hibernation changes for v4.8. Browse Code »

Rafael J. Wysocki
2016-07-09 05:14:17 +0800

08 Jul, 2016

1 commit

9e7f7f542 Merge branch 'x86/mm' into x86/boot, to pick up dependencies ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-07-08 23:27:47 +0800

02 Jul, 2016

4 commits

307c5971c PM / hibernate: Recycle safe pages after image restoration ... Browse Code »

One of the memory bitmaps used by the hibernation image restoration
code is freed after the image has been loaded.

That is not quite efficient, though, because the memory pages used
for building that bitmap are known to be safe (ie. they were not
used by the image kernel before hibernation) and the arch-specific
code finalizing the image restoration may need them. In that case
it needs to allocate those pages again via the memory management
subsystem, check if they are really safe again by consulting the
other bitmaps and so on.

To avoid that, recycle those pages by putting them into the global
list of known safe pages so that they can be given to the arch code
right away when necessary.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-02 07:52:10 +0800
6dbecfd34 PM / hibernate: Simplify mark_unsafe_pages() ... Browse Code »

Rework mark_unsafe_pages() to use a simpler method of clearing
all bits in free_pages_map and to set the bits for the "unsafe"
pages (ie. pages that were used by the image kernel before
hibernation) with the help of duplicate_memory_bitmap().

For this purpose, move the pfn_valid() check from mark_unsafe_pages()
to unpack_orig_pfns() where the "unsafe" pages are discovered.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-02 07:52:09 +0800
9c744481c PM / hibernate: Do not free preallocated safe pages during image restore ... Browse Code »

The core image restoration code preallocates some safe pages
(ie. pages that weren't used by the image kernel before hibernation)
for future use before allocating the bulk of memory for loading the
image data. Those safe pages are then freed so they can be allocated
again (with the memory management subsystem's help). That's done to
ensure that there will be enough safe pages for temporary data
structures needed during image restoration.

However, it is not really necessary to free those pages after they
have been allocated. They can be added to the (global) list of
safe pages right away and then picked up from there when needed
without freeing.

That reduces the overhead related to using safe pages, especially
in the arch-specific code, so modify the code accordingly.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2016-07-02 07:52:09 +0800
7b776af66 PM / suspend: show workqueue state in suspend flow ... Browse Code »

If freezable workqueue aborts suspend flow, show
workqueue state for debug purpose.

Signed-off-by: Roger Lu
Acked-by: Tejun Heo
Signed-off-by: Rafael J. Wysocki

Roger Lu
2016-07-02 07:42:48 +0800

28 Jun, 2016

1 commit

ea00f4f4f PM / sleep: make PM notifiers called symmetrically ... Browse Code »

This makes pm notifier PREPARE/POST symmetrical: if PREPARE
fails, we will only undo what ever happened on PREPARE.

It fixes the unbalanced CPU hotplug enable in CPU PM notifier.

Signed-off-by: Lianwei Wang
Signed-off-by: Rafael J. Wysocki

Lianwei Wang
2016-06-28 06:38:55 +0800

26 Jun, 2016

1 commit

65fe935dd x86/KASLR, x86/power: Remove x86 hibernation restrictions ... Browse Code »

With the following fix:

70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")

... there is no longer a problem with hibernation resuming a
KASLR-booted kernel image, so remove the restriction.

Signed-off-by: Kees Cook
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Jonathan Corbet
Cc: Len Brown
Cc: Linus Torvalds
Cc: Linux PM list
Cc: Logan Gunthorpe
Cc: Pavel Machek
Cc: Peter Zijlstra
Cc: Stephen Smalley
Cc: Thomas Gleixner
Cc: Yinghai Lu
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/20160613221002.GA29719@www.outflux.net
Signed-off-by: Ingo Molnar

Kees Cook
2016-06-26 18:32:03 +0800

25 Jun, 2016

1 commit

740705420 oom, suspend: fix oom_reaper vs. oom_killer_disable race ... Browse Code »

Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

(1) freeze_processes() starts freezing user space threads.
(2) Somebody (maybe a kenrel thread) calls out_of_memory().
(3) The OOM killer calls mark_oom_victim() on a user space thread
P1 which is already in __refrigerator().
(4) oom_killer_disable() sets oom_killer_disabled = true.
(5) P1 leaves __refrigerator() and enters do_exit().
(6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
exit_oom_victim(P1).
(7) oom_killer_disable() returns while P1 not yet finished
(8) P1 perform IO/interfere with the freezer.

This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make a forward progress to exit so
we could deadlock. It is also far from trivial to teach the oom_reaper
to not call exit_oom_victim() because then we would lose a guarantee of
the OOM killer and oom_killer_disable forward progress because
exit_mm->mmput might block and never call exit_oom_victim.

It seems the easiest way forward is to workaround this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Tetsuo Handa
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-06-25 08:23:52 +0800