Eric Lee / smarc-fsl-linux-kernel

06 Oct, 2016

2 commits

3ddf40e8c mm: filemap: fix mapping->nrpages double accounting in fuse

Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache() from
raw radix tree operations to page_cache_tree_insert() but didn't take
into account that the latter function, unlike the raw radix tree op,
handles mapping->nrpages. As a result, that counter is bumped for each
page replacement rather than balanced out even.

The mapping->nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, syncing inodes without pages, as well as
statistics for userspace. Since the error is positive, we'll do more
page cache tree walks than necessary; we won't miss a necessary one.
And we'll report more buffer pages to userspace than there are. The
error is limited to fuse inodes.
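
The double accounting can be sketched in a small userspace model. Everything here is hypothetical and only mirrors the description above: a tracked insert helper that bumps the counter itself, plus a leftover manual bump in the replacement path (the manual bump is an assumption inferred from the text, not copied from the kernel source).

```c
#include <assert.h>

/* Hypothetical stand-in for the mapping; only the counter matters here. */
struct mapping_model {
    long nrpages;   /* pages the mapping believes it holds */
};

/* Models the tracked insert helper: it accounts for the page itself. */
static void tracked_insert_model(struct mapping_model *m) { m->nrpages++; }

/* Models removal of the old page: the counter is dropped. */
static void tracked_delete_model(struct mapping_model *m) { m->nrpages--; }

/* Replacement with a leftover manual bump (assumed shape of the bug). */
static void replace_buggy_model(struct mapping_model *m)
{
    tracked_delete_model(m);
    tracked_insert_model(m);
    m->nrpages++;   /* counted a second time */
}

/* Replacement that relies on the tracked helpers alone. */
static void replace_fixed_model(struct mapping_model *m)
{
    tracked_delete_model(m);
    tracked_insert_model(m);
}

/* Run a number of replacements on a one-page mapping, return the counter. */
static long nrpages_after(int replacements, int buggy)
{
    struct mapping_model m = { .nrpages = 1 };

    for (int i = 0; i < replacements; i++) {
        if (buggy)
            replace_buggy_model(&m);
        else
            replace_fixed_model(&m);
    }
    return m.nrpages;
}
```

In this model the error only ever grows the counter, which matches the observation that extra tree walks are possible but a necessary walk is never skipped.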

Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner
Cc: Andrew Morton
Cc: Miklos Szeredi
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-10-06 00:17:56 +0800
d3798ae8c mm: filemap: don't plant shadow entries without radix tree node

When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.
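
A minimal sketch of the split counter and the underflow, assuming a hypothetical 7-bit page half (the names and the width are illustrative, not the kernel's definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low bits count pages, the bits above count shadows. */
#define COUNT_SHIFT_MODEL 7u
#define COUNT_MASK_MODEL  ((1u << COUNT_SHIFT_MODEL) - 1)

struct node_model {
    uint32_t count;
};

/* What a radix tree unaware of the split does for any new entry. */
static void plain_entry_inc(struct node_model *n) { n->count++; }

/* What the shadow accounting does when a shadow entry goes away. */
static void shadows_dec(struct node_model *n)
{
    n->count -= 1u << COUNT_SHIFT_MODEL;
}

static unsigned int pages_of(const struct node_model *n)
{
    return n->count & COUNT_MASK_MODEL;
}

static unsigned int shadows_of(const struct node_model *n)
{
    return n->count >> COUNT_SHIFT_MODEL;
}

/*
 * A shadow entry sitting directly in the root is copied into a new node
 * when the tree is extended and is counted as a plain entry. Removing it
 * later subtracts from the shadow half, which was never incremented.
 */
static struct node_model extend_then_remove_shadow(void)
{
    struct node_model n = { 0 };

    plain_entry_inc(&n);   /* extension bumps the page half */
    shadows_dec(&n);       /* removal underflows the shadow half */
    return n;
}
```

After this sequence the node is empty, yet its count is nonzero, so in the model it would never be seen as free.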

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner
Reported-and-tested-by: Linus Torvalds
Reviewed-by: Jan Kara
Cc: Andrew Morton
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-10-06 00:17:56 +0800

04 Oct, 2016

4 commits

597f03f9d Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:

- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.

- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.

- Handle unbalanced hotplug enable/disable calls more gracefully.

- Remove the obsolete CPU_STARTING/DYING notifier support.

- Convert another batch of notifier users.

The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.

The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...

Linus Torvalds
2016-10-04 10:43:08 +0800
8e4ef6386 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 vdso updates from Ingo Molnar:
"The main changes in this cycle centered around adding support for
32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
Safonov"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
x86/signal: Add SA_{X32,IA32}_ABI sa_flags
x86/ptrace: Down with test_thread_flag(TIF_IA32)
x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
x86/vdso: Replace calculate_addr in map_vdso() with addr
x86/vdso: Unmap vdso blob on vvar mapping failure

Linus Torvalds
2016-10-04 08:29:01 +0800
af79ad2b1 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler changes from Ingo Molnar:
"The main changes are:

- irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

- schedstat debugging enhancements, make it more broadly runtime
available. (Josh Poimboeuf)

- More work on asymmetric topology/capacity scheduling. (Morten
Rasmussen)

- sched/wait fixes and cleanups. (Oleg Nesterov)

- PELT (per entity load tracking) improvements. (Peter Zijlstra)

- Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

- sched/numa enhancements/fixes (Rik van Riel)

- sched/cputime scalability improvements (Stanislaw Gruszka)

- Load calculation arithmetics fixes. (Dietmar Eggemann)

- sched/deadline enhancements (Tommaso Cucinotta)

- Fix utilization accounting when switching to the SCHED_NORMAL
policy. (Vincent Guittot)

- ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched/irqtime: Consolidate irqtime flushing code
sched/irqtime: Consolidate accounting synchronization with u64_stats API
u64_stats: Introduce IRQs disabled helpers
sched/irqtime: Remove needless IRQs disablement on kcpustat update
sched/irqtime: No need for preempt-safe accessors
sched/fair: Fix min_vruntime tracking
sched/debug: Add SCHED_WARN_ON()
sched/core: Fix set_user_nice()
sched/fair: Introduce set_curr_task() helper
sched/core, ia64: Rename set_curr_task()
sched/core: Fix incorrect utilization accounting when switching to fair class
sched/core: Optimize SCHED_SMT
sched/core: Rewrite and improve select_idle_siblings()
sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
sched/core: Introduce 'struct sched_domain_shared'
sched/core: Restructure destroy_sched_domain()
sched/core: Remove unused @cpu argument from destroy_sched_domain*()
sched/wait: Introduce init_wait_entry()
sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
...

Linus Torvalds
2016-10-04 04:39:00 +0800
72ec94560 Merge tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
"Traditionally, cpufreq is the area with the greatest number of
changes, but there are fewer of them than last time. There also is
some activity in the generic power domains and the devfreq frameworks,
a couple of system suspend and hibernation fixes and some assorted
changes in other places.

One new feature is the cpufreq change to allow the scheduler to pass
hints to the governors' utilization update callbacks and some code
rework based on that. Another one is the support for domain removal in
the generic power domains framework. Also it is now possible to use
hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
RockChip DFI controller and the rk3399 DMC.

The rest of the changes is mostly fixes and cleanups in a number of
places.

Specifics:

- Add a mechanism for passing hints from the scheduler to cpufreq
governors via their utilization update callbacks and use it to
introduce "IOwait boosting" into the schedutil governor and
intel_pstate that will make them boost performance if the enqueued
task was previously waiting on I/O (Rafael Wysocki).

- Fix a schedutil governor problem that causes it to overestimate
utilization if SMT is in use (Steve Muckle).

- Update defconfigs trying to use the schedutil governor as a module
which is not possible any more (Javier Martinez Canillas).

- Update the intel_pstate's pstate_sample tracepoint to take "IOwait
boosting" into account (Srinivas Pandruvada).

- Fix a problem in the cpufreq core causing it to mishandle the
initialization of CPUs registered after the cpufreq driver (Viresh
Kumar, Rafael Wysocki).

- Make the cpufreq-dt driver support per-policy governor tunables,
clean it up and update its Kconfig description (Viresh Kumar).

- Add support for more ARM platforms to the cpufreq-dt driver
(Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

- Make the cpufreq CPPC driver report frequencies in KHz to avoid
user space compatibility issues (Al Stone, Hoan Tran).

- Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
Ian King, Markus Elfring).

- Constify some local structures in the intel_pstate driver (Julia
Lawall).

- Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

- Add support for PM domain removal to the generic power domains
(genpd) framework, add new DT helper functions to it and make it
always enable debugfs support if available (Jon Hunter, Tomeu
Vizoso).

- Clean up the generic power domains (genpd) framework and make it
avoid measuring power-on and power-off latencies during system-wide
PM transitions (Ulf Hansson).

- Add support for the RockChip DFI controller and the rk3399 DMC to
the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

- Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
Stephen Rothwell).

- Fix a minor issue in the exynos-ppmu devfreq driver and fix up
devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

- Fix the system suspend interface to make suspend-to-idle work if
platform suspend operations have not been registered (Sudeep
Holla).

- Make it possible to use hibernation with PAGE_POISONING_ZERO
enabled (Anisse Astier).

- Increase the default timeout of the system suspend/resume watchdog
and make it depend on EXPERT (Chen Yu).

- Make the operating performance points (OPP) framework avoid using
OPPs that aren't supported by the platform and fix a build warning
in it (Dave Gerlach, Arnd Bergmann).

- Fix the ARM cpuidle driver's return value (Christophe Jaillet).

- Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
common logging style (Joe Perches)"

* tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
PM / OPP: Don't support OPP if it provides supported-hw but platform does not
cpufreq: st: add missing \n to end of dev_err message
cpufreq: kirkwood: add missing \n to end of dev_err messages
PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
PM / Domains: Don't measure latency of ->power_on|off() during system PM
PM / Domains: Remove redundant system PM callbacks
PM / Domains: Simplify detaching a device from its genpd
PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
PM / OPP: avoid maybe-uninitialized warning
PM / Domains: Allow holes in genpd_data.domains array
cpufreq: CPPC: Avoid overflow when calculating desired_perf
cpufreq: ti: Use generic platdev driver
cpufreq: intel_pstate: Add io_boost trace
partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
cpufreq: schedutil: Add iowait boosting
cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
PM / Domains: Add support for removing nested PM domains by provider
PM / Domains: Add support for removing PM domains
...

Linus Torvalds
2016-10-04 00:33:40 +0800

03 Oct, 2016

1 commit

7af8a0f80 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
"It's a bit all over the place this time with no "killer feature" to
speak of. Support for mismatched cache line sizes should help people
seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
updates have been a long time coming, but a lot of the changes here
are cleanups.

We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
workaround is acked by Russell, the DT/OF bits are acked by Rob, the
arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
jump_label by Peter (all CC'd).

Summary:

- Support for execute-only page permissions
- Support for hibernate and DEBUG_PAGEALLOC
- Support for heterogeneous systems with mismatched cache line sizes
- Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
- arm64 PMU perf updates, including cpumasks for heterogeneous systems
- Set UTS_MACHINE for building rpm packages
- Yet another head.S tidy-up
- Some cleanups and refactoring, particularly in the NUMA code
- Lots of random, non-critical fixes across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
arm64: tlbflush.h: add __tlbi() macro
arm64: Kconfig: remove SMP dependence for NUMA
arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
arm64: fix dump_backtrace/unwind_frame with NULL tsk
arm/arm64: arch_timer: Use archdata to indicate vdso suitability
arm64: arch_timer: Work around QorIQ Erratum A-008585
arm64: arch_timer: Add device tree binding for A-008585 erratum
arm64: Correctly bounds check virt_addr_valid
arm64: migrate exception table users off module.h and onto extable.h
arm64: pmu: Hoist pmu platform device name
arm64: pmu: Probe default hw/cache counters
arm64: pmu: add fallback probe table
MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
arm64: Improve kprobes test for atomic sequence
arm64/kvm: use alternative auto-nop
arm64: use alternative auto-nop
arm64: alternative: add auto-nop infrastructure
arm64: lse: convert lse alternatives NOP padding to use __nops
arm64: barriers: introduce nops and __nops macros for NOP sequences
arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
...

Linus Torvalds
2016-10-03 23:58:35 +0800

02 Oct, 2016

1 commit

993eb0aea Merge branches 'pm-devfreq' and 'pm-sleep'

* pm-devfreq:
PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
Documentation: bindings: add dt documentation for rk3399 dmc
PM / devfreq: event: support rockchip dfi controller
Documentation: bindings: add dt documentation for dfi controller
PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
PM / devfreq: fix Kconfig indent style
PM / devfreq: Add COMPILE_TEST for build coverage
PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

* pm-sleep:
PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
PM / sleep: enable suspend-to-idle even without registered suspend_ops
PM / sleep: Increase default DPM watchdog timeout to 120

Rafael J. Wysocki
2016-10-02 07:43:45 +0800

01 Oct, 2016

1 commit

22f2ac51b mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()

Antonio reports the following crash when using fuse under memory pressure:

kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
invalid opcode: 0000 [#1] SMP
Modules linked in: all of them
CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
RIP: shadow_lru_isolate+0x181/0x190
Call Trace:
__list_lru_walk_one.isra.3+0x8f/0x130
list_lru_walk_one+0x23/0x30
scan_shadow_nodes+0x34/0x50
shrink_slab.part.40+0x1ed/0x3d0
shrink_zone+0x2ca/0x2e0
kswapd+0x51e/0x990
kthread+0xd8/0xf0
ret_from_fork+0x3f/0x70

which corresponds to the following sanity check in the shadow node
tracking:

BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
line checks whether there are real pages left that would interfere with
reclaim of the radix tree node under memory pressure.

While discussing ways how fuse might sneak pages into the radix tree
past the workingset code, Miklos pointed to replace_page_cache_page(),
and indeed there is a problem there: it properly accounts for the old
page being removed - __delete_from_page_cache() does that - but then
does a raw radix_tree_insert(), not accounting for the replacement
page. Eventually the page count bits in node->count underflow while
leaving the node incorrectly linked to the shadow node LRU.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert(). This fixes the page
accounting and makes sure page-containing nodes are properly unlinked
from the shadow node LRU again.
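
The mismatch between what a node's counter says and what the node really holds can be sketched as follows. This is a userspace model with hypothetical names; it only tracks the page half of the counter next to a ground-truth page count.

```c
#include <assert.h>

#define PAGE_HALF_MASK_MODEL 0x3fu   /* hypothetical width of the page half */

struct tracked_node_model {
    unsigned int count;   /* accounting as the workingset code sees it */
    int pages_present;    /* ground truth: pages really in the slots */
};

/* Accounted removal of the old page. */
static void delete_tracked(struct tracked_node_model *n)
{
    n->count--;
    n->pages_present--;
}

/* Accounted insertion of the replacement page. */
static void insert_tracked(struct tracked_node_model *n)
{
    n->count++;
    n->pages_present++;
}

/* Raw insertion: the page lands in the slot, the counter is not touched. */
static void insert_raw(struct tracked_node_model *n)
{
    n->pages_present++;
}

/* Counted pages minus real pages after one replacement; 0 means consistent. */
static int counted_minus_present_after_replace(int use_tracked_insert)
{
    struct tracked_node_model n = { .count = 1, .pages_present = 1 };

    delete_tracked(&n);
    if (use_tracked_insert)
        insert_tracked(&n);
    else
        insert_raw(&n);
    return (int)(n.count & PAGE_HALF_MASK_MODEL) - n.pages_present;
}
```

With the raw insert the node looks page-free while it holds a page, which is the state in which it can stay linked to the shadow node LRU by mistake.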

Also, make the sanity checks a bit less obscure by using the helpers for
checking the number of pages and shadows in a radix tree node.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Antonio SJ Musumeci
Debugged-by: Miklos Szeredi
Cc: [3.15+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-10-01 06:26:52 +0800

30 Sep, 2016

1 commit

536e0e81e Merge branch 'linus' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-09-30 16:44:27 +0800

29 Sep, 2016

2 commits

231e97e2b mem-hotplug: use nodes that contain memory as mask in new_node_page()

9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
prevents allocating from an empty nodemask, but as David points out, it is
still wrong. As node_online_map may include memoryless nodes, only
allocating from these nodes is meaningless.

This patch uses node_states[N_MEMORY] mask to prevent the above case.
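
A small bitmask sketch of why the choice of starting mask matters. The helper below is hypothetical and models only the rule from the earlier fix (do not clear the only node); it is not the kernel's new_node_page().

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t nodemask_model;   /* bit i set means node i is in the mask */

/*
 * Choose the nodes to allocate from when moving a page off node `nid`:
 * prefer every other candidate node, but keep the original mask if
 * removing `nid` would leave it empty.
 */
static nodemask_model alloc_mask_model(nodemask_model candidates, int nid)
{
    nodemask_model others = candidates & ~((nodemask_model)1 << nid);

    return others ? others : candidates;
}
```

Take node 0 with memory and node 1 online but memoryless. Starting from the online mask (0x3) the helper returns 0x2, i.e. only the memoryless node. Starting from the nodes that have memory (0x1) it falls back to node 0.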

Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong
Suggested-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Michal Hocko
Cc: John Allen
Cc: Xishi Qiu
Cc: Joonsoo Kim
Cc: Naoya Horiguchi
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zhong
2016-09-29 07:19:02 +0800
5b398e416 mm,ksm: fix endless looping in allocating memory when ksm enable

I hit the following hung task when running an OOM LTP test case with a 4.1
kernel.

Call trace:
__switch_to+0x74/0x8c
__schedule+0x23c/0x7bc
schedule+0x3c/0x94
rwsem_down_write_failed+0x214/0x350
down_write+0x64/0x80
__ksm_exit+0x90/0x19c
mmput+0x118/0x11c
do_exit+0x2dc/0xa74
do_group_exit+0x4c/0xe4
get_signal+0x444/0x5e0
do_signal+0x1d8/0x450
do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator:

ksm_do_scan
  scan_get_next_rmap_item
    down_read
    get_next_rmap_item
      alloc_rmap_item   # ksmd will loop permanently

There is no way forward because the oom victim cannot release any memory
in a 4.1-based kernel. Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.
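
The deadlock and the effect of giving up early can be modelled in userspace. The sketch below is a deterministic simulation with hypothetical names: memory only becomes available once the victim has exited, and the victim can only exit while the scanner does not hold the lock.

```c
#include <assert.h>

struct ksm_model {
    int lock_read_held;   /* scanner holds the lock for read */
    int victim_exited;    /* victim has released its memory */
};

/* Allocation succeeds only after the victim has freed its memory. */
static int alloc_item_model(const struct ksm_model *s)
{
    return s->victim_exited;
}

/*
 * Returns the scan pass on which an allocation succeeded, or -1 when the
 * scanner spins `spin_limit` times inside the allocator, which stands in
 * for looping forever.
 */
static int scan_passes_until_progress(int noretry, int spin_limit)
{
    struct ksm_model s = { 0, 0 };

    for (int pass = 1; pass <= spin_limit; pass++) {
        int ok, spins = 0;

        s.lock_read_held = 1;
        while (!(ok = alloc_item_model(&s))) {
            if (noretry)
                break;              /* fail fast instead of retrying */
            if (++spins >= spin_limit)
                return -1;          /* stuck with the lock held */
        }
        s.lock_read_held = 0;
        if (ok)
            return pass;
        s.victim_exited = 1;        /* lock dropped: the victim can exit */
    }
    return -1;
}
```

With the fail-fast behaviour the first pass gives up, the lock is dropped, and the second pass succeeds; with unbounded retries the model never gets past the first pass.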

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add __GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang
Suggested-by: Hugh Dickins
Suggested-by: Michal Hocko
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

zhong jiang
2016-09-29 07:19:01 +0800

26 Sep, 2016

1 commit

38e088546 mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing

The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
PMDs respectively as requiring balancing upon a subsequent page fault.
User-defined PROT_NONE memory regions which also have this flag set will
not normally invoke the NUMA balancing code as do_page_fault() will send
a segfault to the process before handle_mm_fault() is even called.

However if access_remote_vm() is invoked to access a PROT_NONE region of
memory, handle_mm_fault() is called via faultin_page() and
__get_user_pages() without any access checks being performed, meaning
the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
region.

A simple means of triggering this problem is to access PROT_NONE mmap'd
memory using /proc/self/mem which reliably results in the NUMA handling
functions being invoked when CONFIG_NUMA_BALANCING is set.
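
The access pattern described above can be sketched as a small userspace helper. This is not the repro code from the bugzilla entry, only an illustration of the two steps (a PROT_NONE mapping, then a read through /proc/self/mem); the function name is made up, and what the read returns depends on the kernel in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map one PROT_NONE page and try to read its first byte through
 * /proc/self/mem. Returns -1 if the setup (mmap or open) failed and 0
 * otherwise; *nread receives whatever pread() returned.
 */
static int read_prot_none_via_proc_mem(ssize_t *nread)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char byte = 0;
    void *addr;
    int fd;

    addr = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return -1;

    fd = open("/proc/self/mem", O_RDONLY);
    if (fd < 0) {
        munmap(addr, len);
        return -1;
    }

    /* The file offset in /proc/self/mem is the virtual address to read. */
    *nread = pread(fd, &byte, 1, (off_t)(uintptr_t)addr);

    close(fd);
    munmap(addr, len);
    return 0;
}
```

A direct load from the mapping would fault in the process itself; going through /proc/self/mem is what reaches the remote access path the text describes.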

This issue was reported in bugzilla (issue 99101) which includes some
simple repro code.

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
added at commit c0e7cad to avoid accidentally provoking strange
behaviour by attempting to apply NUMA balancing to pages that are in
fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

This patch moves the PROT_NONE check into mm/memory.c rather than
invoking BUG_ON() as faulting in these pages via faultin_page() is a
valid reason for reaching the NUMA check with the PROT_NONE page table
flag set and is therefore not always a bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
Reported-by: Trevor Saunders
Signed-off-by: Lorenzo Stoakes
Acked-by: Rik van Riel
Cc: Andrew Morton
Cc: Mel Gorman
Signed-off-by: Linus Torvalds

Lorenzo Stoakes
2016-09-26 06:43:42 +0800

25 Sep, 2016

4 commits

0f2657417 Merge branch 'hughd-fixes' (patches from Hugh Dickins)

Merge VM fixes from Hugh Dickins:
"I get the impression that Andrew is away or busy at the moment, so I'm
going to send you three independent uncontroversial little mm fixes
directly - though none is strictly a 4.8 regression fix.

- 'shmem: fix tmpfs to handle the huge= option properly' from Toshi
Kani is a one-liner to fix a major embarrassment in 4.8's hugepages
on tmpfs feature: although Hillf pointed it out in June, somehow
both Kirill and I repeatedly dropped the ball on this one. You
might wonder if the feature got tested at all with that bug in:
yes, it did, but for wider testing coverage, Kirill and I had each
relied too much on an override which bypasses that condition.

- 'huge tmpfs: fix Committed_AS leak' is just a run-of-the-mill accounting
fix in the same feature.

- 'mm: delete unnecessary and unsafe init_tlb_ubc()' is an unrelated
fix to 4.3's TLB flush batching in reclaim: the bug would be rare,
and none of us will be shamed if this one misses 4.8; but it got
such a quick ack from Mel today that I'm inclined to offer it along
with the first two"

* emailed patches from Hugh Dickins:
mm: delete unnecessary and unsafe init_tlb_ubc()
huge tmpfs: fix Committed_AS leak
shmem: fix tmpfs to handle the huge= option properly

Linus Torvalds
2016-09-25 02:31:45 +0800
b385d21f2 mm: delete unnecessary and unsafe init_tlb_ubc()

init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
initialized with zeroes in the init_task, and copied from parent to
child while it is quiescent in arch_dup_task_struct(); so I went to
delete it.

But I inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
check that it was always empty at that point, and found them firing:
because memcg reclaim can recurse into global reclaim (when allocating
biosets for swapout in my case), and arrive back at the init_tlb_ubc()
in shrink_node_memcg().

Resetting tlb_ubc.flush_required at that point is wrong: if the upper
level needs a deferred TLB flush, but the lower level turns out not to,
we miss a TLB flush. But fortunately, that's the only part of the
protocol that does not nest: with the initialization removed, cpumask
collects bits from upper and lower levels, and flushes TLB when needed.
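
The nesting problem can be sketched with a hypothetical model of the batch state. Only the flush_required reset named above is modelled; the struct and helpers are illustrative.

```c
#include <assert.h>

struct ubc_model {
    int flush_required;      /* a deferred TLB flush is pending */
    unsigned long cpumask;   /* CPUs that need the flush */
};

/* Record that an unmap on `cpu` needs a deferred flush. */
static void note_unmap(struct ubc_model *u, int cpu)
{
    u->flush_required = 1;
    u->cpumask |= 1ul << cpu;
}

/*
 * Lower level of reclaim. With `reinit` set it starts by resetting the
 * pending-flush flag, as the removed initialization did. A negative
 * `unmap_cpu` means this level unmaps nothing.
 */
static void inner_reclaim_model(struct ubc_model *u, int reinit, int unmap_cpu)
{
    if (reinit)
        u->flush_required = 0;
    if (unmap_cpu >= 0)
        note_unmap(u, unmap_cpu);
}

/* The upper level defers a flush, then recurses into a lower level. */
static int flush_pending_after_nesting(int reinit)
{
    struct ubc_model u = { 0, 0 };

    note_unmap(&u, 0);
    inner_reclaim_model(&u, reinit, -1);
    return u.flush_required;
}
```

With the reset in place the upper level's pending flush is lost; without it the state simply accumulates across levels.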

Fixes: 72b252aed506 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Hugh Dickins
Acked-by: Mel Gorman
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Linus Torvalds

Hugh Dickins
2016-09-25 02:20:01 +0800
71664665c huge tmpfs: fix Committed_AS leak

Under swapping load on huge tmpfs, /proc/meminfo's Committed_AS grows
bigger and bigger: just a cosmetic issue for most users, but disabling
for those who run without overcommit (/proc/sys/vm/overcommit_memory 2).

shmem_uncharge() was forgetting to unaccount __vm_enough_memory's
charge, and shmem_charge() was forgetting it on the filesystem-full
error path.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Signed-off-by: Linus Torvalds

Hugh Dickins
2016-09-25 02:20:01 +0800
3089bf614 shmem: fix tmpfs to handle the huge= option properly

shmem_get_unmapped_area() checks SHMEM_SB(sb)->huge incorrectly, which
leads to a reversed effect of "huge=" mount option.

Fix the check in shmem_get_unmapped_area().

Note, the default value of SHMEM_SB(sb)->huge remains as
SHMEM_HUGE_NEVER. Users will need to specify the "huge=" option to enable
huge page mappings.
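
A reversed check of this kind can be illustrated with a tiny model. The enum and the function are hypothetical, and the exact condition in shmem_get_unmapped_area() is not reproduced here; the sketch only shows what "reversed effect" means for the two settings.

```c
#include <assert.h>

/* Hypothetical subset of the huge= settings. */
enum huge_model { HUGE_NEVER_MODEL = 0, HUGE_ALWAYS_MODEL = 1 };

/*
 * Decide whether to attempt a huge-page-aligned mapping. With `reversed`
 * set the decision is inverted, as in the bug described above.
 */
static int tries_huge_alignment(enum huge_model huge, int reversed)
{
    int enabled = (huge != HUGE_NEVER_MODEL);

    return reversed ? !enabled : enabled;
}
```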

Reported-by: Hillf Danton
Signed-off-by: Toshi Kani
Acked-by: Kirill A. Shutemov
Reviewed-by: Aneesh Kumar K.V
Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds

Toshi Kani
2016-09-25 02:20:01 +0800

22 Sep, 2016

1 commit

50797851b Merge branch 'linus' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-09-22 20:49:40 +0800

21 Sep, 2016

1 commit

aa4f06011 mm: usercopy: Check for module addresses

While running a compile on arm64, I hit a memory exposure

usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:75!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
CPU: 0 PID: 19744 Comm: updatedb Tainted: G W 4.8.0-rc3-threadinfo+ #1
Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
task: fffffe03df944c00 task.stack: fffffe00d128c000
PC is at __check_object_size+0x70/0x3f0
LR is at __check_object_size+0x70/0x3f0
...
__check_object_size+0x70/0x3f0
filldir64+0x158/0x1a0
__fat_readdir+0x4a0/0x558 [fat]
fat_readdir+0x34/0x40 [fat]
iterate_dir+0x190/0x1e0
SyS_getdents64+0x88/0x120
el0_svc_naked+0x24/0x28

fffffc0000f3b1a8 is a module address. Modules may have compiled in
strings which could get copied to userspace. In this instance, it
looks like "." which matches with a size of 1 byte. Extend the
is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
all possible cases.
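
The difference between the two predicates can be sketched with made-up address ranges. The range boundaries below are assumptions chosen so that the address from the report falls into the module range; they are not the arm64 memory layout.

```c
#include <assert.h>
#include <stdint.h>

struct addr_range_model {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

/* Hypothetical ranges, for illustration only. */
static const struct addr_range_model modules_model = {
    0xfffffc0000000000ull, 0xfffffc0004000000ull
};
static const struct addr_range_model vmalloc_model = {
    0xfffffc0008000000ull, 0xfffffdff00000000ull
};

static int in_range_model(const struct addr_range_model *r, uint64_t addr)
{
    return addr >= r->start && addr < r->end;
}

/* The narrower check: vmalloc addresses only. */
static int is_vmalloc_addr_model(uint64_t addr)
{
    return in_range_model(&vmalloc_model, addr);
}

/* The extended check: vmalloc or module addresses. */
static int is_vmalloc_or_module_addr_model(uint64_t addr)
{
    return is_vmalloc_addr_model(addr) ||
           in_range_model(&modules_model, addr);
}
```

Under these assumed ranges the narrower predicate misses the reported address and the extended one catches it.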

Signed-off-by: Laura Abbott
Signed-off-by: Kees Cook

Laura Abbott
2016-09-21 07:07:39 +0800

20 Sep, 2016

6 commits

db2ba40c2 mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting

During cgroup2 rollout into production, we started encountering css
refcount underflows and css access crashes in the memory controller.
Splitting the heavily shared css reference counter into logical users
narrowed the imbalance down to the cgroup2 socket memory accounting.

The problem turns out to be the per-cpu charge cache. Cgroup1 had a
separate socket counter, but the new cgroup2 socket accounting goes
through the common charge path that uses a shared per-cpu cache for all
memory that is being tracked. Those caches are safe against scheduling
preemption, but not against interrupts - such as the newly added packet
receive path. When cache draining is interrupted by network RX taking
pages out of the cache, the resuming drain operation will put references
of in-use pages, thus causing the imbalance.

Disable IRQs during all per-cpu charge cache operations.
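
The imbalance can be reproduced in a deterministic userspace model. All names and numbers are hypothetical; the "interrupt" is just a call placed either in the middle of the drain or after it.

```c
#include <assert.h>

struct stock_model {
    int cached;   /* pre-charged pages sitting in the per-cpu cache */
    int in_use;   /* pages handed out to users */
    int refs;     /* references held; should equal cached + in_use */
};

/* Charge `n` pages: take them from the cache first, charge the rest anew. */
static void rx_charge_model(struct stock_model *s, int n)
{
    int from_cache = n < s->cached ? n : s->cached;

    s->cached -= from_cache;     /* cached pages already carry a reference */
    s->refs += n - from_cache;   /* fresh charges take fresh references */
    s->in_use += n;
}

/*
 * Drain the cache while a 2-page receive arrives. Without IRQ protection
 * it lands between reading the cached amount and dropping the references.
 * Returns refs - (cached + in_use); 0 means balanced.
 */
static int imbalance_after_drain(int irqs_disabled)
{
    struct stock_model s = { .cached = 4, .in_use = 0, .refs = 4 };
    int snapshot = s.cached;     /* drain reads how much to release */

    if (!irqs_disabled)
        rx_charge_model(&s, 2);  /* interrupt in the middle of the drain */

    s.refs -= snapshot;          /* drops references for the stale amount */
    s.cached = 0;

    if (irqs_disabled)
        rx_charge_model(&s, 2);  /* interrupt delayed until the drain is done */

    return s.refs - (s.cached + s.in_use);
}
```

In the unprotected case the drain puts references that now belong to in-use pages, leaving the model two references short.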

Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Tejun Heo
Cc: "David S. Miller"
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: [4.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-09-20 06:36:17 +0800
c8de641b1 mm: fix the page_swap_info() BUG_ON check

Commit 62c230bc1790 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages") replaced the
swap_aops dirty hook from __set_page_dirty_no_writeback() with
swap_set_page_dirty().

For normal cases without these special SWP flags, the code path falls
back to __set_page_dirty_no_writeback(), so the behaviour is expected to
be the same as before.

But swap_set_page_dirty() makes use of the page_swap_info() helper to
get the swap_info_struct to check for the flags like SWP_FILE,
SWP_BLKDEV etc as desired for those features. This helper has
BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
set_page_dirty_lock() path.

The set_page_dirty() path, however, often has to be called from irq
context, and kswapd() can toggle the flag behind its back while the call
is executing when the system is low on memory and heavy swapping is
ongoing.

This results in a kernel panic.

This patch moves the check out of the helper and into the users that
need it, which fixes the kernel panic for the path described above. A
couple of users of the helper already take care of the SwapCache
condition, so I skipped them.
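
The shape of that change can be sketched in plain userspace C. This is only a toy model: the toy_* types, the TOY_SWP_FILE flag and the counter are made up for illustration, and assert() stands in for the kernel's BUG_ON-style checks.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SWP_FILE 0x1u   /* hypothetical stand-in for SWP_FILE */

struct toy_swap_info { unsigned int flags; };

struct toy_page {
    bool swapcache;               /* models PageSwapCache(page) */
    struct toy_swap_info *sis;    /* swap area backing this page */
};

static int file_dirty_calls;      /* how often the file-backed path ran */

/* Helper after the change: a plain lookup with no swapcache assertion. */
static struct toy_swap_info *toy_page_swap_info(const struct toy_page *page)
{
    return page->sis;
}

/* Caller: the swapcache check sits only on the path that depends on it. */
static int toy_swap_set_page_dirty(struct toy_page *page)
{
    struct toy_swap_info *sis = toy_page_swap_info(page);

    if (sis->flags & TOY_SWP_FILE) {
        assert(page->swapcache);
        file_dirty_calls++;
        return 1;
    }
    /* Models the fallback to __set_page_dirty_no_writeback(). */
    return 0;
}
```

With the check in the caller, a page whose swapcache flag was toggled concurrently can still go through the common fallback path without tripping the assertion.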

Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
Signed-off-by: Santosh Shilimkar
Cc: Mel Gorman
Cc: Joe Perches
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: David S. Miller
Cc: Jens Axboe
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Al Viro
Cc: stable@vger.kernel.org [4.7.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Santosh Shilimkar
2016-09-20 06:36:17 +0800
4d35427ad mm: avoid endless recursion in dump_page() ... Browse Code »

dump_page() uses page_mapcount() to get the mapcount of the page.
page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)) because a mapcount
doesn't make sense for slab pages and the field in struct page is used
for other information.

This leads to recursion if dump_page() is called for a slab page and
DEBUG_VM is enabled:

dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...

Let's avoid calling page_mapcount() for slab pages in dump_page().
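
A minimal userspace sketch of the guard, assuming a toy page structure; the toy_* names and the depth counters are illustrative and are not the kernel code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct page; the fields are illustrative only. */
struct toy_page {
    bool slab;     /* models PageSlab(page) */
    int mapcount;  /* only meaningful for non-slab pages */
};

static int dump_depth;      /* current nesting of toy_dump_page() */
static int max_dump_depth;  /* deepest nesting seen so far */

static void toy_dump_page(const struct toy_page *page);

/* Models page_mapcount(): a slab page triggers a dump, as the
 * VM_BUG_ON_PAGE() in the real helper would. */
static int toy_page_mapcount(const struct toy_page *page)
{
    if (page->slab) {
        toy_dump_page(page);
        return 0;
    }
    return page->mapcount;
}

/* Models dump_page() with the guard: the mapcount of a slab page is
 * never requested, so the dump cannot re-enter itself. */
static void toy_dump_page(const struct toy_page *page)
{
    int mapcount;

    if (++dump_depth > max_dump_depth)
        max_dump_depth = dump_depth;

    mapcount = page->slab ? 0 : toy_page_mapcount(page);
    printf("page: slab=%d mapcount=%d\n", page->slab, mapcount);

    dump_depth--;
}
```

Dumping a slab page in this model never nests deeper than one call; without the slab test in toy_dump_page() the two functions would call each other without end.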

Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-09-20 06:36:16 +0800
982785c6b mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin() ... Browse Code »

Currently, khugepaged does not permit swapin unless there are enough
young pages in a THP. The problem is that when a THP does not have
enough young pages, khugepaged leaks mapped ptes.

This patch prohibits leaking mapped ptes.

Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz
Suggested-by: Andrea Arcangeli
Reviewed-by: Andrea Arcangeli
Reviewed-by: Rik van Riel
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Kirill A. Shutemov
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ebru Akagunduz
2016-09-20 06:36:16 +0800
c131f751a khugepaged: fix use-after-free in collapse_huge_page() ... Browse Code »

hugepage_vma_revalidate() re-checks, after mmap_sem has been re-acquired,
whether we should still try to collapse small pages into a huge one.

The problem Dmitry Vyukov reported[1] is that the vma found by
hugepage_vma_revalidate() can be suitable for huge pages, but not be the
same vma we had before dropping mmap_sem. Dereferencing the original
vma can then lead to a use-after-free.

Let's use the vma that hugepage_vma_revalidate() found instead of assuming
it's the same as what we had before the lock was dropped.
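
As a rough illustration of the pattern, here is a userspace sketch in which the revalidation step hands the freshly looked-up vma back through an out-parameter. The toy_* types and the return convention are assumptions made for this example, not the kernel interface.

```c
#include <stddef.h>

/* Toy stand-ins; names and fields are illustrative only. */
struct toy_vma {
    unsigned long start, end;  /* covers [start, end) */
    int huge_ok;               /* non-zero if suitable for huge pages */
};

struct toy_mm {
    struct toy_vma *vmas;
    size_t nr;
};

/* Look the address up again after the lock was re-taken. On success the
 * vma that was found is stored in *vmap, so the caller stops using
 * whatever pointer it held before the lock was dropped. */
static int toy_vma_revalidate(struct toy_mm *mm, unsigned long addr,
                              struct toy_vma **vmap)
{
    for (size_t i = 0; i < mm->nr; i++) {
        struct toy_vma *vma = &mm->vmas[i];

        if (addr >= vma->start && addr < vma->end) {
            if (!vma->huge_ok)
                return -1;
            *vmap = vma;
            return 0;
        }
    }
    return -1;
}
```

A caller would pass the address of its own vma pointer and only dereference that pointer after a successful return.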

[1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com

Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov
Reported-by: Dmitry Vyukov
Reviewed-by: Andrea Arcangeli
Cc: Ebru Akagunduz
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Vegard Nossum
Cc: Sasha Levin
Cc: Konstantin Khlebnikov
Cc: Andrey Ryabinin
Cc: Greg Thelen
Cc: Suleiman Souhlal
Cc: Hugh Dickins
Cc: David Rientjes
Cc: syzkaller
Cc: Kostya Serebryany
Cc: Alexander Potapenko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-09-20 06:36:16 +0800
9bb627be4 mem-hotplug: don't clear the only node in new_node_page() ... Browse Code »

Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.

In new_node_page(), the nid is cleared from the nodemask before calling
__alloc_pages_nodemask(). But if it is the only node of the system, and
the first round allocation fails, the next attempt will not be able to
get memory from an empty nodemask, and will trigger oom.

The patch checks whether it is the last node on the system, and if it
is, does not clear the nid in the nodemask.
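
The check can be modelled with a plain bitmask. This is a sketch under the assumption that a 64-bit word stands in for the kernel nodemask; toy_nodemask_t and both helpers are hypothetical names.

```c
#include <stdint.h>

typedef uint64_t toy_nodemask_t;  /* bit n set: node n may be used */

/* Number of nodes present in the mask. */
static int toy_nodes_weight(toy_nodemask_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit */
        n++;
    return n;
}

/* Mask to use when allocating a replacement page for memory being
 * taken offline on node nid: prefer other nodes, but keep nid when it
 * is the only node, so the mask never ends up empty. */
static toy_nodemask_t toy_fallback_mask(toy_nodemask_t nodes, int nid)
{
    if (toy_nodes_weight(nodes) > 1)
        nodes &= ~((toy_nodemask_t)1 << nid);
    return nodes;
}
```

On a single-node system the mask is returned unchanged, so the allocation can still fall back to the node that is being partially offlined.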

Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong
Reported-by: John Allen
Acked-by: Vlastimil Babka
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Li Zhong
2016-09-20 06:36:16 +0800

15 Sep, 2016

1 commit

2eefd8789 x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_* ... Browse Code »

Add an API to change the vdso blob type with arch_prctl.
As this is useful only for CRIU, expose
this interface under CONFIG_CHECKPOINT_RESTORE.

Signed-off-by: Dmitry Safonov
Acked-by: Andy Lutomirski
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner

Dmitry Safonov
2016-09-15 03:28:09 +0800

14 Sep, 2016

1 commit

d59dc7bcf sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags ... Browse Code »

Commit:

4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
found to introduce a regression with NUMA grouping.

It was followed up by these commits:

53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first two of those commits try alternate approaches to NUMA
grouping, which apparently do not work as well as looking at the PTE
write permissions.

The last of them preserves the PTE write permissions across a NUMA
protection fault. However, it forgets to revert the condition for
whether or not to group tasks together back to what it was before
v3.19, even though the information is now preserved in the page tables
once again.

This patch brings the NUMA grouping heuristic back to what it was
before commit 4d9424669946, which the changelogs of subsequent
commits suggest worked best.

We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: aarcange@redhat.com
Cc: linux-mm@kvack.org
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
Signed-off-by: Ingo Molnar

Rik van Riel
2016-09-14 02:31:33 +0800

13 Sep, 2016

1 commit

1ad1410f6 PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO ... Browse Code »

PAGE_POISONING_ZERO disables zeroing new pages on alloc; instead, they
are poisoned (zeroed) as they become available.
In the hibernate use case, free pages will appear in the system without
being cleared, left there by the loading kernel.

This patch will make sure free pages are cleared on resume when
PAGE_POISONING_ZERO is enabled. We free the pages just after resume
because we can't do it later: going through any device resume code might
allocate some memory and invalidate the free pages bitmap.

Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
enabled.

Signed-off-by: Anisse Astier
Reviewed-by: Kees Cook
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki

Anisse Astier
2016-09-13 08:35:27 +0800

11 Sep, 2016

1 commit

98ac9a608 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"nvdimm fixes for v4.8, two of them are tagged for -stable:

- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
DAX pmd mappings end up with an uncached pgprot, and unusable
performance for the device-dax interface. The device-dax interface
appeared in 4.7 so this is tagged for -stable.

- Fix a couple VM_BUG_ON() checks in the show_smaps() path to
understand DAX pmd entries. This fix is tagged for -stable.

- Fix a mis-merge of the nfit machine-check handler to flip the
polarity of an if() to match the final version of the patch that
Vishal sent for 4.8-rc1. Without this the nfit machine check
handler never detects / inserts new 'badblocks' entries which
applications use to identify lost portions of files.

- For test purposes, fix the nvdimm_clear_poison() path to operate on
legacy / simulated nvdimm memory ranges. Without this fix a test
can set badblocks, but never clear them on these ranges.

- Fix the range checking done by dax_dev_pmd_fault(). This is not
tagged for -stable since this problem is mitigated by specifying
aligned resources at device-dax setup time.

These patches have appeared in a next release over the past week. The
recent rebase you can see in the timestamps was to drop an invalid fix
as identified by the updated device-dax unit tests [1]. The -mm
touches have an ack from Andrew"

[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: allow legacy (e820) pmem region to clear bad blocks
nfit, mce: Fix SPA matching logic in MCE handler
mm: fix cache mode of dax pmd mappings
mm: fix show_smap() for zone_device-pmd ranges
dax: fix mapping size check

Linus Torvalds
2016-09-11 00:58:52 +0800

10 Sep, 2016

1 commit

ca120cf68 mm: fix show_smap() for zone_device-pmd ranges ... Browse Code »

Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:

kernel BUG at mm/huge_memory.c:1105!
task: ffff88045f16b140 task.stack: ffff88045be14000
RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
[..]
Call Trace:
[] smaps_pte_range+0xa0/0x4b0
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

kernel BUG at fs/proc/task_mmu.c:585!
RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
Call Trace:
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.

Cc: Ross Zwisler
Cc: Kirill A. Shutemov
Acked-by: Andrew Morton
Signed-off-by: Dan Williams

Dan Williams
2016-09-10 08:34:45 +0800

08 Sep, 2016

1 commit

8e1f74ea0 usercopy: remove page-spanning test for now ... Browse Code »

A custom allocator without __GFP_COMP that copies to userspace has been
found in vmw_execbuf_process[1], so this disables the page-span checker
by placing it behind a CONFIG option until such cases can be tracked
down in future work.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373326

Reported-by: Vinson Lee
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Kees Cook

Kees Cook
2016-09-08 02:33:26 +0800

07 Sep, 2016

3 commits

1d7ac6aec mm/writeback: Convert to hotplug state machine ... Browse Code »

Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Signed-off-by: Sebastian Andrzej Siewior
Cc: Peter Zijlstra
Cc: Jens Axboe
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner

Sebastian Andrzej Siewior
2016-09-07 00:30:20 +0800
a96a87bf9 slub: Convert to hotplug state machine ... Browse Code »

Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Pekka Enberg
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: David Rientjes
Cc: Christoph Lameter
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner

Sebastian Andrzej Siewior
2016-09-07 00:30:20 +0800
6731d4f12 slab: Convert to hotplug state machine ... Browse Code »

Install the callbacks via the state machine.

Signed-off-by: Richard Weinberger
Signed-off-by: Thomas Gleixner
Signed-off-by: Sebastian Andrzej Siewior
Reviewed-by: Sebastian Andrzej Siewior
Cc: Peter Zijlstra
Cc: Pekka Enberg
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
Cc: Christoph Lameter
Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
Signed-off-by: Thomas Gleixner

Sebastian Andrzej Siewior
2016-09-07 00:30:20 +0800

02 Sep, 2016

3 commits

c11600e4f mm, mempolicy: task->mempolicy must be NULL before dropping final reference ... Browse Code »

KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions. It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:

BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current
[..]

Set current->mempolicy to NULL before dropping the final reference.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes
Reported-by: Vegard Nossum
Acked-by: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Cc: stable@vger.kernel.org [4.6+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2016-09-02 08:52:01 +0800
6aa303def mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator ... Browse Code »

Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported
that multiple nodes may have no memory managed by the buddy allocator
but still return true for populated_zone().

Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") was reported to cause kswapd to spin at 100% CPU usage when
fadump was enabled. The old code happened to deal with the situation of
a populated node with zero free pages by coincidence, but the current
code tries to reclaim populated zones without realising that this is
impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.
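
The distinction between the two checks can be sketched as follows. The toy_zone structure and helper names are simplified stand-ins chosen for this example; only the two counters matter here.

```c
#include <stdbool.h>

/* Toy zone with just the two counters that matter for this example. */
struct toy_zone {
    unsigned long present_pages;  /* pages physically present */
    unsigned long managed_pages;  /* pages handed to the buddy allocator */
};

/* True if the zone has any memory at all, reserved or not. */
static bool toy_populated_zone(const struct toy_zone *zone)
{
    return zone->present_pages != 0;
}

/* True only if the buddy allocator actually manages pages in the zone. */
static bool toy_managed_zone(const struct toy_zone *zone)
{
    return zone->managed_pages != 0;
}

/* Count the zones that are worth considering for reclaim. */
static int toy_count_reclaim_zones(const struct toy_zone *zones, int nr)
{
    int count = 0;

    for (int i = 0; i < nr; i++)
        if (toy_managed_zone(&zones[i]))
            count++;
    return count;
}
```

A zone whose memory is entirely reserved, as in the fadump case, is populated but not managed, so a reclaim loop keyed on the managed check skips it instead of spinning on it.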

Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
Signed-off-by: Mel Gorman
Reported-by: Srikar Dronamraju
Tested-by: Srikar Dronamraju
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2016-09-02 08:52:01 +0800
6b4e3181d mm, oom: prevent premature OOM killer invocation for high order request ... Browse Code »

There have been several reports of premature OOM killer invocation
in the 4.7 kernel, where an order-2 allocation request (for the kernel
stack) invoked the OOM killer even during basic workloads (light IO or
even a kernel compile on some filesystems). In all reported cases the
memory is fragmented and there are no order-2+ pages available. There is
usually a large amount of slab memory (usually dentries/inodes) and
further debugging has shown that there are way too many unmovable blocks
which are skipped during compaction. Multiple reporters have confirmed
that the current linux-next, which includes [1] and [2], helped and the
OOMs are not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is reclaim progress and
we are not getting OOM for order-0 pages. We already do that for
CONFIG_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Tested-by: Olaf Hering
Tested-by: Ralf-Peter Rohbeck
Cc: Markus Trippelsdorf
Cc: Arkadiusz Miskiewicz
Cc: Ralf-Peter Rohbeck
Cc: Jiri Slaby
Cc: Vlastimil Babka
Cc: Joonsoo Kim
Cc: Tetsuo Handa
Cc: David Rientjes
Cc: stable@vger.kernel.org [4.7.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-09-02 08:52:01 +0800

27 Aug, 2016

3 commits

11bd969fd mm: silently skip readahead for DAX inodes ... Browse Code »

For DAX inodes we need to be careful to never have page cache pages in
the mapping->page_tree. This radix tree should be composed only of DAX
exceptional entries and zero pages.

ltp's readahead02 test was triggering a warning because we were trying
to insert a DAX exceptional entry but found that a page cache page had
already been inserted into the tree. This page was being inserted into
the radix tree in response to a readahead(2) call.

Readahead doesn't make sense for DAX inodes, but we don't want it to
report a failure either. Instead, we just return success and don't do
any work.
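
A toy model of that behaviour, assuming a simplified inode with a DAX flag and a page counter; the names are hypothetical and the real readahead paths are more involved:

```c
#include <stdbool.h>

/* Toy inode; the fields are illustrative only. */
struct toy_inode {
    bool dax;                    /* models IS_DAX(inode) */
    unsigned long cached_pages;  /* page cache pages in the mapping */
};

/* Models a readahead request: for a DAX inode, report success without
 * touching the mapping; otherwise pretend nr_pages were read in. */
static int toy_readahead(struct toy_inode *inode, unsigned long nr_pages)
{
    if (inode->dax)
        return 0;

    inode->cached_pages += nr_pages;
    return 0;
}
```

Both cases return 0, so callers see no failure, but only the non-DAX inode ends up with page cache pages.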

Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler
Reported-by: Jeff Moyer
Cc: Dan Williams
Cc: Dave Chinner
Cc: Dave Hansen
Cc: Jan Kara
Cc: stable@vger.kernel.org [4.5+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ross Zwisler
2016-08-27 08:39:35 +0800
358c07fcc mm: memcontrol: avoid unused function warning ... Browse Code »

A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.
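
The general pattern looks like this in a standalone file. TOY_CONFIG_SWAP and both function names are invented for the example and only mirror the structure of the fix.

```c
/* Hypothetical config switch; remove this line to build the variant
 * without swap support. */
#define TOY_CONFIG_SWAP 1

#ifdef TOY_CONFIG_SWAP
/* The static helper lives inside the same #ifdef as its only caller,
 * so it is never compiled as a defined-but-unused function. */
static int toy_id_get_online(int id)
{
    return id > 0 ? id : 1;  /* fall back to id 1 for invalid ids */
}

int toy_swapout(int id)
{
    return toy_id_get_online(id);
}
#else
/* Without swap support there is no caller, and no helper either. */
int toy_swapout(int id)
{
    (void)id;
    return 0;
}
#endif
```

If the helper were defined outside the #ifdef, building the no-swap variant with -Wunused-function would warn about it, and with -Werror the build would fail.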

Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2016-08-27 08:39:35 +0800
b32eaf71d mm: clarify COMPACTION Kconfig text ... Browse Code »

The current wording of the COMPACTION Kconfig help text doesn't
emphasise that disabling COMPACTION might cripple the page allocator,
which relies on compaction quite heavily for high order requests, and
that an unexpected OOM can happen without compaction. Make sure we
are vocal about that.

Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Cc: Markus Trippelsdorf
Cc: Mel Gorman
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-08-27 08:39:35 +0800