06 Oct, 2016

2 commits

  • Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
    caused by replace_page_cache_page()") switched replace_page_cache_page() from
    raw radix tree operations to page_cache_tree_insert() but didn't take
    into account that the latter function, unlike the raw radix tree op,
    handles mapping->nrpages. As a result, that counter is bumped for each
    page replacement rather than balanced out even.

    The mapping->nrpages counter is used to skip needless radix tree walks
    when invalidating, truncating, or syncing inodes without pages, as well
    as for statistics for userspace. Since the error is positive, we'll do more
    page cache tree walks than necessary; we won't miss a necessary one.
    And we'll report more buffer pages to userspace than there are. The
    error is limited to fuse inodes.
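
    The double accounting described above can be sketched in userspace.
    This is a minimal model, not kernel code: struct mapping_model and
    every function name are hypothetical, and it assumes the caller kept
    its own increment after switching to the tracked insert.

```c
#include <assert.h>

/* Hypothetical userspace model; not the kernel's struct address_space. */
struct mapping_model {
    unsigned long nrpages;   /* pages the mapping believes it holds */
};

/* Models a tracked insert, which bumps nrpages itself. */
static void tracked_insert(struct mapping_model *m)
{
    m->nrpages++;
}

/* Models removal of the old page, which drops nrpages. */
static void tracked_delete(struct mapping_model *m)
{
    assert(m->nrpages > 0);
    m->nrpages--;
}

/* Buggy replacement: the caller bumps nrpages again after the tracked insert. */
static void replace_page_buggy(struct mapping_model *m)
{
    tracked_delete(m);
    tracked_insert(m);
    m->nrpages++;            /* double accounting */
}

/* Balanced replacement: delete and insert cancel out. */
static void replace_page_fixed(struct mapping_model *m)
{
    tracked_delete(m);
    tracked_insert(m);
}
```

    In this model each buggy replacement leaves nrpages one too high,
    which is why the error only ever grows in the positive direction.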

    Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
    Signed-off-by: Johannes Weiner
    Cc: Andrew Morton
    Cc: Miklos Szeredi
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the underflow checks were added to workingset_node_shadow_dec(),
    they triggered immediately:

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
    soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

    This is a long-standing bug in the way shadow entries are accounted in
    the radix tree nodes. The shrinker needs to know when radix tree nodes
    contain only shadow entries, no pages, so node->count is split in half
    to count shadows in the upper bits and pages in the lower bits.

    Unfortunately, the radix tree implementation doesn't know of this and
    assumes all entries are in node->count. When there is a shadow entry
    directly in root->rnode and the tree is later extended, the radix tree
    implementation will copy that entry into the new node and bump its
    node->count, i.e. increases the page count bits. Once the shadow gets
    removed and we subtract from the upper counter, node->count underflows
    and triggers the warning. Afterwards, without node->count reaching 0
    again, the radix tree node is leaked.
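
    A minimal userspace sketch of the split counter and the underflow,
    assuming an illustrative 8-bit page field; COUNT_SHIFT, struct
    node_model and the function names are hypothetical, not the kernel's
    definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: low 8 bits count pages, the bits above count shadows. */
#define COUNT_SHIFT 8u
#define COUNT_MASK  ((1u << COUNT_SHIFT) - 1u)

struct node_model {
    unsigned int count;
};

static unsigned int node_pages(const struct node_model *n)
{
    return n->count & COUNT_MASK;
}

static unsigned int node_shadows(const struct node_model *n)
{
    return n->count >> COUNT_SHIFT;
}

/* Tree code unaware of the split: the copied shadow entry is counted as a page. */
static void tree_extend_copy_entry(struct node_model *n)
{
    n->count++;
}

/* Shadow removal subtracts from the upper counter; returns false on underflow. */
static bool shadow_dec(struct node_model *n)
{
    if (node_shadows(n) == 0)
        return false;        /* this is where an underflow check would fire */
    n->count -= 1u << COUNT_SHIFT;
    return true;
}
```

    After tree_extend_copy_entry() the shadow sits in the page bits, so
    shadow_dec() finds no shadow to subtract and the count never returns
    to zero.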

    Limit shadow entries to when we have actual radix tree nodes and can
    count them properly. That means we lose the ability to detect refaults
    from files that had only the first page faulted in at eviction time.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Linus Torvalds
    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Oct, 2016

4 commits

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     
  • Pull x86 vdso updates from Ingo Molnar:
    "The main changes in this cycle centered around adding support for
    32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
    Safonov"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
    x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
    x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
    x86/signal: Add SA_{X32,IA32}_ABI sa_flags
    x86/ptrace: Down with test_thread_flag(TIF_IA32)
    x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
    x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
    x86/vdso: Replace calculate_addr in map_vdso() with addr
    x86/vdso: Unmap vdso blob on vvar mapping failure

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes are:

    - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

    - schedstat debugging enhancements, make it more broadly runtime
    available. (Josh Poimboeuf)

    - More work on asymmetric topology/capacity scheduling. (Morten
    Rasmussen)

    - sched/wait fixes and cleanups. (Oleg Nesterov)

    - PELT (per entity load tracking) improvements. (Peter Zijlstra)

    - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

    - sched/numa enhancements/fixes (Rik van Riel)

    - sched/cputime scalability improvements (Stanislaw Gruszka)

    - Load calculation arithmetics fixes. (Dietmar Eggemann)

    - sched/deadline enhancements (Tommaso Cucinotta)

    - Fix utilization accounting when switching to the SCHED_NORMAL
    policy. (Vincent Guittot)

    - ... plus misc cleanups and enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    sched/irqtime: Consolidate irqtime flushing code
    sched/irqtime: Consolidate accounting synchronization with u64_stats API
    u64_stats: Introduce IRQs disabled helpers
    sched/irqtime: Remove needless IRQs disablement on kcpustat update
    sched/irqtime: No need for preempt-safe accessors
    sched/fair: Fix min_vruntime tracking
    sched/debug: Add SCHED_WARN_ON()
    sched/core: Fix set_user_nice()
    sched/fair: Introduce set_curr_task() helper
    sched/core, ia64: Rename set_curr_task()
    sched/core: Fix incorrect utilization accounting when switching to fair class
    sched/core: Optimize SCHED_SMT
    sched/core: Rewrite and improve select_idle_siblings()
    sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
    sched/core: Introduce 'struct sched_domain_shared'
    sched/core: Restructure destroy_sched_domain()
    sched/core: Remove unused @cpu argument from destroy_sched_domain*()
    sched/wait: Introduce init_wait_entry()
    sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
    sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
    ...

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "Traditionally, cpufreq is the area with the greatest number of
    changes, but there are fewer of them than last time. There also is
    some activity in the generic power domains and the devfreq frameworks,
    a couple of system suspend and hibernation fixes and some assorted
    changes in other places.

    One new feature is the cpufreq change to allow the scheduler to pass
    hints to the governors' utilization update callbacks and some code
    rework based on that. Another one is the support for domain removal in
    the generic power domains framework. Also it is now possible to use
    hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
    RockChip DFI controller and the rk3399 DMC.

    The rest of the changes is mostly fixes and cleanups in a number of
    places.

    Specifics:

    - Add a mechanism for passing hints from the scheduler to cpufreq
    governors via their utilization update callbacks and use it to
    introduce "IOwait boosting" into the schedutil governor and
    intel_pstate that will make them boost performance if the enqueued
    task was previously waiting on I/O (Rafael Wysocki).

    - Fix a schedutil governor problem that causes it to overestimate
    utilization if SMT is in use (Steve Muckle).

    - Update defconfigs trying to use the schedutil governor as a module
    which is not possible any more (Javier Martinez Canillas).

    - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
    boosting" into account (Srinivas Pandruvada).

    - Fix a problem in the cpufreq core causing it to mishandle the
    initialization of CPUs registered after the cpufreq driver (Viresh
    Kumar, Rafael Wysocki).

    - Make the cpufreq-dt driver support per-policy governor tunables,
    clean it up and update its Kconfig description (Viresh Kumar).

    - Add support for more ARM platforms to the cpufreq-dt driver
    (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

    - Make the cpufreq CPPC driver report frequencies in KHz to avoid
    user space compatibility issues (Al Stone, Hoan Tran).

    - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
    Ian King, Markus Elfring).

    - Constify some local structures in the intel_pstate driver (Julia
    Lawall).

    - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

    - Add support for PM domain removal to the generic power domains
    (genpd) framework, add new DT helper functions to it and make it
    always enable debugfs support if available (Jon Hunter, Tomeu
    Vizoso).

    - Clean up the generic power domains (genpd) framework and make it
    avoid measuring power-on and power-off latencies during system-wide
    PM transitions (Ulf Hansson).

    - Add support for the RockChip DFI controller and the rk3399 DMC to
    the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

    - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
    Stephen Rothwell).

    - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
    devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

    - Fix the system suspend interface to make suspend-to-idle work if
    platform suspend operations have not been registered (Sudeep
    Holla).

    - Make it possible to use hibernation with PAGE_POISONING_ZERO
    enabled (Anisse Astier).

    - Increase the default timeout of the system suspend/resume watchdog
    and make it depend on EXPERT (Chen Yu).

    - Make the operating performance points (OPP) framework avoid using
    OPPs that aren't supported by the platform and fix a build warning
    in it (Dave Gerlach, Arnd Bergmann).

    - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

    - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
    common logging style (Joe Perches)"

    * tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
    PM / OPP: Don't support OPP if it provides supported-hw but platform does not
    cpufreq: st: add missing \n to end of dev_err message
    cpufreq: kirkwood: add missing \n to end of dev_err messages
    PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
    PM / Domains: Don't measure latency of ->power_on|off() during system PM
    PM / Domains: Remove redundant system PM callbacks
    PM / Domains: Simplify detaching a device from its genpd
    PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
    PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
    PM / OPP: avoid maybe-uninitialized warning
    PM / Domains: Allow holes in genpd_data.domains array
    cpufreq: CPPC: Avoid overflow when calculating desired_perf
    cpufreq: ti: Use generic platdev driver
    cpufreq: intel_pstate: Add io_boost trace
    partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
    cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
    cpufreq: schedutil: Add iowait boosting
    cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
    PM / Domains: Add support for removing nested PM domains by provider
    PM / Domains: Add support for removing PM domains
    ...

    Linus Torvalds
     

03 Oct, 2016

1 commit

  • Pull arm64 updates from Will Deacon:
    "It's a bit all over the place this time with no "killer feature" to
    speak of. Support for mismatched cache line sizes should help people
    seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
    updates have been a long time coming, but a lot of the changes here
    are cleanups.

    We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
    workaround is acked by Russell, the DT/OF bits are acked by Rob, the
    arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
    jump_label by Peter (all CC'd).

    Summary:

    - Support for execute-only page permissions
    - Support for hibernate and DEBUG_PAGEALLOC
    - Support for heterogeneous systems with mismatched cache line sizes
    - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
    - arm64 PMU perf updates, including cpumasks for heterogeneous systems
    - Set UTS_MACHINE for building rpm packages
    - Yet another head.S tidy-up
    - Some cleanups and refactoring, particularly in the NUMA code
    - Lots of random, non-critical fixes across the board"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
    arm64: tlbflush.h: add __tlbi() macro
    arm64: Kconfig: remove SMP dependence for NUMA
    arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
    arm64: fix dump_backtrace/unwind_frame with NULL tsk
    arm/arm64: arch_timer: Use archdata to indicate vdso suitability
    arm64: arch_timer: Work around QorIQ Erratum A-008585
    arm64: arch_timer: Add device tree binding for A-008585 erratum
    arm64: Correctly bounds check virt_addr_valid
    arm64: migrate exception table users off module.h and onto extable.h
    arm64: pmu: Hoist pmu platform device name
    arm64: pmu: Probe default hw/cache counters
    arm64: pmu: add fallback probe table
    MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
    arm64: Improve kprobes test for atomic sequence
    arm64/kvm: use alternative auto-nop
    arm64: use alternative auto-nop
    arm64: alternative: add auto-nop infrastructure
    arm64: lse: convert lse alternatives NOP padding to use __nops
    arm64: barriers: introduce nops and __nops macros for NOP sequences
    arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
    ...

    Linus Torvalds
     

02 Oct, 2016

1 commit

  • * pm-devfreq:
    PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
    PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
    partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
    PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
    Documentation: bindings: add dt documentation for rk3399 dmc
    PM / devfreq: event: support rockchip dfi controller
    Documentation: bindings: add dt documentation for dfi controller
    PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
    PM / devfreq: fix Kconfig indent style
    PM / devfreq: Add COMPILE_TEST for build coverage
    PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

    * pm-sleep:
    PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
    PM / sleep: enable suspend-to-idle even without registered suspend_ops
    PM / sleep: Increase default DPM watchdog timeout to 120

    Rafael J. Wysocki
     

01 Oct, 2016

1 commit

  • Antonio reports the following crash when using fuse under memory pressure:

    kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: all of them
    CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
    RIP: shadow_lru_isolate+0x181/0x190
    Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

    which corresponds to the following sanity check in the shadow node
    tracking:

    BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

    The workingset code tracks radix tree nodes that exclusively contain
    shadow entries of evicted pages in them, and this (somewhat obscure)
    line checks whether there are real pages left that would interfere with
    reclaim of the radix tree node under memory pressure.

    While discussing ways how fuse might sneak pages into the radix tree
    past the workingset code, Miklos pointed to replace_page_cache_page(),
    and indeed there is a problem there: it properly accounts for the old
    page being removed - __delete_from_page_cache() does that - but then
    does a raw radix_tree_insert(), not accounting for the replacement
    page. Eventually the page count bits in node->count underflow while
    leaving the node incorrectly linked to the shadow node LRU.

    To address this, make sure replace_page_cache_page() uses the tracked
    page insertion code, page_cache_tree_insert(). This fixes the page
    accounting and makes sure page-containing nodes are properly unlinked
    from the shadow node LRU again.

    Also, make the sanity checks a bit less obscure by using the helpers for
    checking the number of pages and shadows in a radix tree node.
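
    The failure sequence can be modelled in userspace roughly as follows.
    This is a simplified sketch under stated assumptions: the 8-bit page
    field, struct node_model and all function names are hypothetical
    stand-ins for the tracked and raw tree operations described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one tree node; constants are illustrative. */
#define COUNT_SHIFT 8u
#define COUNT_MASK  ((1u << COUNT_SHIFT) - 1u)

struct node_model {
    unsigned int count;      /* low bits: pages, upper bits: shadows */
    bool on_shadow_lru;      /* linked for reclaim as a shadow-only node */
};

/* Tracked delete: drop one page; a node left with only shadows goes on the LRU. */
static void tracked_delete(struct node_model *n)
{
    n->count--;
    if ((n->count & COUNT_MASK) == 0 && (n->count >> COUNT_SHIFT) != 0)
        n->on_shadow_lru = true;
}

/* Tracked insert: count the page and unlink the node from the LRU. */
static void tracked_insert(struct node_model *n)
{
    n->count++;
    n->on_shadow_lru = false;
}

/* Raw insert: the slot is filled but nothing is accounted. */
static void raw_insert(struct node_model *n)
{
    (void)n;
}

/* Reclaim-side sanity check: a node on the LRU must hold no pages. */
static bool lru_node_is_sane(const struct node_model *n)
{
    return !(n->on_shadow_lru && (n->count & COUNT_MASK));
}
```

    Starting from one page and one shadow, a delete followed by
    raw_insert() leaves the node on the LRU; removing the replacement
    page then borrows from the shadow bits and the sanity check fails.
    With tracked_insert() the same sequence stays consistent.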

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Antonio SJ Musumeci
    Debugged-by: Miklos Szeredi
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

30 Sep, 2016

1 commit


29 Sep, 2016

2 commits

  • 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    prevents allocating from an empty nodemask, but as David points out, it is
    still wrong. As node_online_map may include memoryless nodes, only
    allocating from these nodes is meaningless.

    This patch uses node_states[N_MEMORY] mask to prevent the above case.
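
    The difference between the two masks can be illustrated with plain
    bitmasks. This is a sketch only: pick_mask() is a hypothetical helper,
    and bit i set means node i belongs to the set.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: start from a base mask of candidate nodes, exclude
 * the node being offlined, but never clear the only remaining node.
 */
static uint32_t pick_mask(uint32_t base, unsigned int offlining_node)
{
    uint32_t others = base & ~(UINT32_C(1) << offlining_node);

    return others ? others : base;
}
```

    With node 0 being offlined and node 1 online but memoryless, an
    online mask of 0x3 yields 0x2, i.e. only the memoryless node. A
    has-memory mask of 0x1 falls back to 0x1 instead.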

    Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
    Signed-off-by: Li Zhong
    Suggested-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: John Allen
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     
  • I hit the following hung task when running an OOM LTP test case with 4.1
    kernel.

    Call trace:
    [] __switch_to+0x74/0x8c
    [] __schedule+0x23c/0x7bc
    [] schedule+0x3c/0x94
    [] rwsem_down_write_failed+0x214/0x350
    [] down_write+0x64/0x80
    [] __ksm_exit+0x90/0x19c
    [] mmput+0x118/0x11c
    [] do_exit+0x2dc/0xa74
    [] do_group_exit+0x4c/0xe4
    [] get_signal+0x444/0x5e0
    [] do_signal+0x1d8/0x450
    [] do_notify_resume+0x70/0x78

    The oom victim cannot terminate because it needs to take mmap_sem for
    write while the lock is held for read by ksmd, which loops in the page
    allocator:

    ksm_do_scan
    scan_get_next_rmap_item
    down_read
    get_next_rmap_item
    alloc_rmap_item #ksmd will loop permanently.

    There is no way forward because the oom victim cannot release any memory
    in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve
    this problem because it would release the memory asynchronously.
    Nevertheless we can relax alloc_rmap_item requirements and use
    __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
    would just retry later after the lock got dropped.
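
    The idea of tolerating the failure can be sketched in userspace as
    below. It is a model under assumptions, not the ksm code: scan_once(),
    alloc_fn and struct scan_state are hypothetical, and a boolean stands
    in for the read-held lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct scan_state {
    bool lock_held;               /* stands in for the lock held for read */
    unsigned int passes_skipped;  /* passes abandoned on allocation failure */
};

/* Hypothetical allocator hook: returns NULL when memory is tight. */
typedef void *(*alloc_fn)(size_t size);

static void *alloc_fail(size_t size) { (void)size; return NULL; }
static void *alloc_ok(size_t size)   { return malloc(size); }

/* One scan pass: returns true if an item was processed, false if skipped. */
static bool scan_once(struct scan_state *s, alloc_fn alloc)
{
    bool done = false;
    void *item;

    s->lock_held = true;
    item = alloc(64);             /* single attempt, no looping in the allocator */
    if (item) {
        free(item);
        done = true;
    } else {
        s->passes_skipped++;      /* give up; the caller retries on a later pass */
    }
    s->lock_held = false;         /* always dropped, so a waiting writer can proceed */
    return done;
}
```

    Because a failed pass returns instead of looping, the lock is
    released either way and the exiting task is no longer blocked behind
    the scanner.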

    Such a patch would also be easy to backport to older stable kernels which
    do not have oom_reaper.

    While we are at it add __GFP_NOWARN so the admin doesn't have to be alarmed
    by the allocation failure.

    Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Suggested-by: Hugh Dickins
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

26 Sep, 2016

1 commit

  • The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
    defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
    PMDs respectively as requiring balancing upon a subsequent page fault.
    User-defined PROT_NONE memory regions which also have this flag set will
    not normally invoke the NUMA balancing code as do_page_fault() will send
    a segfault to the process before handle_mm_fault() is even called.

    However if access_remote_vm() is invoked to access a PROT_NONE region of
    memory, handle_mm_fault() is called via faultin_page() and
    __get_user_pages() without any access checks being performed, meaning
    the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
    region.

    A simple means of triggering this problem is to access PROT_NONE mmap'd
    memory using /proc/self/mem which reliably results in the NUMA handling
    functions being invoked when CONFIG_NUMA_BALANCING is set.

    This issue was reported in bugzilla (issue 99101) which includes some
    simple repro code.

    There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
    added at commit c0e7cad to avoid accidentally provoking strange
    behaviour by attempting to apply NUMA balancing to pages that are in
    fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

    This patch moves the PROT_NONE check into mm/memory.c rather than
    invoking BUG_ON() as faulting in these pages via faultin_page() is a
    valid reason for reaching the NUMA check with the PROT_NONE page table
    flag set and is therefore not always a bug.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
    Reported-by: Trevor Saunders
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

25 Sep, 2016

4 commits

  • Merge VM fixes from Hugh Dickins:
    "I get the impression that Andrew is away or busy at the moment, so I'm
    going to send you three independent uncontroversial little mm fixes
    directly - though none is strictly a 4.8 regression fix.

    - shmem: fix tmpfs to handle the huge= option properly from Toshi
    Kani is a one-liner to fix a major embarrassment in 4.8's hugepages
    on tmpfs feature: although Hillf pointed it out in June, somehow
    both Kirill and I repeatedly dropped the ball on this one. You
    might wonder if the feature got tested at all with that bug in:
    yes, it did, but for wider testing coverage, Kirill and I had each
    relied too much on an override which bypasses that condition.

    - huge tmpfs: fix Committed_AS leak just a run-of-the-mill accounting
    fix in the same feature.

    - mm: delete unnecessary and unsafe init_tlb_ubc() is an unrelated
    fix to 4.3's TLB flush batching in reclaim: the bug would be rare,
    and none of us will be shamed if this one misses 4.8; but it got
    such a quick ack from Mel today that I'm inclined to offer it along
    with the first two"

    * emailed patches from Hugh Dickins:
    mm: delete unnecessary and unsafe init_tlb_ubc()
    huge tmpfs: fix Committed_AS leak
    shmem: fix tmpfs to handle the huge= option properly

    Linus Torvalds
     
  • init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
    initialized with zeroes in the init_task, and copied from parent to
    child while it is quiescent in arch_dup_task_struct(); so I went to
    delete it.

    But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
    check that it was always empty at that point, and found them firing:
    because memcg reclaim can recurse into global reclaim (when allocating
    biosets for swapout in my case), and arrive back at the init_tlb_ubc()
    in shrink_node_memcg().

    Resetting tlb_ubc.flush_required at that point is wrong: if the upper
    level needs a deferred TLB flush, but the lower level turns out not to,
    we miss a TLB flush. But fortunately, that's the only part of the
    protocol that does not nest: with the initialization removed, cpumask
    collects bits from upper and lower levels, and flushes TLB when needed.
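
    A small userspace model of why the reset does not nest. The names
    (struct ubc_model, reclaim_level(), nested_reclaim_flushes()) are
    hypothetical, and only the flush_required flag is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-task batching state; only the flag is kept. */
struct ubc_model {
    bool flush_required;
};

/* One reclaim level; reset_on_entry models the removed initialization call. */
static void reclaim_level(struct ubc_model *ubc, bool reset_on_entry,
                          bool needs_flush)
{
    if (reset_on_entry)
        ubc->flush_required = false;
    if (needs_flush)
        ubc->flush_required = true;
}

/* Upper level defers a flush, then recursion enters a level that needs none. */
static bool nested_reclaim_flushes(bool reset_on_entry)
{
    struct ubc_model ubc = { false };

    reclaim_level(&ubc, reset_on_entry, true);    /* upper level */
    reclaim_level(&ubc, reset_on_entry, false);   /* nested lower level */
    return ubc.flush_required;                    /* did the pending flush survive? */
}
```

    With the reset in place the nested level wipes out the upper level's
    pending flush; without it the flag survives.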

    Fixes: 72b252aed506 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org # 4.3+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Under swapping load on huge tmpfs, /proc/meminfo's Committed_AS grows
    bigger and bigger: just a cosmetic issue for most users, but disabling
    for those who run without overcommit (/proc/sys/vm/overcommit_memory 2).

    shmem_uncharge() was forgetting to unaccount __vm_enough_memory's
    charge, and shmem_charge() was forgetting it on the filesystem-full
    error path.

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_get_unmapped_area() checks SHMEM_SB(sb)->huge incorrectly, which
    leads to a reversed effect of "huge=" mount option.

    Fix the check in shmem_get_unmapped_area().

    Note, the default value of SHMEM_SB(sb)->huge remains as
    SHMEM_HUGE_NEVER. User will need to specify "huge=" option to enable
    huge page mappings.

    Reported-by: Hillf Danton
    Signed-off-by: Toshi Kani
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

22 Sep, 2016

1 commit


21 Sep, 2016

1 commit

  • While running a compile on arm64, I hit a memory exposure

    usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
    ------------[ cut here ]------------
    kernel BUG at mm/usercopy.c:75!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in: ip6t_rpfilter ip6t_REJECT
    nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
    llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
    nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
    iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
    nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
    ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
    xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
    mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
    auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
    sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
    CPU: 0 PID: 19744 Comm: updatedb Tainted: G W 4.8.0-rc3-threadinfo+ #1
    Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
    task: fffffe03df944c00 task.stack: fffffe00d128c000
    PC is at __check_object_size+0x70/0x3f0
    LR is at __check_object_size+0x70/0x3f0
    ...
    [] __check_object_size+0x70/0x3f0
    [] filldir64+0x158/0x1a0
    [] __fat_readdir+0x4a0/0x558 [fat]
    [] fat_readdir+0x34/0x40 [fat]
    [] iterate_dir+0x190/0x1e0
    [] SyS_getdents64+0x88/0x120
    [] el0_svc_naked+0x24/0x28

    fffffc0000f3b1a8 is a module address. Modules may have compiled in
    strings which could get copied to userspace. In this instance, it
    looks like "." which matches with a size of 1 byte. Extend the
    is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
    all possible cases.

    Signed-off-by: Laura Abbott
    Signed-off-by: Kees Cook

    Laura Abbott
     

20 Sep, 2016

6 commits

  • During cgroup2 rollout into production, we started encountering css
    refcount underflows and css access crashes in the memory controller.
    Splitting the heavily shared css reference counter into logical users
    narrowed the imbalance down to the cgroup2 socket memory accounting.

    The problem turns out to be the per-cpu charge cache. Cgroup1 had a
    separate socket counter, but the new cgroup2 socket accounting goes
    through the common charge path that uses a shared per-cpu cache for all
    memory that is being tracked. Those caches are safe against scheduling
    preemption, but not against interrupts - such as the newly added packet
    receive path. When cache draining is interrupted by network RX taking
    pages out of the cache, the resuming drain operation will put references
    of in-use pages, thus causing the imbalance.

    Disable IRQs during all per-cpu charge cache operations.
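
    The race can be sketched as a deterministic userspace model. This is
    an illustration under assumptions, not the memcg code: the structs,
    drain() and the irq hook are hypothetical, and passing NULL as the
    hook models running with IRQs disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a per-cpu cache of pre-charged pages. */
struct stock_model   { unsigned int cached; };
struct counter_model { int charged; int in_use; };

typedef void (*irq_hook)(struct stock_model *, struct counter_model *);

/* Models network RX in interrupt context taking one page from the cache. */
static void rx_consume(struct stock_model *s, struct counter_model *c)
{
    if (s->cached > 0) {
        s->cached--;
        c->in_use++;
    }
}

/*
 * Drain the cache back to the counter. `irq` runs in the window where an
 * interrupt could land; NULL means the window is closed.
 */
static void drain(struct stock_model *s, struct counter_model *c, irq_hook irq)
{
    unsigned int nr = s->cached;   /* snapshot of the cached amount */

    if (irq)
        irq(s, c);                 /* interrupt between snapshot and update */
    c->charged -= (int)nr;         /* returns the possibly stale snapshot */
    s->cached = 0;
}
```

    When rx_consume() runs inside the window, drain() gives back a page
    that is now in use, so charged drops below in_use. With the window
    closed the two stay balanced.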

    Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
    Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 62c230bc1790 ("mm: add support for a filesystem to activate
    swap files and use direct_IO for writing swap pages") replaced the
    swap_aops dirty hook from __set_page_dirty_no_writeback() with
    swap_set_page_dirty().

    In the normal case, without these special SWP flags, the code path
    falls back to __set_page_dirty_no_writeback(), so the behaviour is
    expected to be the same as before.

    But swap_set_page_dirty() makes use of the page_swap_info() helper to
    get the swap_info_struct to check for the flags like SWP_FILE,
    SWP_BLKDEV etc as desired for those features. This helper has
    BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
    set_page_dirty_lock() path.

    The set_page_dirty() path, however, often has to be called from irq
    context, and kswapd() can toggle the flag behind its back while the
    call is executing when the system is low on memory and heavy swapping
    is ongoing.

    This ends in an undesired kernel panic.

    This patch moves the check out of the helper and into its users as
    appropriate, which fixes the kernel panic for the described path. A
    couple of the helper's users already take care of the SwapCache
    condition, so I skipped them.

    Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
    Signed-off-by: Santosh Shilimkar
    Cc: Mel Gorman
    Cc: Joe Perches
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: David S. Miller
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Al Viro
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Santosh Shilimkar
     
  • dump_page() uses page_mapcount() to get mapcount of the page.
    page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)) as mapcount doesn't
    make sense for slab pages and the field in struct page is used for
    other information.

    It leads to recursion if dump_page() is called for a slab page and
    DEBUG_VM is enabled:

    dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...

    Let's avoid calling page_mapcount() for slab pages in dump_page().

    Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
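
    A minimal userspace sketch of the guard described in the commit above.
    The struct page here is a two-field stand-in, bug_reports is a
    hypothetical counter that replaces the real VM_BUG_ON_PAGE() (which
    would call dump_page() again), and the output format is invented for
    the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Two-field stand-in for the kernel's struct page. */
struct page {
    bool slab;       /* models PageSlab(page) */
    int mapcount;    /* meaningless for slab pages */
};

/* Counts how often the mapcount sanity check would have fired. */
static int bug_reports;

static int page_mapcount(const struct page *page)
{
    if (page->slab) {
        /* The kernel check would dump the page here and recurse. */
        bug_reports++;
        return 0;
    }
    return page->mapcount;
}

/* Format a page description without touching mapcount for slab pages. */
static int dump_page(const struct page *page, char *buf, size_t len)
{
    int mapcount = page->slab ? 0 : page_mapcount(page);

    return snprintf(buf, len, "mapcount:%d slab:%d", mapcount, page->slab);
}
```

    With the guard in place, dumping a slab page never reaches
    page_mapcount(), so the check that would restart the dump is never
    evaluated.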
     
    Currently, khugepaged does not permit swapin if there are not enough
    young pages in a THP. The problem is that when a THP does not have
    enough young pages, khugepaged leaks mapped ptes.

    This patch prevents the mapped ptes from leaking.

    Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    hugepage_vma_revalidate() re-checks, after mmap_sem has been
    re-acquired, whether we should still try to collapse small pages into
    a huge one.

    The problem Dmitry Vyukov reported[1] is that the vma found by
    hugepage_vma_revalidate() can be suitable for huge pages, but not the
    same vma we had before dropping mmap_sem. And dereferencing the
    original vma can lead to fun results.

    Let's use the vma that hugepage_vma_revalidate() found instead of
    assuming it's the same as what we had before the lock was dropped.

    [1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com

    Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vegard Nossum
    Cc: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: Greg Thelen
    Cc: Suleiman Souhlal
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: syzkaller
    Cc: Kostya Serebryany
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
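
    The idea of handing back the vma that was actually found can be
    sketched in userspace as below. The types and the vma_revalidate()
    name are hypothetical, and the suitability test is reduced to a single
    flag; the real function checks considerably more.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's vma and mm structures. */
struct vma {
    unsigned long start, end;   /* covers [start, end) */
    bool hugepage_ok;           /* models "suitable for huge pages" */
};

struct mm {
    struct vma *vmas;
    size_t nr_vmas;
};

/*
 * Look up the vma that covers addr right now. On success, store it in
 * *vmap and return 0, so the caller stops using whatever pointer it held
 * before the lock was dropped. Return -1 if no suitable vma exists.
 */
static int vma_revalidate(const struct mm *mm, unsigned long addr,
                          struct vma **vmap)
{
    for (size_t i = 0; i < mm->nr_vmas; i++) {
        struct vma *vma = &mm->vmas[i];

        if (addr < vma->start || addr >= vma->end)
            continue;
        if (!vma->hugepage_ok)
            return -1;
        *vmap = vma;
        return 0;
    }
    return -1;
}
```

    A caller passes the address of its own vma pointer, so after a
    successful call it holds the current vma even if the address space
    changed while the lock was not held.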
     
  • Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
    neighbor node when mem-offline") introduced new_node_page() for memory
    hotplug.

    In new_node_page(), the nid is cleared before calling
    __alloc_pages_nodemask(). But if it is the only node of the system, and
    the first round allocation fails, it will not be able to get memory from
    an empty nodemask, and will trigger oom.

    The patch checks whether it is the last node on the system, and if it
    is, does not clear the nid in the nodemask.

    Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
    Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
    Signed-off-by: Li Zhong
    Reported-by: John Allen
    Acked-by: Vlastimil Babka
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
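
    The nodemask rule from the commit above, sketched with a plain 64-bit
    mask. fallback_mask() and the nodemask_t typedef are illustrative
    names for this example only, and nid is assumed to be in the range
    0..63.

```c
#include <stdint.h>

/* One bit per node; a stand-in for the kernel's nodemask_t. */
typedef uint64_t nodemask_t;

/*
 * Build the mask used for the fallback allocation when node nid is being
 * offlined. nid is dropped from the mask only if at least one other node
 * remains, so the allocator is never handed an empty mask.
 */
static nodemask_t fallback_mask(nodemask_t online, int nid)
{
    nodemask_t self = (nodemask_t)1 << nid;
    nodemask_t mask = online;

    if (mask & ~self)        /* some other node is available */
        mask &= ~self;
    return mask;
}
```

    On a single-node system the mask keeps its only bit, so the second
    allocation attempt still has somewhere to look.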
     

15 Sep, 2016

1 commit

    Add an API to change the vdso blob type with arch_prctl.
    As this is useful only for CRIU, expose
    this interface under CONFIG_CHECKPOINT_RESTORE.

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: oleg@redhat.com
    Cc: linux-mm@kvack.org
    Cc: gorcunov@openvz.org
    Cc: xemul@virtuozzo.com
    Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
    Signed-off-by: Thomas Gleixner

    Dmitry Safonov
     

14 Sep, 2016

1 commit

  • Commit:

    4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

    changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
    found to introduce a regression with NUMA grouping.

    It was followed up by these commits:

    53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
    bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
    b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

    The first two of those commits try alternate approaches to NUMA
    grouping, which apparently do not work as well as looking at the PTE
    write permissions.

    The latter patch preserves the PTE write permissions across a NUMA
    protection fault. However, it forgets to revert the condition for
    whether or not to group tasks together back to what it was before
    v3.19, even though the information is now preserved in the page tables
    once again.

    This patch brings the NUMA grouping heuristic back to what it was
    before commit 4d9424669946, which the changelogs of subsequent
    commits suggest worked best.

    We have all the information again. We should probably use it.

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: aarcange@redhat.com
    Cc: linux-mm@kvack.org
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

13 Sep, 2016

1 commit

    PAGE_POISONING_ZERO disables zeroing new pages on alloc; instead, they
    are poisoned (zeroed) as they become available.
    In the hibernate use case, free pages will appear in the system without
    being cleared, left there by the loading kernel.

    This patch will make sure free pages are cleared on resume when
    PAGE_POISONING_ZERO is enabled. We free the pages just after resume
    because we can't do it later: going through any device resume code might
    allocate some memory and invalidate the free pages bitmap.

    Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
    enabled.

    Signed-off-by: Anisse Astier
    Reviewed-by: Kees Cook
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Anisse Astier
     

11 Sep, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "nvdimm fixes for v4.8, two of them are tagged for -stable:

    - Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
    DAX pmd mappings end up with an uncached pgprot, and unusable
    performance for the device-dax interface. The device-dax interface
    appeared in 4.7 so this is tagged for -stable.

    - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
    understand DAX pmd entries. This fix is tagged for -stable.

    - Fix a mis-merge of the nfit machine-check handler to flip the
    polarity of an if() to match the final version of the patch that
    Vishal sent for 4.8-rc1. Without this the nfit machine check
    handler never detects / inserts new 'badblocks' entries which
    applications use to identify lost portions of files.

    - For test purposes, fix the nvdimm_clear_poison() path to operate on
    legacy / simulated nvdimm memory ranges. Without this fix a test
    can set badblocks, but never clear them on these ranges.

    - Fix the range checking done by dax_dev_pmd_fault(). This is not
    tagged for -stable since this problem is mitigated by specifying
    aligned resources at device-dax setup time.

    These patches have appeared in a next release over the past week. The
    recent rebase you can see in the timestamps was to drop an invalid fix
    as identified by the updated device-dax unit tests [1]. The -mm
    touches have an ack from Andrew"

    [1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
    https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: allow legacy (e820) pmem region to clear bad blocks
    nfit, mce: Fix SPA matching logic in MCE handler
    mm: fix cache mode of dax pmd mappings
    mm: fix show_smap() for zone_device-pmd ranges
    dax: fix mapping size check

    Linus Torvalds
     

10 Sep, 2016

1 commit

    Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
    [] smaps_pte_range+0xa0/0x4b0
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
    Call Trace:
    [] ? vsnprintf+0x255/0x4c0
    [] __walk_page_range+0x1fe/0x4d0
    [] walk_page_vma+0x62/0x80
    [] show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

08 Sep, 2016

1 commit

  • A custom allocator without __GFP_COMP that copies to userspace has been
    found in vmw_execbuf_process[1], so this disables the page-span checker
    by placing it behind a CONFIG for future work where such things can be
    tracked down later.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1373326

    Reported-by: Vinson Lee
    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Kees Cook

    Kees Cook
     

07 Sep, 2016

3 commits

  • Install the callbacks via the state machine and let the core invoke
    the callbacks on the already online CPUs.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Jens Axboe
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Install the callbacks via the state machine.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

02 Sep, 2016

3 commits

  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current
    ...

    This patch sets current->mempolicy to NULL before dropping the final
    reference.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
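
    The ordering fix in the commit above (clear the task's pointer first,
    then drop the reference) can be modelled in userspace as follows.
    Everything here is a hypothetical stand-in: alloc_during_free() plays
    the role of an allocation made while the object is being freed, and
    stale_policy_seen records whether it could still reach the dying
    policy through the task.

```c
#include <stddef.h>
#include <stdlib.h>

struct mempolicy {
    int refcnt;
};

struct task {
    struct mempolicy *mempolicy;
};

/* Set if an allocation during the free still saw the dying policy. */
static int stale_policy_seen;

/* Stand-in for an allocation that consults task->mempolicy. */
static void alloc_during_free(const struct task *t,
                              const struct mempolicy *dying)
{
    if (t->mempolicy == dying)
        stale_policy_seen = 1;
}

/* Drop one reference; freeing may itself trigger an allocation. */
static void mpol_put(const struct task *t, struct mempolicy *pol)
{
    if (pol && --pol->refcnt == 0) {
        alloc_during_free(t, pol);
        free(pol);
    }
}

/* Detach the policy from the task before the final reference goes away. */
static void put_task_policy(struct task *t)
{
    struct mempolicy *pol = t->mempolicy;

    t->mempolicy = NULL;
    mpol_put(t, pol);
}
```

    Because the task no longer points at the policy by the time it is
    freed, nothing that runs during the free can pick it up again.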
     
  • Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
    of memory when booting a secondary kernel. Srikar Dronamraju reported
    that multiple nodes may have no memory managed by the buddy allocator
    but still return true for populated_zone().

    Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") was reported to cause kswapd to spin at 100% CPU usage when
    fadump was enabled. The old code happened to deal with the situation of
    a populated node with zero free pages by coincidence but the current
    code tries to reclaim populated zones without realising that is
    impossible.

    We cannot just convert populated_zone() as many existing users really
    need to check for present_pages. This patch introduces a managed_zone()
    helper and uses it in the few cases where it is critical that the check
    is made for managed pages -- zonelist construction and page reclaim.

    Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Srikar Dronamraju
    Tested-by: Srikar Dronamraju
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
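
    The difference between the two helpers named in the commit above comes
    down to which counter is tested. A minimal sketch, with struct zone
    cut down to the two fields involved (the real structure has many
    more):

```c
#include <stdbool.h>

/* Reduced stand-in for the kernel's struct zone. */
struct zone {
    unsigned long present_pages;   /* physical pages that exist in the zone */
    unsigned long managed_pages;   /* pages handed to the buddy allocator */
};

/* True if the zone has any physical memory at all. */
static bool populated_zone(const struct zone *zone)
{
    return zone->present_pages != 0;
}

/* True only if the buddy allocator actually manages pages in the zone. */
static bool managed_zone(const struct zone *zone)
{
    return zone->managed_pages != 0;
}
```

    A zone whose memory is entirely reserved, as in the fadump case, is
    populated but not managed, which is the combination that reclaim
    cannot make progress on.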
     
    There have been several reports of premature OOM killer invocation in
    the 4.7 kernel, where an order-2 allocation request (for the kernel
    stack) invoked the OOM killer even during basic workloads (light IO or
    even a kernel compile on some filesystems). In all reported cases the
    memory is
    fragmented and there are no order-2+ pages available. There is usually
    a large amount of slab memory (usually dentries/inodes) and further
    debugging has shown that there are way too many unmovable blocks which
    are skipped during the compaction. Multiple reporters have confirmed
    that the current linux-next which includes [1] and [2] helped and OOMs
    are not reproducible anymore.

    A simpler fix for the late rc and stable is to simply ignore the
    compaction feedback and retry as long as there is reclaim progress and
    we are not getting OOM for order-0 pages. We already do that for
    CONFIG_COMPACTION=n so let's reuse the same code when compaction is
    enabled as well.

    [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
    [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

    Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
    Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Tested-by: Olaf Hering
    Tested-by: Ralf-Peter Rohbeck
    Cc: Markus Trippelsdorf
    Cc: Arkadiusz Miskiewicz
    Cc: Ralf-Peter Rohbeck
    Cc: Jiri Slaby
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: [4.7.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
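
    A rough userspace model of the retry rule described in the commit
    above: ignore compaction feedback and keep retrying a non-costly
    high-order request while some zone still passes an order-0 watermark
    check. The struct zone fields and the simple free-versus-minimum
    comparison are assumptions made for this sketch; the kernel's
    watermark check is more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Orders above this are treated as costly and are not retried here. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Reduced stand-in for a zone: just enough for an order-0 check. */
struct zone {
    unsigned long free_pages;
    unsigned long min_wmark;
};

/*
 * Decide whether a failed allocation of the given order should be
 * retried. Order-0 and costly orders never retry through this path.
 * Otherwise retry as long as at least one zone is not out of memory
 * for order-0 pages.
 */
static bool should_compact_retry(const struct zone *zones, size_t nr_zones,
                                 unsigned int order)
{
    if (order == 0 || order > PAGE_ALLOC_COSTLY_ORDER)
        return false;

    for (size_t i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages > zones[i].min_wmark)
            return true;
    }
    return false;
}
```

    Under this rule an order-2 request keeps retrying on a fragmented but
    otherwise healthy system instead of going straight to the OOM killer.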
     

27 Aug, 2016

3 commits

  • For DAX inodes we need to be careful to never have page cache pages in
    the mapping->page_tree. This radix tree should be composed only of DAX
    exceptional entries and zero pages.

    ltp's readahead02 test was triggering a warning because we were trying
    to insert a DAX exceptional entry but found that a page cache page had
    already been inserted into the tree. This page was being inserted into
    the radix tree in response to a readahead(2) call.

    Readahead doesn't make sense for DAX inodes, but we don't want it to
    report a failure either. Instead, we just return success and don't do
    any work.

    Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Jeff Moyer
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
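
    The behaviour described in the commit above (report success, do no
    work) can be sketched as below. struct inode_model and
    readahead_model() are hypothetical names for this example; where
    exactly the kernel places the check is not shown here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an inode and its page cache bookkeeping. */
struct inode_model {
    bool is_dax;                  /* mapping is backed directly by DAX */
    unsigned long pages_queued;   /* page cache pages readahead inserted */
};

/*
 * Model of a readahead request. For DAX inodes nothing is inserted into
 * the page cache, but the call still reports success (0) so userspace
 * does not see a failure.
 */
static long readahead_model(struct inode_model *inode, unsigned long nr_pages)
{
    if (inode->is_dax)
        return 0;

    inode->pages_queued += nr_pages;
    return 0;
}
```

    Both kinds of inode return 0; only the non-DAX one ends up with page
    cache pages.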
     
  • A bugfix in v4.8-rc2 introduced a harmless warning when
    CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

    mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
    static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

    This moves the function inside of the #ifdef block that hides the
    calling function, to avoid the warning.

    Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
    Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
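
    The pattern behind the fix in the commit above: a static helper lives
    inside the same #ifdef block as its only caller, so it is never
    compiled without a user. SWAP_ACCOUNTING, id_get_online() and
    swapout_id() are hypothetical names standing in for the real config
    option and functions.

```c
/* Hypothetical stand-in for the config option; remove to disable. */
#define SWAP_ACCOUNTING 1

#ifdef SWAP_ACCOUNTING
/*
 * Helper used only by swapout_id(). Keeping it inside the #ifdef means
 * it disappears together with its caller instead of triggering an
 * unused-function warning.
 */
static int id_get_online(int id)
{
    return id < 0 ? 0 : id;   /* fall back to id 0 for invalid ids */
}

/* The only caller of the helper. */
int swapout_id(int id)
{
    return id_get_online(id);
}
#endif /* SWAP_ACCOUNTING */
```

    Building with and without the define, using -Wunused-function, should
    produce no warning in either configuration.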
     
  • The current wording of the COMPACTION Kconfig help text doesn't
    emphasise that disabling COMPACTION might cripple the page allocator
    which relies on the compaction quite heavily for high order requests and
    an unexpected OOM can happen with the lack of compaction. Make sure we
    are vocal about that.

    Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Markus Trippelsdorf
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko