29 Jul, 2016

40 commits

  • Pull tracing updates from Steven Rostedt:
    "This is mostly clean ups and small fixes. Some of the more visible
    changes are:

    - The function pid code uses the event pid filtering logic
    - [ku]probe events have access to current->comm
    - trace_printk now has sample code
    - PCI devices now trace physical addresses
    - stack tracing has fewer unnecessary functions traced"

    * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk, tracing: Avoiding unneeded blank lines
    tracing: Use __get_str() when manipulating strings
    tracing, RAS: Cleanup on __get_str() usage
    tracing: Use outer () on __get_str() definition
    ftrace: Reduce size of function graph entries
    tracing: Have HIST_TRIGGERS select TRACING
    tracing: Using for_each_set_bit() to simplify trace_pid_write()
    ftrace: Move toplevel init out of ftrace_init_tracefs()
    tracing/function_graph: Fix filters for function_graph threshold
    tracing: Skip more functions when doing stack tracing of events
    tracing: Expose CPU physical addresses (resource values) for PCI devices
    tracing: Show the preempt count of when the event was called
    tracing: Add trace_printk sample code
    tracing: Choose static tp_printk buffer by explicit nesting count
    tracing: expose current->comm to [ku]probe events
    ftrace: Have set_ftrace_pid use the bitmap like events do
    tracing: Move pid_list write processing into its own function
    tracing: Move the pid_list seq_file functions to be global
    tracing: Move filtered_pid helper functions into trace.c
    tracing: Make the pid filtering helper functions global

    Linus Torvalds
     
  • Pull VFIO updates from Alex Williamson:
    - Enable no-iommu mode for platform devices (Peng Fan)
    - Sub-page mmap for exclusive pages (Yongji Xie)
    - Use-after-free fix (Ilya Lesokhin)
    - Support for ACPI-based platform devices (Sinan Kaya)

    * tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: platform: check reset call return code during release
    vfio: platform: check reset call return code during open
    vfio, platform: make reset driver a requirement by default
    vfio: platform: call _RST method when using ACPI
    vfio: platform: add extra debug info argument to call reset
    vfio: platform: add support for ACPI probe
    vfio: platform: determine reset capability
    vfio: platform: move reset call to a common function
    vfio: platform: rename reset function
    vfio: fix possible use after free of vfio group
    vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
    vfio: platform: support No-IOMMU mode

    Linus Torvalds
     
  • Pull MD updates from Shaohua Li:
    - A bunch of patches from Neil Brown to fix RCU usage
    - Two performance improvement patches from Tomasz Majchrzak
    - Alexey Obitotskiy fixes module refcount issue
    - Arnd Bergmann fixes time granularity
    - Cong Wang fixes a list corruption issue
    - Guoqing Jiang fixes a deadlock in md-cluster
    - A null pointer dereference fix from me
    - Song Liu fixes misuse of raid6 rmw
    - Other trivial/cleanup fixes from Guoqing Jiang and Xiao Ni

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
    MD: fix null pointer deference
    raid10: improve random reads performance
    md: add missing sysfs_notify on array_state update
    Fix kernel module refcount handling
    md: use seconds granularity for error logging
    md: reduce the number of synchronize_rcu() calls when multiple devices fail.
    md: be extra careful not to take a reference to a Faulty device.
    md/multipath: add rcu protection to rdev access in multipath_status.
    md/raid5: add rcu protection to rdev accesses in raid5_status.
    md/raid5: add rcu protection to rdev accesses in want_replace
    md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
    md/raid1: add rcu protection to rdev in fix_read_error
    md/raid1: small code cleanup in end_sync_write
    md/raid1: small cleanup in raid1_end_read/write_request
    md/raid10: simplify print_conf a little.
    md/raid10: minor code improvement in fix_read_error()
    md/raid10: add rcu protection to rdev access during reshape.
    md/raid10: add rcu protection to rdev access in raid10_sync_request.
    md/raid10: add rcu protection in raid10_status.
    md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
    ...

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that, when written and fenced, assures
    that all previous posted writes targeting a given dimm have been
    flushed to media (a rough sketch follows this list of changes).

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.
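
    As a rough illustration of the directed flushing described above (a
    sketch only; the function name and the ioremapped pointer are
    assumptions, not the libnvdimm API):

    #include <linux/io.h>

    /* Write any value to an ioremapped "Flush Hint Address" to flush the
     * dimm's posted-write buffers; fences order the surrounding writes. */
    static void flush_hint_write(void __iomem *flush_hint)
    {
            wmb();                  /* order prior posted writes to the dimm */
            writeq(1, flush_hint);  /* the write itself triggers the flush  */
            wmb();                  /* fence the hint write                 */
    }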

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     
  • Pull pin control updates from Linus Walleij:
    "This is the bulk of pin control changes for the v4.8 kernel cycle.

    Nothing stands out as especially exciting: new drivers, new subdrivers,
    lots of cleanups and incremental features.

    Business as usual.

    New drivers:

    - New driver for Oxnas pin control and GPIO. This ARM-based chipset
    is used in a few storage (NAS) type devices.

    - New driver for the MAX77620/MAX20024 pin controller portions.

    - New driver for the Intel Merrifield pin controller.

    New subdrivers:

    - New subdriver for the Qualcomm MDM9615

    - New subdriver for the STM32F746 MCU

    - New subdriver for the Broadcom NSP SoC.

    Cleanups:

    - Demodularization of drivers that are bool (always built-in) in Kconfig.

    Apart from this there is just regular incremental improvements to a
    lot of drivers, especially Uniphier and PFC"

    * tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (131 commits)
    pinctrl: fix pincontrol definition for marvell
    pinctrl: xway: fix typo
    Revert "pinctrl: amd: make it explicitly non-modular"
    pinctrl: iproc: Add NSP and Stingray GPIO support
    pinctrl: Update iProc GPIO DT bindings
    pinctrl: bcm: add OF dependencies
    pinctrl: ns2: remove redundant dev_err call in ns2_pinmux_probe()
    pinctrl: Add STM32F746 MCU support
    pinctrl: intel: Protect set wake flow by spin lock
    pinctrl: nsp: remove redundant dev_err call in nsp_pinmux_probe()
    pinctrl: uniphier: add Ethernet pin-mux settings
    sh-pfc: Use PTR_ERR_OR_ZERO() to simplify the code
    pinctrl: ns2: fix return value check in ns2_pinmux_probe()
    pinctrl: qcom: update DT bindings with ebi2 groups
    pinctrl: qcom: establish proper EBI2 pin groups
    pinctrl: imx21: Remove the MODULE_DEVICE_TABLE() macro
    Documentation: dt: Add new compatible to STM32 pinctrl driver bindings
    includes: dt-bindings: Add STM32F746 pinctrl DT bindings
    pinctrl: sunxi: fix nand0 function name for sun8i
    pinctrl: uniphier: remove pointless pin-mux settings for PH1-LD11
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code has become quite complicated in order to distinguish these two all
    the way up to the __alloc_pages_slowpath() level, so that different
    decisions could be taken for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we once again don't need to distinguish
    lock and sched contention, and can simplify the currently convoluted
    code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
    all, unless the system admin has overridden the default, or the application
    has indicated via madvise that it can benefit from THPs. In both cases, it
    means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
    wouldn't, the second compaction attempt is still async and will detect the
    contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
    in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
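
    A minimal sketch of the idea, with names following the commit
    description (the exact upstream definitions may differ):

    /* Explicit compaction priorities replace the raw migration mode
     * argument; compact_zone_order() maps them back to migrate modes. */
    enum compact_priority {
            COMPACT_PRIO_SYNC_LIGHT,
            COMPACT_PRIO_ASYNC,
    };

    static enum migrate_mode compact_prio_to_mode(enum compact_priority prio)
    {
            /* no functional change yet: same modes as before the patch */
            return prio == COMPACT_PRIO_ASYNC ? MIGRATE_ASYNC
                                              : MIGRATE_SYNC_LIGHT;
    }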

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
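
    For reference, the two masks roughly correspond to the following (a
    sketch modelled on include/linux/gfp.h; the exact definitions may
    differ):

    /* Page-fault default: no reclaim or compaction at all. */
    #define GFP_TRANSHUGE_LIGHT  ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                   __GFP_NOMEMALLOC | __GFP_NOWARN) & \
                                  ~__GFP_RECLAIM)
    /* khugepaged default: direct reclaim allowed, but no kswapd wakeup. */
    #define GFP_TRANSHUGE        (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)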

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since THP allocations during page faults can be costly, extra decisions
    are employed for them to avoid excessive reclaim and compaction, if the
    initial compaction doesn't look promising. The detection has never been
    perfect as there is no gfp flag specific to THP allocations. At this
    moment it checks the whole combination of flags that makes up
    GFP_TRANSHUGE, and hopes that no other users of such combination exist,
    or would mind being treated the same way. Extra care is also taken to
    separate allocations from khugepaged, where latency doesn't matter that
    much.

    It is however possible to distinguish these allocations in a simpler and
    more reliable way. The key observation is that after the initial
    compaction followed by the first iteration of "standard"
    reclaim/compaction, both __GFP_NORETRY allocations and costly
    allocations without __GFP_REPEAT are declared as failures:

    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
            goto nopage;

    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_REPEAT
     */
    if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
            goto nopage;

    This means we can further distinguish allocations that are costly order
    *and* additionally include the __GFP_NORETRY flag. As it happens,
    GFP_TRANSHUGE allocations do already fall into this category. This will
    also allow other costly allocations with similar high-order benefit vs
    latency considerations to use this semantic. Furthermore, we can
    distinguish THP allocations that should try a bit harder (such as from
    khugepaged) by removing __GFP_NORETRY, as will be done in the next
    patch.
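
    In other words, the test described above amounts to the following
    (illustrative helper, not taken from the patch):

    /* "Lightweight" costly allocations: give up after one round of
     * reclaim/compaction. GFP_TRANSHUGE already satisfies this test. */
    static inline bool is_costly_noretry(gfp_t gfp_mask, unsigned int order)
    {
            return order > PAGE_ALLOC_COSTLY_ORDER &&
                   (gfp_mask & __GFP_NORETRY);
    }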

    Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The retry loop in __alloc_pages_slowpath is supposed to keep trying
    reclaim and compaction (and OOM), until either the allocation succeeds,
    or returns with failure. Success here is more probable when reclaim
    precedes compaction, as certain watermarks have to be met for compaction
    to even try, and more free pages increase the probability of compaction
    success. On the other hand, starting with light async compaction (if
    the watermarks allow it) can be more efficient, especially for smaller
    orders, if there's enough free memory that is merely fragmented.

    Thus, the current code starts with compaction before reclaim, and to
    make sure that the last reclaim is always followed by a final
    compaction, there's another direct compaction call at the end of the
    loop. This makes the code hard to follow and adds some duplicated
    handling of migration_mode decisions. It's also somewhat inefficient
    that even if reclaim or compaction decides not to retry, the final
    compaction is still attempted. Some gfp flag combinations also shortcut
    these retry decisions with "goto noretry;", making it even harder to
    follow.

    This patch attempts to restructure the code with only minimal functional
    changes. The call to the first compaction and THP-specific checks are
    now placed above the retry loop, and the "noretry" direct compaction is
    removed.

    The initial compaction is additionally restricted only to costly orders,
    as we can expect smaller orders to be held back by watermarks, and only
    larger orders to suffer primarily from fragmentation. This better
    matches the checks in reclaim's shrink_zones().

    There are two other smaller functional changes. One is that the upgrade
    from async migration to light sync migration will always occur after the
    initial compaction. This is how it has been until recent patch "mm,
    oom: protect !costly allocations some more", which introduced upgrading
    the mode based on COMPACT_COMPLETE result, but kept the final compaction
    always upgraded, which made it even more special. It's better to return
    to the simpler handling for now, as migration modes will be further
    modified later in the series.

    The second change is that once both reclaim and compaction declare it's
    not worth retrying the reclaim/compact loop, there is no final
    compaction attempt. As argued above, this is intentional. If that
    final compaction were to succeed, it would be due to a wrong retry
    decision, or simply a race with somebody else freeing memory for us.

    The main outcome of this patch should be simpler code. Logically, the
    initial compaction without reclaim is the exceptional case to the
    reclaim/compaction scheme, but prior to the patch, it was the last loop
    iteration that was exceptional. Now the code matches the logic better.
    The change also enables the following patches.

    Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
    kswapd, it first tries get_page_from_freelist() with the new
    alloc_flags, as it may succeed e.g. due to using min watermark instead
    of low watermark. It makes sense to do this attempt before adjusting the
    zonelist based on alloc_flags/gfp_mask, as it's still a relatively fast
    path if we just wake up kswapd and successfully allocate.

    This patch therefore moves the initial attempt above the retry label and
    reorganizes the part below the retry label a bit. We still have to
    attempt get_page_from_freelist() on each retry, as some allocations
    cannot do that as part of direct reclaim or compaction, and yet are not
    allowed to fail (even though they do a WARN_ON_ONCE() and thus should
    not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
    and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
    it. As a side-effect, the attempts from direct reclaim/compaction will
    also no longer obey watermarks once this is set, but there's little harm
    in that.

    Kswapd wakeups are also done on each retry to be safe from potential
    races resulting in kswapd going to sleep while a process (that may not
    be able to reclaim by itself) is still looping.

    Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
    initialized, so move the initialization above the retry: label. Also
    make the comment above the initialization more descriptive.

    The only exception in the alloc_flags being constant is
    ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
    allocating thread. We can fix this, and make the code simpler and a bit
    more effective at the same time, by moving the part that determines
    ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

    This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
    places in __alloc_pages_slowpath() anymore. The only two tests for the
    flag can instead call gfp_pfmemalloc_allowed().
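
    A sketch of the resulting helper, following the description above (the
    upstream version may differ in detail):

    bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
    {
            if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                    return false;
            if (gfp_mask & __GFP_MEMALLOC)
                    return true;
            if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                    return true;
            if (!in_interrupt() &&
                ((current->flags & PF_MEMALLOC) ||
                 unlikely(test_thread_flag(TIF_MEMDIE))))
                    return true;
            return false;
    }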

    Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This (large, atomic) allocation attempt can fail. We expect and handle
    that, so avoid the scary warning.

    Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For KASAN builds:
    - switch SLUB allocator to using stackdepot instead of storing the
    allocation/deallocation stacks in the objects;
    - change the freelist hook so that parts of the freelist can be put
    into the quarantine.
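
    A minimal sketch of saving an allocation stack through stackdepot,
    roughly what the switch above does per SLUB object (the helper name and
    stack depth are assumptions):

    #include <linux/stackdepot.h>
    #include <linux/stacktrace.h>

    static depot_stack_handle_t save_alloc_stack(gfp_t flags)
    {
            unsigned long entries[64];
            struct stack_trace trace = {
                    .entries     = entries,
                    .max_entries = 64,
            };

            save_stack_trace(&trace);
            /* store the trace once, keep only a small handle in the object */
            return depot_save_stack(&trace, flags);
    }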

    [aryabinin@virtuozzo.com: fixes]
    Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • When looking up the nearest SLUB object for a given address, correctly
    calculate its offset if SLAB_RED_ZONE is enabled for that cache.

    Previously, when KASAN had detected an error on an object from a cache
    with SLAB_RED_ZONE set, the actual start address of the object was
    miscalculated, which led to random stacks having been reported.

    Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support")
    Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt (Red Hat)
    Cc: Joonsoo Kim
    Cc: Kostya Serebryany
    Cc: Andrey Ryabinin
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • There's one case when vma_adjust() expands the vma, overlapping with
    the *two* next vmas. See case 6 of mprotect, described in the comment to
    vma_merge().

    To handle this (and only this) situation we iterate twice over main part
    of the function. See "goto again".

    Vegard reported[1] that he sees an out-of-bounds access complaint from
    KASAN, if anon_vma_clone() on the *second* iteration fails.

    This happens because we free 'next' vma by the end of first iteration
    and don't have a way to undo this if anon_vma_clone() fails on the
    second iteration.

    The solution is to do all required allocations upfront, before we touch
    vmas.

    The allocation on the second iteration is only required if the first two
    vmas don't have an anon_vma, but the third does. So we need, in total, one
    anon_vma_clone() call.

    It's easy to adjust 'exporter' to the third vma in that case.

    [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com

    Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • iput() tests whether its argument is NULL and then returns immediately.
    Thus the test around the call is not needed.
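
    For illustration (generic form, not the exact zsmalloc lines), the
    cleanup replaces:

    if (inode)              /* redundant: iput() already ignores NULL */
            iput(inode);

    /* with simply */
    iput(inode);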

    This issue was detected by using the Coccinelle software.

    Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
    Signed-off-by: Markus Elfring
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • Fix a region index adjustment error when the type_b parameter of
    __next_mem_range_rev() is NULL.

    Signed-off-by: zijun_hu
    Cc: Alexander Kuleshov
    Cc: Ard Biesheuvel
    Cc: Tang Chen
    Cc: Wei Yang
    Cc: Tang Chen
    Cc: Richard Leitner
    Cc: David Gibson
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • If we offline a node, allocate the new page from the nearest neighbor
    node instead of the current node or other remote nodes, because
    re-migration is a waste of time and remote nodes are often far away.

    Also use GFP_HIGHUSER_MOVABLE to allocate the new page if the zone is a
    movable or highmem zone.

    Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • copy_page_to_iter_iovec() and copy_page_from_iter_iovec() copy some data
    to userspace or from userspace. These functions have a fast path where
    they map a page using kmap_atomic and a slow path where they use kmap.

    kmap is slower than kmap_atomic, so the fast path is preferred.

    However, on kernels without highmem support, kmap just calls
    page_address, so there is no need to avoid kmap. On kernels without
    highmem support, the fast path just increases code size (and cache
    footprint) and it doesn't improve copy performance in any way.

    This patch enables the fast path only if CONFIG_HIGHMEM is defined.

    Code size reduced by this patch:
    x86 (without highmem) 928
    x86-64 960
    sparc64 848
    alpha 1136
    pa-risc 1200
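
    A simplified sketch of the fast/slow path split described above (not
    the actual iov_iter code; names are illustrative):

    static size_t copy_page_chunk_to_user(struct page *page, size_t offset,
                                          void __user *to, size_t bytes)
    {
            size_t left;

            if (IS_ENABLED(CONFIG_HIGHMEM)) {
                    /* fast path: atomic kmap, cannot take faults */
                    char *kaddr = kmap_atomic(page);

                    left = __copy_to_user_inatomic(to, kaddr + offset, bytes);
                    kunmap_atomic(kaddr);
                    if (left == 0)
                            return bytes;
            }

            /* slow path: kmap() may sleep; without highmem it is just
             * page_address(), so the fast path above buys nothing */
            {
                    char *kaddr = kmap(page);

                    left = copy_to_user(to, kaddr + offset, bytes);
                    kunmap(page);
            }
            return bytes - left;
    }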

    [akpm@linux-foundation.org: use IS_ENABLED(), per Andi]
    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221711410.4818@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Alexander Viro
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • generic_swapfile_activate() can take quite a long time since it iterates
    over all blocks of a file, so add cond_resched() to it. I observed stalls
    of about 1 second when activating a swapfile that was almost
    unfragmented; this patch fixes it.

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Alexander Viro
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
    if there are free elements").

    There has been a report about OOM killer invoked when swapping out to a
    dm-crypt device. The primary reason seems to be that the swapout IO
    managed to completely deplete memory reserves. Ondrej was able to
    bisect and explained the issue by pointing to f9054c70d28b ("mm,
    mempool: only set __GFP_NOMEMALLOC if there are free elements").

    The reason is that the swapout path is not throttled properly because
    the md-raid layer needs to allocate from the generic_make_request path
    which means it allocates from the PF_MEMALLOC context. The dm layer uses
    mempool_alloc() in order to guarantee forward progress, which used to
    inhibit access to memory reserves when using the page allocator. This has
    changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
    there are free elements") which has dropped the __GFP_NOMEMALLOC
    protection when the memory pool is depleted.

    If we are running out of memory and the only way forward to free memory
    is to perform swapout we just keep consuming memory reserves rather than
    throttling the mempool allocations and allowing the pending IO to
    complete up to a moment when the memory is depleted completely and there
    is no way forward but invoking the OOM killer. This is less than
    optimal.

    The original intention of f9054c70d28b was to help with the OOM
    situations where the oom victim depends on mempool allocation to make a
    forward progress. David has mentioned the following backtrace:

    schedule
    schedule_timeout
    io_schedule_timeout
    mempool_alloc
    __split_and_process_bio
    dm_request
    generic_make_request
    submit_bio
    mpage_readpages
    ext4_readpages
    __do_page_cache_readahead
    ra_submit
    filemap_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault

    We do not know more about why the mempool is depleted without being
    replenished in time, though. In any case the dm layer shouldn't depend
    on any allocations outside of the dedicated pools so a forward progress
    should be guaranteed. If this is not the case then the dm should be
    fixed rather than papering over the problem and postponing it to later
    by accessing more memory reserves.

    mempools are a mechanism to maintain dedicated memory reserves to
    guarantee forward progress. Allowing them unbounded access to the
    page allocator memory reserves is going against the whole purpose of
    this mechanism.

    Bisected by Ondrej Kozina.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Ondrej Kozina
    Reviewed-by: Johannes Weiner
    Acked-by: NeilBrown
    Cc: David Rientjes
    Cc: Mikulas Patocka
    Cc: Ondrej Kozina
    Cc: Tetsuo Handa
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
    isolate a PageWriteback page, which __unmap_and_move() then rejects with
    -EBUSY: of course the writeback might complete in between, but that's
    not what we usually expect, so probably better not to isolate it.

    When tested by stress-highalloc from mmtests, this has reduced the
    number of page migrate failures by 60-70%.

    Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
    Signed-off-by: Hugh Dickins
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • dequeue_hwpoisoned_huge_page() can be called without the page lock held,
    so let's remove the incorrect comment.

    The reason why the page lock is not really needed is that
    dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
    hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
    is just allocated but not yet linked to the active list, even without
    taking the page lock.

    Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Reported-by: Zhan Chen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
    section number with a subtraction directly.

    Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Zhou Chengming
    Cc: Dave Hansen
    Cc: Tejun Heo
    Cc: Hanjun Guo
    Cc: Li Bin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Chengming
     
  • If the user tries to disable automatic scanning early in the boot
    process using e.g.:

    echo scan=off > /sys/kernel/debug/kmemleak

    then this command will hang until SECS_FIRST_SCAN (= 60) seconds have
    elapsed, even though the system is fully initialised.

    We can fix this using interruptible sleep and checking if we're supposed
    to stop whenever we wake up (like the rest of the code does).
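
    A sketch of the fix as described (SECS_FIRST_SCAN is the existing
    kmemleak constant; the helper name here is made up):

    static void wait_before_first_scan(void)
    {
            signed long timeout = msecs_to_jiffies(SECS_FIRST_SCAN * 1000);

            /* sleep interruptibly and re-check the stop condition on every
             * wakeup, instead of one long uninterruptible sleep */
            while (timeout && !kthread_should_stop())
                    timeout = schedule_timeout_interruptible(timeout);
    }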

    Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • When booting an ACPI enabled kernel with 'mem=x', there is the
    possibility that ACPI data regions from the firmware will lie above the
    memory limit. Ordinarily these will be removed by
    memblock_enforce_memory_limit(.).

    Unfortunately, this means that these regions will then be mapped by
    acpi_os_ioremap(.) as device memory (instead of normal), and unaligned
    accesses will then provoke alignment faults.

    In this patch we adopt memblock_mem_limit_remove_map instead, and this
    preserves these ACPI data regions (marked NOMAP) thus ensuring that
    these regions are not mapped as device memory.

    For example, below is an alignment exception observed on ARM platform
    when booting the kernel with 'acpi=on mem=8G':

    ...
    Unable to handle kernel paging request at virtual address ffff0000080521e7
    pgd = ffff000008aa0000
    [ffff0000080521e7] *pgd=000000801fffe003, *pud=000000801fffd003, *pmd=000000801fffc003, *pte=00e80083ff1c1707
    Internal error: Oops: 96000021 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc3-next-20160616+ #172
    Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1001A 02/09/2016
    task: ffff800001ef0000 ti: ffff800001ef8000 task.ti: ffff800001ef8000
    PC is at acpi_ns_lookup+0x520/0x734
    LR is at acpi_ns_lookup+0x4a4/0x734
    pc : [] lr : [] pstate: 60000045
    sp : ffff800001efb8b0
    x29: ffff800001efb8c0 x28: 000000000000001b
    x27: 0000000000000001 x26: 0000000000000000
    x25: ffff800001efb9e8 x24: ffff000008a10000
    x23: 0000000000000001 x22: 0000000000000001
    x21: ffff000008724000 x20: 000000000000001b
    x19: ffff0000080521e7 x18: 000000000000000d
    x17: 00000000000038ff x16: 0000000000000002
    x15: 0000000000000007 x14: 0000000000007fff
    x13: ffffff0000000000 x12: 0000000000000018
    x11: 000000001fffd200 x10: 00000000ffffff76
    x9 : 000000000000005f x8 : ffff000008725fa8
    x7 : ffff000008a8df70 x6 : ffff000008a8df70
    x5 : ffff000008a8d000 x4 : 0000000000000010
    x3 : 0000000000000010 x2 : 000000000000000c
    x1 : 0000000000000006 x0 : 0000000000000000
    ...
    acpi_ns_lookup+0x520/0x734
    acpi_ds_load1_begin_op+0x174/0x4fc
    acpi_ps_build_named_op+0xf8/0x220
    acpi_ps_create_op+0x208/0x33c
    acpi_ps_parse_loop+0x204/0x838
    acpi_ps_parse_aml+0x1bc/0x42c
    acpi_ns_one_complete_parse+0x1e8/0x22c
    acpi_ns_parse_table+0x8c/0x128
    acpi_ns_load_table+0xc0/0x1e8
    acpi_tb_load_namespace+0xf8/0x2e8
    acpi_load_tables+0x7c/0x110
    acpi_init+0x90/0x2c0
    do_one_initcall+0x38/0x12c
    kernel_init_freeable+0x148/0x1ec
    kernel_init+0x10/0xec
    ret_from_fork+0x10/0x40
    Code: b9009fbc 2a00037b 36380057 3219037b (b9400260)
    ---[ end trace 03381e5eb0a24de4 ]---
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    With 'efi=debug', we can see those ACPI regions loaded by firmware on
    that board as:

    efi: 0x0083ff185000-0x0083ff1b4fff [Reserved | | | | | | | | |WB|WT|WC|UC]*
    efi: 0x0083ff1b5000-0x0083ff1c2fff [ACPI Reclaim Memory| | | | | | | | |WB|WT|WC|UC]*
    efi: 0x0083ff223000-0x0083ff224fff [ACPI Memory NVS | | | | | | | | |WB|WT|WC|UC]*

    Link: http://lkml.kernel.org/r/1468475036-5852-3-git-send-email-dennis.chen@arm.com
    Acked-by: Steve Capper
    Signed-off-by: Dennis Chen
    Cc: Catalin Marinas
    Cc: Ard Biesheuvel
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Tony Luck
    Cc: Ingo Molnar
    Cc: Rafael J. Wysocki
    Cc: Will Deacon
    Cc: Mark Rutland
    Cc: Matt Fleming
    Cc: Kaly Xin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Chen
     
  • In some cases, memblock is queried by kernel to determine whether a
    specified address is RAM or not. For example, the ACPI core needs this
    information to determine which attributes to use when mapping ACPI
    regions (acpi_os_ioremap). Use of incorrect memory types can result in
    faults, data corruption, or other issues.

    Removing memory with memblock_enforce_memory_limit() throws away this
    information, and so a kernel booted with 'mem=' may suffer from the
    issues described above. To avoid this, we need to keep those NOMAP
    regions instead of removing all above the limit, which preserves the
    information we need while preventing other use of those regions.

    This patch adds new infrastructure to retain all NOMAP memblock regions
    while removing others, to cater for this.

    Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
    Signed-off-by: Dennis Chen
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Ard Biesheuvel
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Tony Luck
    Cc: Ingo Molnar
    Cc: Rafael J. Wysocki
    Cc: Will Deacon
    Cc: Mark Rutland
    Cc: Matt Fleming
    Cc: Kaly Xin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Chen
     
  • We currently show:

    task: <pointer> ti: <pointer> task.ti: <pointer>

    "ti" and "task.ti" are redundant, and neither is actually what we want
    to show, which is the base of the thread stack. Change the display to
    show the stack pointer explicitly.

    Link: http://lkml.kernel.org/r/543ac5bd66ff94000a57a02e11af7239571a3055.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • We'll need this cleanup to make the cpu field in thread_info be
    optional.

    Link: http://lkml.kernel.org/r/da298328dc77ea494576c2f20a934218e758a6fa.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Jason Wessel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • We should account for stacks regardless of stack size, and we need to
    account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
    to kilobytes and move it into account_kernel_stack().

    Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
    Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.
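
    A sketch of the per-stack accounting in KiB described above (the real
    code lives in account_kernel_stack(); details may differ):

    static void account_kernel_stack(unsigned long *stack, int account)
    {
            struct zone *zone = page_zone(virt_to_page(stack));

            /* charge in 1 KiB units so THREAD_SIZE < PAGE_SIZE works too */
            mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                                THREAD_SIZE / 1024 * account);
    }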

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
    ifdef guards to just ZONE_DEVICE.

    Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Cc: Eric Sandeen
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
    CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
    However, now that the ZONE_DMA conflict has been eliminated it no longer
    makes sense to require CONFIG_EXPERT.

    Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Eric Sandeen
    Reported-by: Jeff Moyer
    Acked-by: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • asm-generic headers are generic implementations for architecture
    specific code and should not be included by common code. Thus use the
    asm/ version of sections.h to get at the linker sections.

    Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The definition of the return value of madvise_free_huge_pmd() was not
    clear before. As suggested by Minchan Kim, change the return type to bool
    and return true if we do MADV_FREE successfully on the entire pmd page,
    otherwise return false. Comments are added too.

    Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Use ClearPagePrivate/ClearPagePrivate2 helpers to clear
    PG_private/PG_private_2 in page->flags

    Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Add __init/__exit attributes to functions that are only called at module
    init/exit, to save memory.

    Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Cc: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Some minor comment changes:

    1) update the zs_malloc() / zs_create_pool() function headers
    2) update "Usage of struct page fields"

    Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran