25 Feb, 2017

40 commits

  • There is <linux/compiler.h>, which provides macros for various
    gcc-specific constructs, e.g. __weak for __attribute__((weak)). I've
    cleaned up all instances of gcc-specific attributes, using the right
    macros, in all files under /arch/m68k.
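
    As a rough illustration of the kind of substitution involved (the
    exact call sites in arch/m68k are assumptions here):

    /* before: raw gcc attribute */
    void __attribute__((weak)) trap_init(void);

    /* after: the <linux/compiler.h> macro */
    void __weak trap_init(void);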

    Link: http://lkml.kernel.org/r/1485540901-1988-3-git-send-email-gidisrael@gmail.com
    Signed-off-by: Gideon Israel Dsouza
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • Add __mode(x) to compiler-gcc.h as part of a cleanup task I've taken
    up to replace gcc-specific attributes with macros.

    The next patch is a cleanup of the m68k subsystem, and it requires a
    new macro to wrap __attribute__((mode (...))).
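
    The new macro is presumably a thin wrapper along these lines, with a
    typical use being a double-word integer typedef (the typedef site is
    hypothetical):

    /* include/linux/compiler-gcc.h */
    #define __mode(x)  __attribute__((mode(x)))

    /* example use: a 64-bit integer type via mode(DI) */
    typedef unsigned int unsigned_di_t __mode(DI);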

    Link: http://lkml.kernel.org/r/1485540901-1988-2-git-send-email-gidisrael@gmail.com
    Signed-off-by: Gideon Israel Dsouza
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • The timer APIs this header needs are ktime_get(), ktime_add_us(), and
    ktime_compare(), so including <linux/ktime.h> seems enough. This
    commit cuts unnecessary header file parsing.
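
    For context, the polling pattern in this header needs only those three
    APIs; a minimal sketch of the loop shape, simplified from the real
    readx_poll_timeout-style macros (val, addr, cond and timeout_us stand
    in for the macro parameters):

    ktime_t timeout = ktime_add_us(ktime_get(), timeout_us);

    for (;;) {
            val = readl(addr);
            if (cond)
                    break;
            if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) {
                    val = readl(addr);
                    break;
            }
    }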

    Link: http://lkml.kernel.org/r/1481679225-10885-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Commit 63159f5dcccb ("uapi: Use __kernel_long_t in struct mq_attr")
    changed the types from long to __kernel_long_t, but didn't add a
    linux/types.h include. Code that tries to include this header directly
    breaks:

    /usr/include/linux/mqueue.h:26:2: error: unknown type name '__kernel_long_t'
    __kernel_long_t mq_flags; /* message queue flags */

    This also upsets configure tests for this header:

    checking linux/mqueue.h usability... no
    checking linux/mqueue.h presence... yes
    configure: WARNING: linux/mqueue.h: present but cannot be compiled
    configure: WARNING: linux/mqueue.h: check for missing prerequisite headers?
    configure: WARNING: linux/mqueue.h: see the Autoconf documentation
    configure: WARNING: linux/mqueue.h: section "Present But Cannot Be Compiled"
    configure: WARNING: linux/mqueue.h: proceeding with the compiler's result
    checking for linux/mqueue.h... no
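
    The fix is presumably a one-line include so the header is
    self-contained (field shown as in the error above):

    /* include/uapi/linux/mqueue.h */
    #include <linux/types.h>

    struct mq_attr {
            __kernel_long_t mq_flags;  /* message queue flags */
            ...
    };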

    Link: http://lkml.kernel.org/r/20170119194644.4403-1-vapier@gentoo.org
    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Previously, the hidepid parameter was checked by comparing literal
    integers 0, 1, 2. Let's add a proper enum for this, to make the
    checking more expressive:

    0 → HIDEPID_OFF
    1 → HIDEPID_NO_ACCESS
    2 → HIDEPID_INVISIBLE

    This changes the internal labelling only, the userspace-facing interface
    remains unmodified, and still works with literal integers 0, 1, 2.

    No functional changes.
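
    Based on the mapping above, the new enum presumably reads:

    enum {
            HIDEPID_OFF       = 0,
            HIDEPID_NO_ACCESS = 1,
            HIDEPID_INVISIBLE = 2,
    };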

    Link: http://lkml.kernel.org/r/1484572984-13388-2-git-send-email-djalal@gmail.com
    Signed-off-by: Lafcadio Wluiki
    Signed-off-by: Djalal Harouni
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lafcadio Wluiki
     
  • After staring at this code for a while, I've figured out that a small
    2-entry array describing ARGV and ENVP is the way to address the
    code-duplication critique.

    Link: http://lkml.kernel.org/r/20170105185724.GA12027@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with rbtree.
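
    Since rb_entry() is just a container_of() specialized for rbtree
    nodes, the conversion is mechanical; a sketch with a hypothetical
    struct:

    struct foo {
            struct rb_node node;
            int key;
    };

    /* before */
    struct foo *f = container_of(rb, struct foo, node);
    /* after: same result, clearer intent */
    struct foo *f = rb_entry(rb, struct foo, node);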

    Link: http://lkml.kernel.org/r/4fd1f82818665705ce75c5156a060ae7caa8e0a9.1482160150.git.geliangtang@gmail.com
    Signed-off-by: Geliang Tang
    Cc: Jan Kara
    Cc: Al Viro
    Cc: "David S. Miller"
    Cc: Juergen Gross
    Cc: Dmitry Torokhov
    Cc: Seth Forshee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Given that the arch does not add its own implementations, simply use the
    asm-generic/current.h (generic-y) header instead of duplicating code.
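
    The arch here is alpha (judging by the Cc list); the mechanism is the
    usual one-line Kbuild change plus removal of the hand-written header:

    # arch/alpha/include/asm/Kbuild
    generic-y += current.h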

    Link: http://lkml.kernel.org/r/1485992878-4780-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The build of the frv defconfig gives a warning:

    arch/frv/mb93090-mb00/pci-frv.c:176:5: warning: ignoring return value of 'pci_assign_resource', declared with attribute warn_unused_result

    Just print an error message to silence the warning; we cannot do much
    here on error.
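
    A sketch of the kind of change described (the local variable names are
    assumptions):

    err = pci_assign_resource(dev, idx);
    if (err)
            pr_err("pci-frv: failed to assign resource %d on %s\n",
                   idx, pci_name(dev));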

    Link: http://lkml.kernel.org/r/1484256471-5379-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     
  • Add a kasan test which uses a SLAB_ACCOUNT slab cache. If the test is
    run within a non-default memcg, it uncovers the bug fixed by "kasan:
    drain quarantine of memcg slab objects"[1].

    If run without fix [1], it shows "Slab cache still has objects", and
    the kmem_cache structure is leaked.

    Here's an unpatched kernel test:

    $ dmesg -c > /dev/null
    $ mkdir /sys/fs/cgroup/memory/test
    $ echo $$ > /sys/fs/cgroup/memory/test/tasks
    $ modprobe test_kasan 2> /dev/null
    $ dmesg | grep -B1 still
    [ 123.456789] kasan test: memcg_accounted_kmem_cache allocate memcg accounted object
    [ 124.456789] kmem_cache_destroy test_cache: Slab cache still has objects

    Kernels with fix [1] don't have the "Slab cache still has objects"
    warning or the underlying leak.

    The new test runs and passes in the default (root) memcg, though in the
    root memcg it won't uncover the problem fixed by [1].
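
    A sketch of what such a test does (the real test_kasan.c function may
    differ in details):

    static noinline void __init memcg_accounted_kmem_cache(void)
    {
            struct kmem_cache *cache;
            void *p;

            cache = kmem_cache_create("test_cache", 128, 0,
                                      SLAB_ACCOUNT, NULL);
            if (!cache)
                    return;

            p = kmem_cache_alloc(cache, GFP_KERNEL);
            kmem_cache_free(cache, p);

            /* leaks the per-memcg cache on unpatched kernels */
            kmem_cache_destroy(cache);
    }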

    Link: http://lkml.kernel.org/r/1482257462-36948-2-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Per-memcg slab accounting and kasan have a problem with kmem_cache
    destruction.
    - kmem_cache_create() allocates a kmem_cache, which is used for
      allocations from processes running in the root (top) memcg.
    - Processes running in a non-root memcg and allocating with either
      __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per-memcg
      kmem_cache.
    - Kasan catches use-after-free by having kfree() and kmem_cache_free()
      defer freeing of objects. Objects are placed in a quarantine.
    - kmem_cache_destroy() destroys root and non-root kmem_caches. It
      takes care to drain the quarantine of objects from the root memcg's
      kmem_cache, but ignores objects associated with non-root memcgs.
      This causes leaks because quarantined per-memcg objects refer to the
      per-memcg kmem cache being destroyed.

    To see the problem:

    1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
    2) from a non-root memcg, allocate and free a few objects from the
       cache
    3) dispose of the cache with kmem_cache_destroy()

    kmem_cache_destroy() will trigger a "Slab cache still has objects"
    warning, indicating that the per-memcg kmem_cache structure was
    leaked.

    Fix the leak by draining kasan quarantined objects allocated from non
    root memcg.

    Racing memcg deletion is tricky, but handled. kmem_cache_destroy() =>
    shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
    flushes per memcg quarantined objects, even if that memcg has been
    rmdir'd and gone through memcg_deactivate_kmem_caches().

    This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
    enabled. So I don't think it's worth patching stable kernels.
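
    The shape of the fix, as described, is a kasan hook on the cache
    shutdown path (a sketch; the hook name is an assumption):

    /* called from shutdown_cache() for root and per-memcg caches */
    void kasan_cache_shutdown(struct kmem_cache *cache)
    {
            quarantine_remove_cache(cache);
    }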

    Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Vladimir Davydov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Commit 31bc3858ea3e ("add automatic onlining policy for the newly added
    memory") provides the capability to have added memory automatically
    onlined during add, but this appears to be slightly broken.

    The current implementation uses walk_memory_range() to call
    online_memory_block, which uses memory_block_change_state() to online
    the memory. Instead, we should be calling device_online() for the
    memory block in online_memory_block(). This would online the memory
    (the memory bus online routine memory_subsys_online() called from
    device_online calls memory_block_change_state()) and properly update the
    device struct offline flag.

    As a result of the current implementation, attempting to remove a memory
    block after adding it using auto online fails. This is because doing a
    remove, for instance

    echo offline > /sys/devices/system/memory/memoryXXX/state

    uses device_offline() which checks the dev->offline flag.
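
    The fix described above is essentially a one-liner (sketch):

    static int online_memory_block(struct memory_block *mem, void *arg)
    {
            /* goes through memory_subsys_online(), which calls
             * memory_block_change_state() and also updates
             * dev->offline */
            return device_online(&mem->dev);
    }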

    Link: http://lkml.kernel.org/r/20170222220744.8119.19687.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com
    Signed-off-by: Nathan Fontenot
    Cc: Michael Ellerman
    Cc: Michael Roth
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • With rw_page, page_endio() is used for completing IO on a page, and it
    propagates a write error to the address space if the IO fails. The
    problem is that it accesses page->mapping directly, which might be
    okay for file-backed pages but shouldn't happen for an anonymous
    page. Otherwise, it can corrupt one of the fields of the anon_vma
    under us and the system panics randomly.

    swap_writepage
      bdev_writepage
        ops->rw_page

    I encountered the BUG while developing a new zram feature, and it was
    really hard to figure out because it crashed randomly: sometimes an
    mmap_sem lockdep splat, sometimes crashes in places never related to
    zram/zsmalloc, and it was not reproducible with some configurations.

    Considering how subtle the bug is and that people do fast-swap tests
    with brd, I think it's worth a stable mark.
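
    A sketch of the kind of guard needed in page_endio() on the write
    error path (the exact form of the actual patch may differ):

    if (err) {
            SetPageError(page);
            /* page->mapping overlaps anon_vma for anonymous pages;
             * only propagate the error for file-backed pages */
            if (page->mapping && !PageAnon(page))
                    mapping_set_error(page->mapping, err);
    }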

    Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()")
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We are using the wrong flag value in the task_numa_fault() function.
    This can result in wrong numa fault statistics updates, because we
    update numa_pages_migrated, numa_faults_locality, etc. based on the
    flag argument passed.
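
    A sketch of the mixup being described (assumed, not the literal diff):
    the vm_fault flags were passed where the locally computed numa-fault
    flags were meant.

    /* before (wrong): vmf->flags are fault-handler flags */
    task_numa_fault(last_cpupid, page_nid, 1, vmf->flags);
    /* after: the TNF_* flags computed in this function */
    task_numa_fault(last_cpupid, page_nid, 1, flags);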

    Fixes: bae473a423 ("mm: introduce fault_env")
    Link: http://lkml.kernel.org/r/1487498395-9544-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Kirill A. Shutemov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Do the prot_none/FOLL_NUMA check after we are sure this is a THP pte.
    Archs can implement prot_none such that it can return true for regular
    pmd entries.

    Link: http://lkml.kernel.org/r/1487498326-8734-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Clean up the rest of the dma_addr_t and phys_addr_t type casting in
    mm: use %pad for dma_addr_t and %pa for phys_addr_t.
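
    Both specifiers take a pointer to the value, which avoids casting to
    unsigned long long; for example:

    dma_addr_t dma_handle;
    phys_addr_t phys;

    pr_info("dma %pad, phys %pa\n", &dma_handle, &phys);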

    Link: http://lkml.kernel.org/r/1486618489-13912-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
  • The class index and fullness group are not encoded in
    (first)page->mapping any more, since commit 3783689a1aa8 ("zsmalloc:
    introduce zspage structure"). Instead, they are stored in struct
    zspage.

    Just delete this unneeded comment.

    Link: http://lkml.kernel.org/r/1486620822-36826-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Sergey Senozhatsky
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • arch_zone_lowest/highest_possible_pfn[] are initialized to 0, and
    [ZONE_MOVABLE] is skipped in the loop, so there is no need to reset
    them to 0 again.

    This patch just removes the redundant code.

    Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Cc: Anshuman Khandual
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • We had used page->lru to link the component pages (except the first
    page) of a zspage, and used INIT_LIST_HEAD(&page->lru) to init it.
    Therefore, to get the last page's next page, which is NULL, we had to
    use page flag PG_Private_2 to identify it.

    But now, we use page->freelist to link all of the pages in zspage and
    init the page->freelist as NULL for last page, so no need to use
    PG_Private_2 anymore.

    This removes the redundant SetPagePrivate2 in create_page_chain() and
    ClearPagePrivate2 in reset_page(), saving a few cycles on migration of
    zsmalloc pages :)

    Link: http://lkml.kernel.org/r/1487076509-49270-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • At the end of a window period, if the number of reclaimed pages is
    greater than the number scanned, an unsigned underflow can result in a
    huge pressure value and thus a critical event. Reclaimed can exceed
    scanned because reclaimed slab pages are added to the reclaimed count
    in shrink_node() without a corresponding increment to the scanned
    count.

    Minchan Kim mentioned that this can also happen in the case of a THP
    page where the scanned is 1 and reclaimed could be 512.
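
    A sketch of the clamp, assuming the fix simply treats
    reclaimed >= scanned as no pressure instead of letting the unsigned
    subtraction wrap:

    /* in vmpressure_calc_level(), roughly */
    if (reclaimed >= scanned)
            return 0;  /* assumed: report no pressure */
    pressure = 100 - (100 * reclaimed / scanned);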

    Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vladimir Davydov
    Cc: Anton Vorontsov
    Cc: Shiraz Hashim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include
    it.
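
    The stub follows the usual CONFIG_SHMEM pattern in shmem_fs.h (a
    sketch):

    #ifdef CONFIG_SHMEM
    extern bool shmem_mapping(struct address_space *mapping);
    #else
    static inline bool shmem_mapping(struct address_space *mapping)
    {
            return false;
    }
    #endif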

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When node_reclaim_mode isn't 0, the page allocator tries to reclaim
    pages if the amount of free memory in the zones is below the low
    watermark. On the Power platform, none of the NUMA nodes is scanned
    for page reclaim, because no node matches the condition in
    zone_allows_reclaim(). On Power, RECLAIM_DISTANCE is set to 10, which
    is the distance of Node-A to Node-A, so even the preferred node won't
    be scanned for page reclaim.

    __alloc_pages_nodemask()
    get_page_from_freelist()
    zone_allows_reclaim()

    Anton proposed the test code below:

    # cat alloc.c
    :
    int main(int argc, char *argv[])
    {
    void *p;
    unsigned long size;
    unsigned long start, end;

    start = time(NULL);
    size = strtoul(argv[1], NULL, 0);
    printf("To allocate %ldGB memory\n", size);

    size <<= 30;
    :

    Without the patch, the pagecache on node-0 is not reclaimed:

    # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
    sync; \
    echo 3 > /proc/sys/vm/drop_caches; \
    # taskset -c 0 cat file.32G > /dev/null; \
    grep FilePages /sys/devices/system/node/node0/meminfo
    Node 0 FilePages: 33619712 kB
    # taskset -c 0 ./alloc 128
    # grep FilePages /sys/devices/system/node/node0/meminfo
    Node 0 FilePages: 33619840 kB
    # grep MemFree /sys/devices/system/node/node0/meminfo
    Node 0 MemFree: 186816 kB

    With the patch applied, the pagecache on node-0 is reclaimed when its
    free memory is running out. It's the expected behaviour.

    # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
    sync; \
    echo 3 > /proc/sys/vm/drop_caches
    # taskset -c 0 cat file.32G > /dev/null; \
    grep FilePages /sys/devices/system/node/node0/meminfo
    Node 0 FilePages: 33605568 kB
    # taskset -c 0 ./alloc 128
    # grep FilePages /sys/devices/system/node/node0/meminfo
    Node 0 FilePages: 1379520 kB
    # grep MemFree /sys/devices/system/node/node0/meminfo
    Node 0 MemFree: 317120 kB
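
    Given that RECLAIM_DISTANCE equals a node's distance to itself on
    Power, the fix is presumably making the comparison inclusive (sketch):

    static bool zone_allows_reclaim(struct zone *local_zone,
                                    struct zone *zone)
    {
            return node_distance(zone_to_nid(local_zone),
                                 zone_to_nid(zone)) <= RECLAIM_DISTANCE;
    }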

    Fixes: 5f7a75acdb24 ("mm: page_alloc: do not cache reclaim distances")
    Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Anton Blanchard
    Cc: Michael Ellerman
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • When mainline introduced commit a96dfddbcc04 ("base/memory, hotplug:
    fix a kernel oops in show_valid_zones()"), it obtained the valid start
    and end pfn from the given pfn range. The valid start pfn can fix the
    actual issue, but it introduced another issue: the valid end pfn may
    exceed the given end_pfn.

    Although the incorrect overflow does not result in an actual problem
    at present, I think it needs to be fixed.

    [toshi.kani@hpe.com: remove assumption that end_pfn is aligned by MAX_ORDER_NR_PAGES]
    Fixes: a96dfddbcc04 ("base/memory, hotplug: fix a kernel oops in show_valid_zones()")
    Link: http://lkml.kernel.org/r/1486467299-22648-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • The idea is that, without doing more calculations, we extend zero
    pages to same-element pages for zram. A zero page is a special case of
    a same-element page, with zero as the element.

    1. the test was done under android 7.0
    2. many applications were started up, cycling through them repeatedly
    3. the zero pages, same pages (non-zero element) and total pages were
       sampled in the function page_zero_filled

    The result is listed below:

              ZERO     SAME     TOTAL
             36214    17842    598196

             ZERO/TOTAL   SAME/TOTAL   (ZERO+SAME)/TOTAL  ZERO/SAME
    AVERAGE  0.060631909  0.024990816  0.085622726        2.663825038
    STDEV    0.00674612   0.005887625  0.009707034        2.115881328
    MAX      0.069698422  0.030046087  0.094975336        7.56043956
    MIN      0.03959586   0.007332205  0.056055193        1.928985507

    From the above data, the benefit is about 2.5% and up to 3% of total
    swapped-out pages.

    The drawback of the patch is that when we recover a page from a
    non-zero element, the operations are inefficient for partial reads.

    This patch extends zero_page to same_page, so any user who has been
    monitoring zero_pages may be surprised that the number increases, but
    it's not harmful, I believe.

    [minchan@kernel.org: do not free same element pages in zram_meta_free]
    Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox
    Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com
    Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org
    Signed-off-by: zhouxianrong
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhouxianrong
     
  • The likely/unlikely profiler noticed that the unlikely statement in
    wb_domain_writeout_inc() is constantly wrong. This is due to the "not"
    (!) being outside the unlikely statement. It is likely that
    dom->period_time will be set, but unlikely that it won't be. Move the
    not into the unlikely statement.
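
    In other words, presumably:

    /* before: hints that dom->period_time being set is unlikely,
     * which is backwards */
    if (!unlikely(dom->period_time)) {

    /* after: hints that the negated test is the rare case */
    if (unlikely(!dom->period_time)) {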

    Link: http://lkml.kernel.org/r/20170206120035.3c2e2b91@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt (VMware)
     
  • With this, our protnone becomes a present pte with the READ/WRITE/EXEC
    bits cleared. By default we also set _PAGE_PRIVILEGED on such a pte.
    This is now used to help us identify a protnone pte that has the saved
    write bit. For such a pte, we will clear the _PAGE_PRIVILEGED bit.
    The pte still remains non-accessible from both user and kernel.

    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-4-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Without this, KSM will consider the page write-protected, but a numa
    fault can later mark the page writable. This can result in memory
    corruption.

    Link: http://lkml.kernel.org/r/1487498625-10891-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "Numabalancing preserve write fix", v2.

    This patch series addresses an issue w.r.t. THP migration and the
    autonuma preserve-write feature. migrate_misplaced_transhuge_page()
    cannot deal with concurrent modification of the page. It does a page
    copy without following the migration pte sequence. IIUC, this was done
    to keep the migration simpler, and at the time of implementation we
    didn't have THP page cache, which would have required a more elaborate
    migration scheme. That means thp autonuma migration expects the
    protnone-with-saved-write to be done such that both kernel and user
    cannot update the page content. This patch series enables archs like
    ppc64 to do that. We are good with the hash translation mode with the
    current code, because we never create a hardware page table entry for
    a protnone pte.

    This patch (of 2):

    Autonuma preserves the write permission across a numa fault to avoid
    taking a write fault after a numa fault (commit b191f9b106ea "mm:
    numa: preserve PTE write permissions across a NUMA hinting fault").
    Architectures can implement protnone in different ways, and some may
    choose to do so by clearing the Read/Write/Exec bits of the pte.
    Setting the write bit on such a pte can result in wrong behaviour.
    Fix this up by allowing the arch to override how to save the write bit
    on a protnone pte.
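
    A sketch of the override point, assuming the generic fallback just
    aliases the ordinary write bit:

    #ifndef pte_savedwrite
    /* archs that don't clear R/W/X for protnone need no special
     * handling (assumed fallback) */
    #define pte_savedwrite pte_write
    #define pte_mk_savedwrite pte_mkwrite
    #endif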

    [aneesh.kumar@linux.vnet.ibm.com: don't mark pte saved write in case of dirty_accountable]
    Link: http://lkml.kernel.org/r/1487942884-16517-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    [aneesh.kumar@linux.vnet.ibm.com: v3]
    Link: http://lkml.kernel.org/r/1487498625-10891-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1487050314-3892-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Michael Neuling
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 use a privileged access bit to mark a pte
    non-accessible. This implies that the kernel can do a copy_to_user to
    an address marked for a numa fault. It also implies that there can be
    a parallel hardware update for the pte. set_pte_at() cannot be used
    in such scenarios. Hence switch the pte update to use a
    ptep_get_and_clear() and set_pte_at() combination.
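
    The pattern described, as a sketch (variable names are assumptions):

    /* clear first so a concurrent hardware update can't be lost or
     * resurrect the old access permissions */
    oldpte = ptep_get_and_clear(mm, addr, ptep);
    newpte = pte_modify(oldpte, newprot);
    set_pte_at(mm, addr, ptep, newpte);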

    [akpm@linux-foundation.org: remove unwanted ppc change, per Aneesh]
    Link: http://lkml.kernel.org/r/1486400776-28114-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Running my likely/unlikely profiler, I discovered that the test in
    shmem_write_begin() that marks info->seals as unlikely is always
    incorrect. This is because shmem_get_inode() sets info->seals to have
    F_SEAL_SEAL set by default, and it is unlikely to be cleared when
    shmem_write_begin() is called. Thus, the if statement is very likely.

    But as the if statement block only cares about F_SEAL_WRITE and
    F_SEAL_GROW, change the test to only test those two bits.
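
    Presumably:

    /* before: almost always true, since F_SEAL_SEAL is set by default */
    if (unlikely(info->seals)) {

    /* after: only the seals this block acts on */
    if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE))) {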

    Link: http://lkml.kernel.org/r/20170203105656.7aec6237@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Hugh Dickins
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt (VMware)
     
  • Hillf Danton pointed out that since commit 1d82de618ddd ("mm, vmscan:
    make kswapd reclaim in terms of nodes"), PGDAT_WRITEBACK is no longer
    cleared.

    It was not noticed, as triggering it requires pages under writeback to
    cycle twice through the LRU before kswapd gets stalled.
    Historically, such issues tended to occur on small machines writing
    heavily to slow storage such as a USB stick.

    Once kswapd stalls, direct reclaim stalls may be higher, but because
    memory pressure is required, it would not be very noticeable.

    Michal Hocko suggested removing the flag entirely but the conservative
    fix is to restore the intended PGDAT_WRITEBACK behaviour and clear the
    flag when a suitable zone is balanced.
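
    A sketch of the conservative fix, assuming the flag is cleared
    alongside the other pgdat flags once a suitable zone is balanced:

    if (zone_balanced(zone, sc.order, classzone_idx)) {
            clear_bit(PGDAT_CONGESTED, &pgdat->flags);
            clear_bit(PGDAT_DIRTY, &pgdat->flags);
            clear_bit(PGDAT_WRITEBACK, &pgdat->flags);  /* restored */
    }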

    Fixes: 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
    Link: http://lkml.kernel.org/r/20170203203222.gq7hk66yc36lpgtb@suse.de
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fault wrappers drm_vm_fault(), drm_vm_shm_fault(),
    drm_vm_dma_fault() and drm_vm_sg_fault() used to provide extra logic
    beyond what was in the "drm_do_*" versions of these functions, but as of
    commit ca0b07d9a969 ("drm: convert drm from nopage to fault") they are
    just unnecessary wrappers that do nothing.

    Remove them, and rename the drm_do_* fault handlers to drop the "do_"
    since they no longer have corresponding wrappers.

    Link: http://lkml.kernel.org/r/1486155698-25717-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Sean Paul
    Acked-by: Daniel Vetter
    Cc: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Fix whitespace issues, extraneous braces.

    Link: http://lkml.kernel.org/r/1485992240-10986-5-git-send-email-me@tobin.cc
    Signed-off-by: Tobin C Harding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C Harding
     
  • This patch fixes the sparse warning "Using plain integer as NULL
    pointer" by replacing assignments of 0 to pointers with NULL.
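
    i.e. changes of this shape:

    /* before: sparse warns "Using plain integer as NULL pointer" */
    ptr = 0;
    /* after */
    ptr = NULL;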

    Link: http://lkml.kernel.org/r/1485992240-10986-2-git-send-email-me@tobin.cc
    Signed-off-by: Tobin C Harding
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobin C Harding
     
  • Link: http://lkml.kernel.org/r/20170202011942.1609-1-standby24x7@gmail.com
    Signed-off-by: Masanari Iida
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masanari Iida
     
  • __vmalloc_area_node() allocates pages to cover the requested vmalloc
    size. This can be a lot of memory. If the current task is killed by
    the OOM killer, and thus has unlimited access to memory reserves, it
    can theoretically consume all memory. Fix this by checking for
    fatal_signal_pending() and backing off early.
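
    A sketch of the bail-out in the page allocation loop of
    __vmalloc_area_node() (loop details simplified):

    for (i = 0; i < area->nr_pages; i++) {
            struct page *page;

            /* an OOM-killed task has unlimited reserve access;
             * stop before it eats everything */
            if (fatal_signal_pending(current))
                    goto fail;

            page = alloc_page(alloc_mask);
            if (!page)
                    goto fail;
            area->pages[i] = page;
    }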

    Link: http://lkml.kernel.org/r/20170201092706.9966-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Christoph Hellwig
    Cc: Tetsuo Handa
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are many reasons for CMA allocation failure, such as EBUSY,
    ENOMEM and EINTR, but so far we did not know the error reason. This
    patch prints the error value.

    Additionally, if CONFIG_CMA_DEBUG is enabled, this patch shows the
    bitmap status so we know the available pages. CMA internally tries
    all available regions, because some regions can fail with EBUSY. The
    bitmap status is useful to know the details of both ENOMEM and EBUSY:

    ENOMEM: not tried at all because there is no available region;
            the total region could be too small, or it could be a
            fragmentation issue
    EBUSY:  some regions were tried, but all failed

    This is an ENOMEM example with this patch:

    [2: Binder:714_1: 744] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12

    If CONFIG_CMA_DEBUG is enabled, the available pages are also shown in
    a concatenated size@position format. So 4@572 means that there are 4
    available pages at position 572 (counting from position 0).

    [2: Binder:714_1: 744] cma: number of available pages: 4@572+7@585+7@601+8@632+38@730+166@1114+127@1921=> 357 free of 2048 total pages

    Link: http://lkml.kernel.org/r/1485909785-3952-1-git-send-email-jaewon31.kim@samsung.com
    Signed-off-by: Jaewon Kim
    Acked-by: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED, which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN, because the slab allocations (for vmas,
    anon_vmas, or mempolicies) could not be satisfied at that moment.

    Hitting /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
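
    A sketch of the check, assuming it is made before splitting the vma:

    if (start != vma->vm_start) {
            /* splitting will add a vma; surface the hard limit as
             * ENOMEM rather than EAGAIN */
            if (unlikely(mm->map_count >= sysctl_max_map_count))
                    return -ENOMEM;
            error = __split_vma(mm, vma, start, 1);
            if (error)
                    return error;
    }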

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The callers of the DMA alloc functions already provide the proper
    context GFP flags. Make sure to pass them through to the CMA allocator,
    to make the CMA compaction context aware.

    Link: http://lkml.kernel.org/r/20170127172328.18574-3-l.stach@pengutronix.de
    Signed-off-by: Lucas Stach
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Radim Krcmar
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Zankel
    Cc: Ralf Baechle
    Cc: Paolo Bonzini
    Cc: Alexander Graf
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas Stach
     
  • Most users of this interface just want to use it with the default
    GFP_KERNEL flags, but for cases where DMA memory is allocated it may be
    called from a different context.

    No functional change yet, just passing through the flag to the
    underlying alloc_contig_range function.
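
    The visible change is presumably just the signature, with the flag
    forwarded to alloc_contig_range():

    /* before */
    struct page *cma_alloc(struct cma *cma, size_t count,
                           unsigned int align);
    /* after */
    struct page *cma_alloc(struct cma *cma, size_t count,
                           unsigned int align, gfp_t gfp_mask);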

    Link: http://lkml.kernel.org/r/20170127172328.18574-2-l.stach@pengutronix.de
    Signed-off-by: Lucas Stach
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Radim Krcmar
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Zankel
    Cc: Ralf Baechle
    Cc: Paolo Bonzini
    Cc: Alexander Graf
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucas Stach