21 Nov, 2018

1 commit

  • commit ac5b2c18911ffe95c08d69273917f90212cf5659 upstream.

    THP allocation might be really disruptive when allocated on NUMA system
    with the local node full or hard to reclaim. Stefan has posted an
    allocation stall report on 4.12 based SLES kernel which suggests the
    same issue:

    kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
    kvm cpuset=/ mems_allowed=0-1
    CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
    Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
    Call Trace:
    dump_stack+0x5c/0x84
    warn_alloc+0xe0/0x180
    __alloc_pages_slowpath+0x820/0xc90
    __alloc_pages_nodemask+0x1cc/0x210
    alloc_pages_vma+0x1e5/0x280
    do_huge_pmd_wp_page+0x83f/0xf00
    __handle_mm_fault+0x93d/0x1060
    handle_mm_fault+0xc6/0x1b0
    __do_page_fault+0x230/0x430
    do_page_fault+0x2a/0x70
    page_fault+0x7b/0x80
    [...]
    Mem-Info:
    active_anon:126315487 inactive_anon:1612476 isolated_anon:5
    active_file:60183 inactive_file:245285 isolated_file:0
    unevictable:15657 dirty:286 writeback:1 unstable:0
    slab_reclaimable:75543 slab_unreclaimable:2509111
    mapped:81814 shmem:31764 pagetables:370616 bounce:0
    free:32294031 free_pcp:6233 free_cma:0
    Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

    The defrag mode is "madvise" and from the above report it is clear that
    the THP has been allocated for a MADV_HUGEPAGE vma.

    Andrea has identified that the main source of the problem is
    __GFP_THISNODE usage:

    : The problem is that direct compaction combined with the NUMA
    : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
    : hard the local node, instead of failing the allocation if there's no
    : THP available in the local node.
    :
    : Such logic was ok until __GFP_THISNODE was added to the THP allocation
    : path even with MPOL_DEFAULT.
    :
    : The idea behind the __GFP_THISNODE addition, is that it is better to
    : provide local memory in PAGE_SIZE units than to use remote NUMA THP
    : backed memory. That largely depends on the remote latency though, on
    : threadrippers for example the overhead is relatively low in my
    : experience.
    :
    : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
    : extremely slow qemu startup with vfio, if the VM is larger than the
    : size of one host NUMA node. This is because it will try very hard to
    : unsuccessfully swapout get_user_pages pinned pages as result of the
    : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
    : allocations and instead of trying to allocate THP on other nodes (it
    : would be even worse without vfio type1 GUP pins of course, except it'd
    : be swapping heavily instead).

    Fix this by removing __GFP_THISNODE for THP requests which are
    requesting the direct reclaim. This effectively reverts 5265047ac301
    on the grounds that the zone/node reclaim was known to be disruptive
    due to premature reclaim when there was memory free. While it made
    sense at the time for HPC workloads without NUMA awareness on rare
    machines, it was ultimately harmful in the majority of cases. The
    existing behaviour is similarly harmful, even if not as widespread
    because it applies to a corner case, but crucially it cannot be tuned
    around the way zone_reclaim_mode can. The default behaviour should
    always be to cause the least harm for the common case.
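
    For context, a minimal userspace sketch (sizes and the touch loop are
    illustrative, not taken from the report) of the kind of mapping
    involved: an anonymous region marked with madvise(MADV_HUGEPAGE),
    whose faults take the THP allocation path discussed above when defrag
    is set to "madvise":

    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 30; /* 1 GiB, illustrative */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* With defrag=madvise, faults on this VMA may try THP allocation. */
            madvise(p, len, MADV_HUGEPAGE);
            memset(p, 0, len);      /* fault the range in */
            munmap(p, len);
            return 0;
    }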

    If there are specialised use cases out there that want
    zone_reclaim_mode-like behaviour in specific cases, then it can be
    built on top. Long term we should consider a memory policy which
    allows node-reclaim-like behaviour for specific memory ranges.
    [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

    Mel said:

    : Both patches look correct to me but I'm responding to this one because
    : it's the fix. The change makes sense and moves further away from the
    : severe stalling behaviour we used to see with both THP and zone reclaim
    : mode.
    :
    : I put together a basic experiment with usemem configured to reference a
    : buffer multiple times that is 80% the size of main memory on a 2-socket
    : box with symmetric node sizes and defrag set to "always". The defrag
    : setting is not the default but it would be functionally similar to
    : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
    : reference the buffer multiple times and while it's not an interesting
    : workload, it would be expected to complete reasonably quickly as it fits
    : within memory. The results were;
    :
    :                                usemem
    :                               vanilla     noreclaim-v1
    : Amean     Elapsd-1       42.78 (  0.00%)       26.87 ( 37.18%)
    : Amean     Elapsd-3       27.55 (  0.00%)        7.44 ( 73.00%)
    : Amean     Elapsd-4        5.72 (  0.00%)        5.69 (  0.45%)
    :
    : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
    : threads referencing buffers 80% the size of memory. With the patches
    : applied, it's 37.18% faster for the single thread and 73% faster with two
    : threads. Note that 4 threads showing little difference does not indicate
    : the problem is related to thread counts. It's simply the case that 4
    : threads gets spread so their workload mostly fits in one node.
    :
    : The overall view from /proc/vmstats is more startling
    :
    :                            4.19.0-rc1      4.19.0-rc1
    :                               vanilla  noreclaim-v1r1
    : Minor Faults                 35593425          708164
    : Major Faults                   484088              36
    : Swap Ins                      3772837               0
    : Swap Outs                     3932295               0
    :
    : Massive amounts of swap in/out without the patch
    :
    : Direct pages scanned 6013214 0
    : Kswapd pages scanned 0 0
    : Kswapd pages reclaimed 0 0
    : Direct pages reclaimed 4033009 0
    :
    : Lots of reclaim activity without the patch
    :
    : Kswapd efficiency 100% 100%
    : Kswapd velocity 0.000 0.000
    : Direct efficiency 67% 100%
    : Direct velocity 11191.956 0.000
    :
    : Mostly from direct reclaim context as you'd expect without the patch.
    :
    : Page writes by reclaim 3932314.000 0.000
    : Page writes file 19 0
    : Page writes anon 3932295 0
    : Page reclaim immediate 42336 0
    :
    : Writes from reclaim context is never good but the patch eliminates it.
    :
    : We should never have default behaviour to thrash the system for such a
    : basic workload. If zone reclaim mode behaviour is ever desired but on a
    : single task instead of a global basis then the sensible option is to build
    : a mempolicy that enforces that behaviour.

    This was a severe regression compared to previous kernels that made
    important workloads unusable, and it started when __GFP_THISNODE was
    added to THP allocations under MADV_HUGEPAGE. It is not a significant
    risk to go back to the previous behaviour before __GFP_THISNODE was
    added; it worked like that for years.

    This was simply an optimization for some lucky workloads that can fit
    in a single node, but it ended up breaking the VM for others that
    cannot possibly fit in a single node, so going back is safe.

    [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
    Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
    Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Michal Hocko
    Reported-by: Stefan Priebe
    Debugged-by: Andrea Arcangeli
    Reported-by: Alex Williamson
    Reviewed-by: Mel Gorman
    Tested-by: Mel Gorman
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

30 May, 2018

1 commit

  • [ Upstream commit 8970a63e965b43288c4f5f40efbc2bbf80de7f16 ]

    Alexander reported a use of uninitialized memory in __mpol_equal(),
    which is caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
    it uses numa_node_id() instead of preferred_node; however,
    __mpol_equal() compares preferred_node without checking whether the
    policy is MPOL_F_LOCAL or not.
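
    For illustration, a minimal sketch of how the MPOL_F_LOCAL case arises
    from userspace (assuming libnuma's <numaif.h> wrapper; link with
    -lnuma): MPOL_PREFERRED with an empty nodemask means "prefer the local
    node", and such a policy never initialises preferred_node.

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* MPOL_PREFERRED + empty nodemask == "prefer local node"; the
             * kernel flags such a policy MPOL_F_LOCAL and leaves
             * preferred_node unset, which is the field __mpol_equal()
             * must not compare in that case. */
            if (set_mempolicy(MPOL_PREFERRED, NULL, 0) != 0)
                    fprintf(stderr, "set_mempolicy: %s\n", strerror(errno));
            return 0;
    }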

    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     

26 Apr, 2018

2 commits

  • [ Upstream commit 0486a38bcc4749808edbc848f1bcf232042770fc ]

    As the manpage of migrate_pages(2) states, errno should be set to
    EINVAL when none of the node IDs specified by new_nodes are on-line
    and allowed by the process's current cpuset context, or none of the
    specified nodes contain memory. However, when tested with the
    following case:

    new_nodes = 0;
    old_nodes = 0xf;
    ret = migrate_pages(pid, old_nodes, new_nodes, MAX);

    ret will be 0 and no errno is set. As new_nodes is empty, we should
    expect EINVAL as documented.

    To fix cases like the one above, this patch checks whether the AND of
    the target nodes and the current task_nodes is empty, and then checks
    whether its AND with node_states[N_MEMORY] is empty.
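
    A self-contained variant of the reproducer (a sketch assuming
    libnuma's <numaif.h> wrapper and its argument order; link with
    -lnuma): with the fix, the call fails with EINVAL instead of
    returning 0.

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            unsigned long old_nodes = 0xf;  /* "from" nodes 0-3 */
            unsigned long new_nodes = 0;    /* empty target set */

            /* pid 0 means the calling process; maxnode 64 covers one long. */
            long ret = migrate_pages(0, 64, &old_nodes, &new_nodes);

            printf("ret=%ld errno=%s\n", ret, ret ? strerror(errno) : "-");
            return 0;
    }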

    Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Chris Salls
    Cc: Christopher Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: Tan Xiaojun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     
  • [ Upstream commit 56521e7a02b7b84a5e72691a1fb15570e6055545 ]

    As Xiaojun reported, the LTP test migrate_pages01 fails on an arm64
    system which has 4 nodes [0...3], all with memory, and
    CONFIG_NODES_SHIFT=2:

    migrate_pages01 0 TINFO : test_invalid_nodes
    migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
    migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly

    In this case the test_invalid_nodes part of migrate_pages01 calls
    SYSC_migrate_pages as:

    migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0

    The new nodes mask specifies one or more node IDs that are greater
    than the maximum supported node ID; however, errno is not set to
    EINVAL as expected.

    As the man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
    mention, when nodemask specifies one or more node IDs that are greater
    than the maximum supported node ID, errno should be set to EINVAL.
    However, get_nodes only checks whether the bits in
    [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) are zero or not,
    and leaves [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES))
    unchecked.

    This patch checks the bits in [MAX_NUMNODES, maxnode) in get_nodes so
    that migrate_pages sets errno to EINVAL when nodemask specifies one or
    more node IDs that are greater than the maximum supported node ID,
    following the man pages.

    [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
    [2] http://man7.org/linux/man-pages/man2/mbind.2.html
    [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
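
    For illustration, the same call expressed as a small program (a sketch
    assuming libnuma's <numaif.h> wrapper; link with -lnuma). Whether node
    4 is actually beyond the maximum supported node ID depends on the
    kernel's CONFIG_NODES_SHIFT, as in the arm64 case above:

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            unsigned long old_nodes = 0x1;  /* node 0 */
            unsigned long new_nodes = 0x10; /* node 4 */

            /* With MAX_NUMNODES == 4 (CONFIG_NODES_SHIFT=2), node 4 is
             * beyond the maximum supported node ID: the fixed kernel
             * returns EINVAL, while the unfixed one returned 0. */
            long ret = migrate_pages(0, 64, &old_nodes, &new_nodes);

            printf("ret=%ld errno=%s\n", ret, ret ? strerror(errno) : "-");
            return 0;
    }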

    Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reported-by: Tan Xiaojun
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Chris Salls
    Cc: Christopher Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     

14 Oct, 2017

1 commit

  • Commit 3a321d2a3dde ("mm: change the call sites of numa statistics
    items") separated NUMA counters from zone counters, but the
    NUMA_INTERLEAVE_HIT call site wasn't updated to use the new interface.
    So alloc_page_interleave() actually increments NR_ZONE_INACTIVE_FILE
    instead of NUMA_INTERLEAVE_HIT.

    Fix this by using __inc_numa_state() interface to increment
    NUMA_INTERLEAVE_HIT.

    Link: http://lkml.kernel.org/r/20171003191003.8573-1-aryabinin@virtuozzo.com
    Fixes: 3a321d2a3dde ("mm: change the call sites of numa statistics items")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Cc: Kemi Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

09 Sep, 2017

4 commits

  • VMA and its address bounds checks are too late in this function. They
    must have been verified earlier in the page fault sequence. Hence just
    remove them.

    Link: http://lkml.kernel.org/r/20170901130137.7617-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • While reading the code I found that offset_il_node() has a vm_area_struct
    pointer parameter which is unused.

    Link: http://lkml.kernel.org/r/1502899755-23146-1-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • This patch enables thp migration for mbind(2) and migrate_pages(2).

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: page migration enhancement for thp", v9.

    Motivations:

    1. THP migration becomes important in the upcoming heterogeneous memory
    systems. As David Nellans from NVIDIA pointed out from other threads
    (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html),
    future GPUs or other accelerators will have their memory managed by
    operating systems. Moving data into and out of these memory nodes
    efficiently is critical to applications that use GPUs or other
    accelerators. Existing page migration only supports base pages, which
    has a very low memory bandwidth utilization. My experiments (see
    below) show THP migration can migrate pages more efficiently.

    2. Base page migration vs THP migration throughput.

    Here are cross-socket page migration results from calling the
    move_pages() syscall (a minimal calling sketch follows this list):

    In x86_64, a Intel two-socket E5-2640v3 box,
    - single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
    - single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
    - 512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.

    In ppc64, a two-socket Power8 box,
    - single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
    - single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
    - 256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.

    THP migration can give us 3x and 1.15x throughput over base page
    migration in x86_64 and ppc64 respectively.

    You can test it out by using the code here:
    https://github.com/x-y-z/thp-migration-bench

    3. Existing page migration splits THP before migration and cannot
    guarantee the migrated pages are still contiguous. Contiguity is
    always what GPUs and accelerators look for. Without THP migration,
    khugepaged needs to do extra work to reassemble the migrated pages
    back to THPs.
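
    As referenced above, a minimal single-page move_pages() call looks
    roughly like this (a sketch assuming libnuma's <numaif.h> wrapper;
    link with -lnuma; target node 0 is illustrative):

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            void *page = aligned_alloc(psz, psz);
            int target = 0;         /* illustrative destination node */
            int status = -1;

            if (!page)
                    return 1;
            memset(page, 1, psz);   /* fault the page in first */

            /* pid 0 == calling process; status reports the resulting node. */
            if (move_pages(0, 1, &page, &target, &status, MPOL_MF_MOVE) != 0)
                    perror("move_pages");
            printf("page is now on node %d\n", status);

            free(page);
            return 0;
    }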

    This patch (of 10):

    Introduce a separate check routine related to the MPOL_MF_INVERT
    flag. This patch just does cleanup; there is no behavioral change.

    Link: http://lkml.kernel.org/r/20170717193955.20207-2-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Aug, 2017

1 commit

    I hit a use after free issue when executing trinity and reproduced it
    with KASAN enabled. The related call trace is as follows.

    BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
    Read of size 2 by task syz-executor1/798

    INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
    __slab_alloc+0x768/0x970
    kmem_cache_alloc+0x2e7/0x450
    mpol_new.part.2+0x74/0x160
    mpol_new+0x66/0x80
    SyS_mbind+0x267/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
    __slab_free+0x495/0x8e0
    kmem_cache_free+0x2f3/0x4c0
    __mpol_put+0x2b/0x40
    SyS_mbind+0x383/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
    INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

    Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
    Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........
    Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    Memory state around the buggy address:
    ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

    A !shared memory policy is not protected against parallel removal by
    another thread; that is normally provided by the mmap_sem.
    do_get_mempolicy(), however, drops the lock midway while we can still
    access the policy later.

    The early, premature up_read() is a historical artifact from the times
    when put_user() was called in this path (see
    https://lwn.net/Articles/124754/), but that has been gone since
    8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory
    policy layer."). The premature unlock remained, though, and with the
    current mempolicy ref count model it turned into this use-after-free.

    Fix the issue by removing the premature release.
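
    For reference, the syscall pair involved, in a minimal (race-free)
    sketch: mbind() installs a VMA policy and get_mempolicy(MPOL_F_ADDR)
    reads it back; the fuzzer hit the same path concurrently with the
    policy being dropped (assumes <numaif.h> from libnuma; link with
    -lnuma):

    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            unsigned long mask = 0x1;       /* node 0, illustrative */
            int mode = -1;
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            mbind(p, 4096, MPOL_BIND, &mask, 64, 0);        /* set a VMA policy */
            if (get_mempolicy(&mode, NULL, 0, p, MPOL_F_ADDR) == 0)
                    printf("vma policy mode = %d\n", mode); /* read it back */
            munmap(p, 4096);
            return 0;
    }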

    Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: [2.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

13 Jul, 2017

1 commit

    Page migration (for memory hotplug, soft_offline_page or mbind) needs
    to allocate new memory. This can trigger an oom killer if the target
    memory is depleted. Although quite unlikely, it is still possible,
    especially for memory hotplug (offlining of memory).

    Up to now we didn't really have reasonable means to back off.
    __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a
    single node, which is not suitable for all callers.

    But now that we have __GFP_RETRY_MAYFAIL we should use it. It is
    preferable to fail the migration than to disrupt the system by killing
    some processes.

    Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

4 commits

  • Two wrappers of __alloc_pages_nodemask() are checking
    task->mems_allowed_seq themselves to retry allocation that has raced
    with a cpuset update.

    This has been shown to be ineffective in preventing premature OOMs,
    which can happen in __alloc_pages_slowpath() long before it returns
    to the wrappers to detect the race at that level.

    Previous patches have made __alloc_pages_slowpath() more robust, so we
    can now simply remove the seqlock checking in the wrappers to prevent
    further wrong impression that it can actually help.

    Link: http://lkml.kernel.org/r/20170517081140.30654-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has introduced a two-step protocol when
    rebinding task's mempolicy due to cpuset update, in order to avoid a
    parallel allocation seeing an empty effective nodemask and failing.

    Later, commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory
    barrier related damage v3") introduced a seqlock protection and removed
    the synchronization point between the two update steps. At that point
    (or perhaps later), the two-step rebinding became unnecessary.

    Currently it only makes sure that the update first adds new nodes in
    step 1 and then removes nodes in step 2. Without memory barriers the
    effects are questionable, and even then this cannot prevent a parallel
    zonelist iteration, checking the nodemask at each step, from observing
    all nodes as unusable for allocation. We now fully rely on the seqlock
    to prevent premature OOMs and allocation failures.

    We can thus remove the two-step update parts and simplify the code.

    Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The main allocator function __alloc_pages_nodemask() takes a zonelist
    pointer as one of its parameters. All of its callers directly or
    indirectly obtain the zonelist via node_zonelist() using a preferred
    node id and gfp_mask. We can make the code a bit simpler by doing the
    zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
    id instead (gfp_mask is already another parameter).

    There are some code size benefits thanks to removal of inlined
    node_zonelist():

    bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

    This will also make things simpler if we proceed with converting cpusets
    to zonelists.

    Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Dimitri Sivanich
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The task->il_next variable stores the next allocation node id for task's
    MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
    bind mempolicies due to changing cpuset mems. Currently it also tries
    to make sure that current->il_next is valid within the updated nodemask.
    This is bogus, because 1) we are updating potentially any task's
    mempolicy, not just current, and 2) we might be updating a per-vma
    mempolicy, not task one.

    The interleave_nodes() function that uses il_next can cope fine with the
    value not being within the currently allowed nodes, so this hasn't
    manifested as an actual issue.

    We can remove the need for updating il_next completely by changing it to
    il_prev and store the node id of the previous interleave allocation
    instead of the next id. Then interleave_nodes() can calculate the next
    id using the current nodemask and also store it as il_prev, except when
    querying the next node via do_get_mempolicy().
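
    For illustration, a minimal userspace sketch of the policy this
    bookkeeping serves (assumes <numaif.h> from libnuma and a machine
    where nodes 0 and 1 are allowed; link with -lnuma):

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            unsigned long mask = 0x3;       /* interleave over nodes 0 and 1 */

            if (set_mempolicy(MPOL_INTERLEAVE, &mask, 64) != 0) {
                    fprintf(stderr, "set_mempolicy: %s\n", strerror(errno));
                    return 1;
            }
            /* Pages allocated under this policy are spread over the allowed
             * nodes; il_prev is the kernel-side cursor for interleaved
             * allocations that are not derived from a VMA offset. */
            char *buf = malloc(1 << 20);
            memset(buf, 0, 1 << 20);
            free(buf);
            return 0;
    }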

    Link: http://lkml.kernel.org/r/20170517081140.30654-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

02 Mar, 2017

3 commits

  • But first update the code that uses these facilities with the
    new header.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
    We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Jan, 2017

1 commit

  • Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to
    alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by
    mpol_cond_put() before accessing the embedded nodemask by
    __alloc_pages_nodemask(). The commit log says it's so "we can use a
    single exit path within the function" but that's clearly wrong. We can
    still do that when doing mpol_cond_put() after the allocation attempt.

    Make sure the mempolicy is not freed prematurely, otherwise
    __alloc_pages_nodemask() can end up using a bogus nodemask, which could
    lead e.g. to premature OOM.

    Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma")
    Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

13 Dec, 2016

3 commits

  • The MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES flags are irrelevant
    when setting them for MPOL_LOCAL NUMA memory policy via set_mempolicy or
    mbind.

    Return the "invalid argument" from set_mempolicy and mbind whenever any
    of these flags is passed along with MPOL_LOCAL.

    This is consistent with MPOL_PREFERRED passed with an empty nodemask.

    It also slightly shortens the execution time in paths where these
    flags are used, e.g. when trying to rebind the NUMA nodes for changes
    in cgroup cpuset mems (mpol_rebind_preferred()) or when just printing
    the mempolicy structure (/proc/PID/numa_maps). Isolated tests were done.
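
    For illustration, what the new check rejects, as a minimal sketch
    (assumes <numaif.h> from libnuma; link with -lnuma; the fallback macro
    values are from the kernel UAPI in case an older numaif.h lacks them):

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <string.h>

    #ifndef MPOL_LOCAL
    #define MPOL_LOCAL 4                    /* kernel UAPI value */
    #endif
    #ifndef MPOL_F_STATIC_NODES
    #define MPOL_F_STATIC_NODES (1 << 15)   /* kernel UAPI value */
    #endif

    int main(void)
    {
            /* MPOL_LOCAL takes no nodemask, so the static/relative nodes
             * flags are meaningless with it; the kernel now rejects the
             * combination. */
            long ret = set_mempolicy(MPOL_LOCAL | MPOL_F_STATIC_NODES, NULL, 0);

            /* expect ret == -1 and errno == EINVAL */
            printf("ret=%ld errno=%s\n", ret, ret ? strerror(errno) : "-");
            return 0;
    }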

    Link: http://lkml.kernel.org/r/20161027163037.4089-1-kwapulinski.piotr@gmail.com
    Signed-off-by: Piotr Kwapulinski
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Liang Chen
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Nathan Zimmer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     
    __GFP_THISNODE is documented to enforce that the allocation is
    satisfied from the requested node with no fallbacks or placement
    policy enforcement. policy_zonelist seemingly breaks this semantic if
    the current policy is MPOL_BIND: instead of taking the node it will
    fall back to the first node in the mask if the requested one is not in
    the mask. This is confusing to say the least, because in fact we
    shouldn't ever go down that path. Firstly, tasks shouldn't be
    scheduled on CPUs with nodes outside of their mempolicy binding. And
    secondly, policy_zonelist is called only from 3 places:

    - huge_zonelist - never should do __GFP_THISNODE when going this path

    - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either

    - alloc_pages_current - which uses default_policy if __GFP_THISNODE is
    used

    So we shouldn't even need to care about this possibility and can drop
    the confusing code. Let's keep a WARN_ON_ONCE in place to catch
    potential users and fix them up properly (aka use a different allocation
    function which ignores mempolicy).

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • While doing MADV_DONTNEED on a large area of thp memory, I noticed we
    encountered many unlikely() branches in profiles for each backing
    hugepage. This is because zap_pmd_range() would call split_huge_pmd(),
    which rechecked the conditions that were already validated, but as part
    of an unlikely() branch.

    Avoid the unlikely() branch when in a context where pmd is known to be
    good for __split_huge_pmd() directly.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

19 Oct, 2016

1 commit

  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

08 Oct, 2016

1 commit

  • Use the existing enums instead of hardcoded index when looking at the
    zonelist. This makes it more readable. No functionality change by this
    patch.

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

02 Sep, 2016

1 commit

  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current

    This patch sets the task's mempolicy to NULL before dropping the final
    reference.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but that is a heavier calculation across
    multiple cache lines which happens much more frequently than the retry
    checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Here's basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix-tree if
    there's space for it;

    - shmem_undo_range() removes huge pages if they fall fully within the
    range. A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on what
    pages happened to be allocated.

    - no need to change shmem_fault: core-mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

20 May, 2016

3 commits

  • The allocator fast path looks up the first usable zone in a zonelist and
    then get_page_from_freelist does the same job in the zonelist iterator.
    This patch preserves the necessary information.

    4.6.0-rc2 4.6.0-rc2
    fastmark-v1r20 initonce-v1r20
    Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
    Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
    Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
    Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
    Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
    Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
    Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
    Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
    Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
    Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
    Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
    Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
    Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
    Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
    Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)

    The benefit is negligible and the results are within the noise but each
    cycle counts.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    This code was pretty obscure, relying upon obscure side-effects of
    next_node(-1, ...) and upon NUMA_NO_NODE being equal to -1.

    Clean that all up and document the function's intent.

    Acked-by: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
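
    A self-contained userspace analogue of the wrap-around pattern the
    helper captures (the 64-bit mask and MAX_NODES value are illustrative,
    not the kernel's nodemask_t):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 64    /* illustrative stand-in for MAX_NUMNODES */

    /* Return the next set bit after 'node', wrapping to the first set bit. */
    static int next_node_in(int node, uint64_t mask)
    {
            for (int i = node + 1; i < MAX_NODES; i++)
                    if (mask & (1ULL << i))
                            return i;
            for (int i = 0; i < MAX_NODES; i++)
                    if (mask & (1ULL << i))
                            return i;
            return MAX_NODES;               /* empty mask */
    }

    int main(void)
    {
            uint64_t mask = (1ULL << 1) | (1ULL << 3);      /* nodes {1, 3} */

            printf("%d %d %d\n", next_node_in(0, mask),     /* 1 */
                                 next_node_in(1, mask),     /* 3 */
                                 next_node_in(3, mask));    /* wraps to 1 */
            return 0;
    }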

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

16 Mar, 2016

1 commit

    VM_HUGETLB and VM_MIXEDMAP VMAs need to be excluded to avoid compound
    pages being marked for migration and unexpected COWs when handling a
    hugetlb fault.

    Thanks to Naoya Horiguchi for reminding me on these checks.

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Suggested-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Cc: SeongJae Park
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen
     

10 Mar, 2016

1 commit

  • We don't have native support of THP migration, so we have to split huge
    page into small pages in order to migrate it to different node. This
    includes PTE-mapped huge pages.

    I made a mistake in the refcounting patchset: we don't actually split
    a PTE-mapped huge page in queue_pages_pte_range() if we step on its
    head page.

    The result is that the head page is queued for migration, but none of
    the tail pages: putting the head page on the queue takes a pin on the
    page, and any subsequent attempt of split_huge_pages() would fail, so
    we skip queuing the tail pages.

    unmap_and_move_huge_page() will eventually split the huge pages, but
    only one of 512 pages would get migrated.

    Let's fix the situation.

    Fixes: 248db92da13f2507 ("migrate_pages: try to split pages on queuing")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Feb, 2016

1 commit

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

06 Feb, 2016

1 commit

    Maybe I am missing some point, but I don't see a reason why we try to
    queue pages from non-migratable VMAs.

    This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <numaif.h>

    #define SIZE 0x2000

    int foo;

    int main()
    {
            int fd;
            char *p;
            unsigned long mask = 2;

            fd = open("/dev/sg0", O_RDWR);
            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            /* Faultin pages */
            foo = p[0] + p[0x1000];
            mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
            return 0;
    }

    The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
    plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has its pages on
    the LRU, but whose gfp mask is not suitable for migration (see the
    mapping_gfp_mask() check in vma_migratable()). That looks like a bug
    to me.

    Let's filter out non-migratable VMAs at the start of
    queue_pages_test_walk() and go to queue_pages_pte_range() only if the
    MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

1 commit

  • MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
    mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
    it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
    VM_HUGETLB VMAs, and avoid useless overhead of minor faults.

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen