06 Mar, 2019

2 commits

  • Syzbot with KMSAN reports (excerpt):

    ==================================================================
    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x173/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
    __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
    mpol_rebind_policy mm/mempolicy.c:353 [inline]
    mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
    update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
    update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
    cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728

    ...

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
    kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
    kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
    kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
    mpol_new mm/mempolicy.c:276 [inline]
    do_mbind mm/mempolicy.c:1180 [inline]
    kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
    __do_sys_mbind mm/mempolicy.c:1354 [inline]

    As it's difficult to report where exactly the uninit value resides in
    the mempolicy object, we have to guess a bit. mm/mempolicy.c:353
    contains this part of mpol_rebind_policy():

    if (!mpol_store_user_nodemask(pol) &&
    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))

    "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
    ever see being uninitialized after leaving mpol_new(). So I'll guess
    it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
    but still part of statement starting on line 353.

    For w.cpuset_mems_allowed to be left uninitialized while the
    nodes_equal() check is still reachable for a mempolicy that had
    mpol_set_nodemask() called on it in do_mbind(), the only possibility
    seems to be an MPOL_PREFERRED policy with an empty set of nodes, i.e.
    the MPOL_LOCAL equivalent, with the MPOL_F_LOCAL flag set. Let's
    exclude such policies from the nodes_equal() check. Note that the
    uninitialized access should be benign anyway, as rebinding this kind of
    policy is always a no-op; therefore there is no actual need for stable
    inclusion.
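
    As a rough sketch (not necessarily the literal upstream diff), the fix
    described above amounts to also bailing out when MPOL_F_LOCAL is set,
    so w.cpuset_mems_allowed is never read for such policies:

    static void mpol_rebind_policy(struct mempolicy *pol,
                                   const nodemask_t *newmask)
    {
            if (!pol)
                    return;
            if (!mpol_store_user_nodemask(pol) &&
                !(pol->flags & MPOL_F_LOCAL) &&
                nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
                    return;
            /* otherwise fall through to the mode-specific rebind callback */
            mpol_ops[pol->mode].rebind(pol, newmask);
    }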

    Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
    Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Yisheng Xie
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is always
    better to use a macro. Replace these open encodings of an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove
    NUMA-related assumptions like 'invalid node' from various places,
    redirecting them to a common definition.
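
    For illustration, the substitution looks like this (NUMA_NO_NODE is
    defined as -1 in include/linux/numa.h):

    /* before */
    if (nid == -1)
            nid = numa_mem_id();

    /* after */
    if (nid == NUMA_NO_NODE)
            nid = numa_mem_id();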

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

22 Feb, 2019

1 commit

  • The system call, get_mempolicy() [1], passes an unsigned long *nodemask
    pointer and an unsigned long maxnode argument which specifies the length
    of the user's nodemask array in bits (which is rounded up). The manual
    page says that if the maxnode value is too small, get_mempolicy will
    return EINVAL but there is no system call to return this minimum value.
    To determine this value, some programs search /proc/<pid>/status for a
    line starting with "Mems_allowed:" and use the number of digits in the
    mask to determine the minimum value. A recent change to the way this line
    is formatted [2] causes these programs to compute a value less than
    MAX_NUMNODES so get_mempolicy() returns EINVAL.

    Change get_mempolicy(), the older compat version of get_mempolicy(), and
    the copy_nodes_to_user() function to use nr_node_ids instead of
    MAX_NUMNODES, thus preserving the de facto method of computing the minimum
    size for the nodemask array and the maxnode argument.
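
    As a hypothetical userspace illustration of the de facto method described
    above (this program and its error handling are illustrative, not taken
    from any particular tool; link with -lnuma for get_mempolicy()):

    #define _GNU_SOURCE
    #include <ctype.h>
    #include <numaif.h>     /* get_mempolicy(), from the libnuma headers */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[4096];
            unsigned long maxnode = 0;
            FILE *f = fopen("/proc/self/status", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    if (strncmp(line, "Mems_allowed:", 13))
                            continue;
                    /* each hex digit of the mask covers 4 node bits */
                    for (char *p = line + 13; *p; p++)
                            if (isxdigit((unsigned char)*p))
                                    maxnode += 4;
            }
            fclose(f);

            unsigned long nodemask[64] = { 0 };     /* 4096 bits, ample */
            int mode;

            if (maxnode > 8 * sizeof(nodemask))
                    maxnode = 8 * sizeof(nodemask);
            long ret = get_mempolicy(&mode, nodemask, maxnode, NULL, 0);

            printf("maxnode=%lu ret=%ld\n", maxnode, ret);
            return 0;
    }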

    [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
    [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com

    Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com
    Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()")
    Signed-off-by: Ralph Campbell
    Suggested-by: Alexander Duyck
    Cc: Waiman Long
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

09 Dec, 2018

1 commit

  • This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

    This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
    node-local hugepage allocations"). The movement of the thp allocation
    policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
    intended to only set __GFP_THISNODE for mempolicies that are not
    MPOL_BIND whereas the revert could set this regardless of mempolicy.

    While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
    and alloc_pages_vma() was racy, that has since been removed since the
    revert. What is left is the possibility to use __GFP_THISNODE in
    policy_node() when it is unexpected because the special handling for
    hugepages in alloc_pages_vma() was removed as part of the consolidation.

    Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
    different policy for hugepage allocations, which were allocated through
    alloc_hugepage_vma(). For hugepage allocations, if the allocating
    process's node is in the set of allowed nodes, allocate with
    __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
    __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to
    allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in
    mm/mempolicy.c which is functionally different behavior and removes the
    requirement to only allocate hugepages locally.

    So this commit does a full revert of 89c83fb539f9 instead of the partial
    revert that was done in 2f0799a0ffc0. The result is the same thp
    allocation policy for 4.20 that was in 4.19.

    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Dec, 2018

1 commit

  • This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
    MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
    consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

    By not setting __GFP_THISNODE, applications can allocate remote hugepages
    when the local node is fragmented or low on memory when either the thp
    defrag setting is "always" or the vma has been madvised with
    MADV_HUGEPAGE.

    Remote access to hugepages often has much higher latency than local pages
    of the native page size. On Haswell, ac5b2c18911f was shown to have a
    13.9% access regression after this commit for binaries that remap their
    text segment to be backed by transparent hugepages.

    The intent of ac5b2c18911f is to address an issue where a local node is
    low on memory or fragmented such that a hugepage cannot be allocated. In
    every scenario where this was described as a fix, there is abundant and
    unfragmented remote memory available to allocate from, even with a greater
    access latency.

    If remote memory is also low or fragmented, not setting __GFP_THISNODE was
    also measured on Haswell to have a 40% regression in allocation latency.

    Restore __GFP_THISNODE for thp allocations.

    Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Nov, 2018

2 commits

    THP allocation mode is quite complex and depends on the defrag mode.
    This complexity is currently largely hidden in
    alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
    __GFP_THISNODE) is however independent and currently placed in
    alloc_pages_vma. This adds an unnecessary branch to all vma based page
    allocation requests and makes the code needlessly more complex. Not to
    mention that e.g. shmem THP used to do the node reclaiming
    unconditionally regardless of the defrag mode until recently. That was
    not only unexpected but also hardly a good default behavior, and I
    strongly suspect it was just a side effect of the code sharing rather
    than a deliberate decision, which suggests that such a layering is
    wrong.

    Get rid of the thp special casing from alloc_pages_vma and move the
    logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
    resulting gfp mask only when the direct reclaim is not requested and
    when there is no explicit numa binding to preserve the current logic.

    Please note that there's also a slight difference wrt MPOL_BIND now. The
    previous code would avoid using __GFP_THISNODE if the local node was
    outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
    for all MPOL_BIND policies. So there's a difference that if local node
    is actually allowed by the bind policy's nodemask, previously
    __GFP_THISNODE would be added, but now it won't be. From the behavior
    POV this is still correct because the policy nodemask is used.
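
    A simplified sketch of the resulting logic (not the literal upstream
    code; pol is the vma's mempolicy and gfp_mask the defrag-derived mask):

    /* inside alloc_hugepage_direct_gfpmask() */
    if (!(gfp_mask & __GFP_DIRECT_RECLAIM) && pol->mode != MPOL_BIND)
            gfp_mask |= __GFP_THISNODE;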

    Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Williamson
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Stefan Priebe - Profihost AG
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    THP allocation might be really disruptive on a NUMA system when the
    local node is full or hard to reclaim. Stefan has posted an allocation
    stall report on a 4.12-based SLES kernel which suggests the same issue:

    kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
    kvm cpuset=/ mems_allowed=0-1
    CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
    Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
    Call Trace:
    dump_stack+0x5c/0x84
    warn_alloc+0xe0/0x180
    __alloc_pages_slowpath+0x820/0xc90
    __alloc_pages_nodemask+0x1cc/0x210
    alloc_pages_vma+0x1e5/0x280
    do_huge_pmd_wp_page+0x83f/0xf00
    __handle_mm_fault+0x93d/0x1060
    handle_mm_fault+0xc6/0x1b0
    __do_page_fault+0x230/0x430
    do_page_fault+0x2a/0x70
    page_fault+0x7b/0x80
    [...]
    Mem-Info:
    active_anon:126315487 inactive_anon:1612476 isolated_anon:5
    active_file:60183 inactive_file:245285 isolated_file:0
    unevictable:15657 dirty:286 writeback:1 unstable:0
    slab_reclaimable:75543 slab_unreclaimable:2509111
    mapped:81814 shmem:31764 pagetables:370616 bounce:0
    free:32294031 free_pcp:6233 free_cma:0
    Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

    The defrag mode is "madvise" and from the above report it is clear that
    the THP has been allocated for a MADV_HUGEPAGE vma.

    Andrea has identified that the main source of the problem is
    __GFP_THISNODE usage:

    : The problem is that direct compaction combined with the NUMA
    : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
    : hard the local node, instead of failing the allocation if there's no
    : THP available in the local node.
    :
    : Such logic was ok until __GFP_THISNODE was added to the THP allocation
    : path even with MPOL_DEFAULT.
    :
    : The idea behind the __GFP_THISNODE addition, is that it is better to
    : provide local memory in PAGE_SIZE units than to use remote NUMA THP
    : backed memory. That largely depends on the remote latency though, on
    : threadrippers for example the overhead is relatively low in my
    : experience.
    :
    : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
    : extremely slow qemu startup with vfio, if the VM is larger than the
    : size of one host NUMA node. This is because it will try very hard to
    : unsuccessfully swapout get_user_pages pinned pages as result of the
    : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
    : allocations and instead of trying to allocate THP on other nodes (it
    : would be even worse without vfio type1 GUP pins of course, except it'd
    : be swapping heavily instead).

    Fix this by removing __GFP_THISNODE for THP requests which are
    requesting direct reclaim. This effectively reverts 5265047ac301 on the
    grounds that the zone/node reclaim was known to be disruptive due to
    premature reclaim when there was memory free. While it made sense at
    the time for HPC workloads without NUMA awareness on rare machines, it
    was ultimately harmful in the majority of cases. The existing behaviour
    is similar, if not as widespread, as it applies to a corner case, but
    crucially it cannot be tuned around like zone_reclaim_mode can. The
    default behaviour should always be to cause the least harm for the
    common case.

    If there are specialised use cases out there that want zone_reclaim_mode
    in specific cases, then it can be built on top. Longterm we should
    consider a memory policy which allows for the node reclaim like behavior
    for the specific memory ranges which would allow a

    [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

    Mel said:

    : Both patches look correct to me but I'm responding to this one because
    : it's the fix. The change makes sense and moves further away from the
    : severe stalling behaviour we used to see with both THP and zone reclaim
    : mode.
    :
    : I put together a basic experiment with usemem configured to reference a
    : buffer multiple times that is 80% the size of main memory on a 2-socket
    : box with symmetric node sizes and defrag set to "always". The defrag
    : setting is not the default but it would be functionally similar to
    : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
    : reference the buffer multiple times and while it's not an interesting
    : workload, it would be expected to complete reasonably quickly as it fits
    : within memory. The results were;
    :
    :                               usemem
    :                          vanilla       noreclaim-v1
    : Amean  Elapsd-1   42.78 (  0.00%)   26.87 ( 37.18%)
    : Amean  Elapsd-3   27.55 (  0.00%)    7.44 ( 73.00%)
    : Amean  Elapsd-4    5.72 (  0.00%)    5.69 (  0.45%)
    :
    : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
    : threads referencing buffers 80% the size of memory. With the patches
    : applied, it's 37.18% faster for the single thread and 73% faster with two
    : threads. Note that 4 threads showing little difference does not indicate
    : the problem is related to thread counts. It's simply the case that 4
    : threads gets spread so their workload mostly fits in one node.
    :
    : The overall view from /proc/vmstats is more startling
    :
    :                        4.19.0-rc1      4.19.0-rc1
    :                           vanilla  noreclaim-v1r1
    : Minor Faults 35593425 708164
    : Major Faults 484088 36
    : Swap Ins 3772837 0
    : Swap Outs 3932295 0
    :
    : Massive amounts of swap in/out without the patch
    :
    : Direct pages scanned 6013214 0
    : Kswapd pages scanned 0 0
    : Kswapd pages reclaimed 0 0
    : Direct pages reclaimed 4033009 0
    :
    : Lots of reclaim activity without the patch
    :
    : Kswapd efficiency 100% 100%
    : Kswapd velocity 0.000 0.000
    : Direct efficiency 67% 100%
    : Direct velocity 11191.956 0.000
    :
    : Mostly from direct reclaim context as you'd expect without the patch.
    :
    : Page writes by reclaim 3932314.000 0.000
    : Page writes file 19 0
    : Page writes anon 3932295 0
    : Page reclaim immediate 42336 0
    :
    : Writes from reclaim context is never good but the patch eliminates it.
    :
    : We should never have default behaviour to thrash the system for such a
    : basic workload. If zone reclaim mode behaviour is ever desired but on a
    : single task instead of a global basis then the sensible option is to build
    : a mempolicy that enforces that behaviour.

    This was a severe regression compared to previous kernels that made
    important workloads unusable, and it started when __GFP_THISNODE was
    added to THP allocations under MADV_HUGEPAGE. It is not a significant
    risk to go back to the behavior before __GFP_THISNODE was added; it
    worked like that for years.

    This was simply an optimization to some lucky workloads that can fit in
    a single node, but it ended up breaking the VM for others that can't
    possibly fit in a single node, so going back is safe.

    [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
    Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
    Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Michal Hocko
    Reported-by: Stefan Priebe
    Debugged-by: Andrea Arcangeli
    Reported-by: Alex Williamson
    Reviewed-by: Mel Gorman
    Tested-by: Mel Gorman
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2018

2 commits

    match_string() returns the index of an array entry for a matching string,
    and can be used instead of an open-coded implementation.
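
    For example (a sketch; the array name below mirrors the policy_modes[]
    table in mm/mempolicy.c, but the surrounding code is illustrative):

    /* open-coded lookup */
    for (i = 0; i < ARRAY_SIZE(policy_modes); i++)
            if (!strcmp(str, policy_modes[i]))
                    break;

    /* with match_string(), which returns the index or -EINVAL */
    i = match_string(policy_modes, ARRAY_SIZE(policy_modes), str);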

    Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
    get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called get_user_pages in a way
    that would not wait for userfaults before failing, so the caller would
    hit a SIGBUS instead. Using get_user_pages_locked/unlocked instead
    allows get_mempolicy to let userfaults resolve the fault and fill the
    hole before grabbing the node id of the page.

    If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
    address inside an area managed by uffd and there is no page at that
    address, the page allocation from within get_mempolicy() will fail
    because get_user_pages() does not allow for page fault retry required
    for uffd; the user will get SIGBUS.

    With this patch, the page fault will be resolved by the uffd and the
    get_mempolicy() will continue normally.
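
    A sketch of the lookup path with the retry-capable GUP variant (based on
    the description above; details such as error handling are simplified):

    static int lookup_node(struct mm_struct *mm, unsigned long addr)
    {
            struct page *p;
            int err, locked = 1;

            err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
            if (err > 0) {
                    err = page_to_nid(p);
                    put_page(p);
            }
            /* the fault retry may have dropped mmap_sem for us */
            if (locked)
                    up_read(&mm->mmap_sem);
            return err;
    }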

    Background:

    Per code review, previously the syscall would have returned -EFAULT
    (vm_fault_to_errno); now it will block and wait for a userfault (if it
    is woken before the fault is resolved it will still return -EFAULT).

    This way get_mempolicy will give a chance to an "unaware" app to be
    compliant with userfaults.

    The reason this change is visible is that becoming "userfault
    compliant" cannot regress anything: all other syscalls, including
    read(2)/write(2), had to become "userfault compliant" a long time ago
    (that's one of the things userfaultfd can do that PROT_NONE and
    trapping segfaults can't).

    So this is just one more syscall that becomes "userfault compliant",
    like all the other major ones already are.

    This has been happening with a virtio-bridge dpdk process which simply
    called get_mempolicy on the guest space after live migration, but
    before the memory had a chance to be migrated to the destination.

    I didn't run an strace to be able to show the -EFAULT going away, but I
    have confirmation that the below debug-aid information (only visible
    with CONFIG_DEBUG_VM=y) goes away with the patch:

    [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
    [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
    [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
    [20116.371466] Call Trace:
    [20116.371473] dump_stack+0x5c/0x80
    [20116.371476] handle_userfault.cold.37+0x1b/0x22
    [20116.371479] ? remove_wait_queue+0x20/0x60
    [20116.371481] ? poll_freewait+0x45/0xa0
    [20116.371483] ? do_sys_poll+0x31c/0x520
    [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50
    [20116.371488] shmem_getpage_gfp+0xce7/0xe50
    [20116.371491] ? page_add_file_rmap+0x1a/0x2c0
    [20116.371493] shmem_fault+0x78/0x1e0
    [20116.371495] ? filemap_map_pages+0x3a1/0x450
    [20116.371498] __do_fault+0x1f/0xc0
    [20116.371500] __handle_mm_fault+0xe2e/0x12f0
    [20116.371502] handle_mm_fault+0xda/0x200
    [20116.371504] __get_user_pages+0x238/0x790
    [20116.371506] get_user_pages+0x3e/0x50
    [20116.371510] kernel_get_mempolicy+0x40b/0x700
    [20116.371512] ? vfs_write+0x170/0x1a0
    [20116.371515] __x64_sys_get_mempolicy+0x21/0x30
    [20116.371517] do_syscall_64+0x5b/0x160
    [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The above harmless debug message (not a kernel crash, just a
    dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
    and improve kernel spots that may have to become "userfaultfd
    compliant" like this one (without having to run an strace and search
    for syscall misbehavior). Spots like the above are closer to a kernel
    bug for the non-cooperative usages that Mike focuses on than for the
    dpdk qemu-cooperative usages that reproduced it, but it's still nicer
    to get this fixed for dpdk too.

    The only part of the patch that gave me pause is the implementation
    detail around mpol_get, but it looks like it is safe no matter which
    kind of mempolicy structure is involved (the default static policy also
    starts at refcount 1, so it will go to 2 and back to 1 without
    everything crashing at 0).

    [rppt@linux.vnet.ibm.com: changelog addition]
    http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
    Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Maxime Coquelin
    Tested-by: Dr. David Alan Gilbert
    Reviewed-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

23 Aug, 2018

2 commits

    zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
    have inline functions to access this field in order to avoid ifdefs in C
    files.
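
    A minimal sketch of such accessors (assuming helpers along the lines of
    zone_to_nid()/zone_set_nid(); the exact upstream definitions may differ):

    #ifdef CONFIG_NUMA
    static inline int zone_to_nid(struct zone *zone)
    {
            return zone->node;
    }

    static inline void zone_set_nid(struct zone *zone, int nid)
    {
            zone->node = nid;
    }
    #else
    static inline int zone_to_nid(struct zone *zone)
    {
            return 0;
    }

    static inline void zone_set_nid(struct zone *zone, int nid) { }
    #endif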

    Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
    Initialize VMAs via vma_init() rather than in vm_area_alloc(), to
    ensure that the various oddball stack-based vmas are in a good state.
    Some of the callers were zeroing them out, others were not.

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Jul, 2018

1 commit

  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Apr, 2018

3 commits

    THP migration is hacked into the generic migration code with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and, if that is not the
    case, it allocates a simple page to migrate. unmap_and_move then fixes
    that up by splitting the THP into small pages while moving the head
    page to the newly allocated order-0 page. Remaining pages are moved to
    the LRU list by split_huge_page. The same happens if the THP allocation
    fails. This is really ugly and error prone [1].

    I also believe that split_huge_page to the LRU lists is inherently
    wrong because the tail pages are not migrated. Some callers will just
    work around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken, though. E.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported, and then fall back to single page migration; but they do not
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer where it actually belongs. We
    simply split the THP page into the existing list if unmap_and_move fails
    with ENOMEM and retry. So we will _always_ migrate all THP subpages and
    specific migrate_pages users do not have to deal with this case in a
    special way.
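
    Sketched in simplified form (not the literal retry loop in
    migrate_pages(); names follow the existing migration code):

    rc = unmap_and_move(get_new_page, put_new_page, private,
                        page, pass > 2, mode, reason);
    if (rc == -ENOMEM && PageTransHuge(page)) {
            lock_page(page);
            rc = split_huge_page_to_list(page, from);
            unlock_page(page);
            if (!rc) {
                    /* the subpages are now queued on 'from'; retry them */
                    list_safe_reset_next(page, page2, lru);
                    goto retry;
            }
    }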

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey the node_id (or, respectively, a
    migration error) up to the move_pages code (do_move_page_to_node_array).
    The error status never made it into the final status field, and we now
    have a better way to communicate the node id to the status field. All
    other allocation callbacks simply ignored the argument, so we can
    finally drop it.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration code with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and, if that is not the
    case, it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the
    head page to the newly allocated order-0 page. Remaining pages are
    moved to the LRU list by split_huge_page. The same happens if the THP
    allocation fails. This is really ugly and error prone [2].

    I also believe that split_huge_page to the LRU lists is inherently
    wrong because the tail pages are not migrated. Some callers will just
    work around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken, though. E.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported, and then fall back to single page migration; but they do not
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages so it should be immune.

    The first patch reworks do_pages_move, which relies on a very ugly
    calling convention where the return status is pushed to the migration
    path via a private pointer. It uses pre-allocated, fixed-size batching
    to achieve that. We simply cannot do the same if a THP is to be split
    during the migration path, which is done in patch 3. Patch 2 is a
    follow-up cleanup which removes the mentioned return-status calling
    convention ugliness.

    On a side note:

    There are some semantic issues I have encountered on the way while
    working on patch 1, but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    more likely once we isolate the head from the LRU list). Hugetlb
    reports EACCES on tail pages. Some errors are reported via the status
    parameter but migration failures are not, even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep the
    semantics unchanged.

    Then there is a relatively minor thing that the page isolation might
    fail because of pages not being on the LRU - e.g. because they are
    sitting on the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user-defined memory (an array of
    addresses) to user-defined numa nodes (an array of nodes, one for each
    address). The user-provided status array then contains the resulting
    numa node for each address, or an error. The semantics of this function
    are a little bit confusing because only some errors are reported back.
    Notably, a migrate_pages error is only reported via the return value.
    This patch doesn't try to address these semantic nuances but rather
    changes the underlying implementation.

    Currently we are processing user input (which can be really large) in
    batches which are stored to a temporarily allocated page. Each address
    is resolved to its struct page and stored to page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation and why change it?
    Apart from being quite ugly, it also doesn't cope with unexpected pages
    showing up on the migration list inside the migrate_pages path. That
    doesn't happen currently, but a follow-up patch would like to make the
    thp migration code clearer, and that would need to split a THP into the
    list in some cases.

    How does the new implementation work? Well, instead of batching into a
    fixed size array we simply batch all pages that should be migrated to
    the same node and isolate all of them into a linked list which doesn't
    require any additional storage. This should work reasonably well
    because page migration usually migrates larger ranges of memory to a
    specific node. So the common case should work equally well as the
    current implementation. Even if somebody constructs an input where the
    target numa nodes would be interleaved we shouldn't see a large
    performance impact because page migration alone doesn't really benefit
    from batching. mmap_sem batching for the lookup is quite questionable
    and isolate_lru_page which would benefit from batching is not using it
    even in the current implementation.
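
    A condensed sketch of the batching scheme described above (helper names
    follow the add_page_for_migration()/do_move_pages_to_node() helpers this
    rework introduces; error handling and user-copying are omitted):

    LIST_HEAD(pagelist);
    int current_node = NUMA_NO_NODE;

    for (i = 0; i < nr_pages; i++) {
            /* addr and node are read from the user-supplied arrays */
            if (current_node == NUMA_NO_NODE) {
                    current_node = node;
            } else if (node != current_node) {
                    /* target changed: flush the batch collected so far */
                    do_move_pages_to_node(mm, &pagelist, current_node);
                    current_node = node;
            }
            /* isolate the page backing addr onto the pagelist */
            add_page_for_migration(mm, addr, current_node, &pagelist,
                                   flags & MPOL_MF_MOVE_ALL);
    }
    if (!list_empty(&pagelist))
            do_move_pages_to_node(mm, &pagelist, current_node);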

    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Apr, 2018

4 commits

  • Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
    "System calls are interaction points between userspace and the kernel.
    Therefore, system call functions such as sys_xyzzy() or
    compat_sys_xyzzy() should only be called from userspace via the
    syscall table, but not from elsewhere in the kernel.

    At least on 64-bit x86, it will likely be a hard requirement from
    v4.17 onwards to not call system call functions in the kernel: It is
    better to use a different calling convention for system calls
    there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
    which then hands processing over to the actual syscall function. This
    means that only those parameters which are actually needed for a
    specific syscall are passed on during syscall entry, instead of
    filling in six CPU registers with random user space content all the
    time (which may cause serious trouble down the call chain). Those
    x86-specific patches will be pushed through the x86 tree in the near
    future.

    Moreover, rules on how data may be accessed may differ between kernel
    data and user data. This is another reason why calling sys_xyzzy() is
    generally a bad idea, and -- at most -- acceptable in arch-specific
    code.

    This patchset removes all in-kernel calls to syscall functions in the
    kernel with the exception of arch/. On top of this, it cleans up the
    three places where many syscalls are referenced or prototyped, namely
    kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

    * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
    bpf: whitelist all syscalls for error injection
    kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
    kernel/sys_ni: sort cond_syscall() entries
    syscalls/x86: auto-create compat_sys_*() prototypes
    syscalls: sort syscall prototypes in include/linux/compat.h
    net: remove compat_sys_*() prototypes from net/compat.h
    syscalls: sort syscall prototypes in include/linux/syscalls.h
    kexec: move sys_kexec_load() prototype to syscalls.h
    x86/sigreturn: use SYSCALL_DEFINE0
    x86: fix sys_sigreturn() return type to be long, not unsigned long
    x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
    mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
    mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
    mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
    fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
    fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
    fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
    fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
    kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
    kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
    ...

    Linus Torvalds
     
  • Using the mm-internal kernel_[sg]et_mempolicy() helper allows us to get
    rid of the mm-internal calls to the sys_[sg]et_mempolicy() syscalls.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_mbind() helper allows us to get rid of the
    mm-internal call to the sys_mbind() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
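
    The pattern is roughly the following (a sketch based on the description;
    the actual prototypes in mm/mempolicy.c may differ in detail):

    static long kernel_mbind(unsigned long start, unsigned long len,
                             unsigned long mode,
                             const unsigned long __user *nmask,
                             unsigned long maxnode, unsigned int flags)
    {
            /* former sys_mbind() body lives here */
    }

    SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
                    unsigned long, mode, const unsigned long __user *, nmask,
                    unsigned long, maxnode, unsigned int, flags)
    {
            return kernel_mbind(start, len, mode, nmask, maxnode, flags);
    }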

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_migrate_pages() to mm/mempolicy.c and make it call a newly
    introduced helper -- kernel_migrate_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

23 Mar, 2018

1 commit

  • Alexander reported a use of uninitialized memory in __mpol_equal(),
    which is caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
    it uses numa_node_id() instead of preferred_node; however,
    __mpol_equal() compares preferred_node without checking whether
    MPOL_F_LOCAL is set or not.

    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

01 Feb, 2018

5 commits

    Dan Carpenter has noticed that the mbind migration callback (new_page)
    can get a NULL vma pointer and choke on it inside alloc_huge_page_vma,
    which relies on the VMA to get the hstate. We used to BUG_ON this case,
    but the BUG_ON has been removed recently by "hugetlb, mempolicy: fix
    the mbind hugetlb migration".

    The proper way to handle this is to get the hstate from the migrated
    page and rely on huge_node (resp. get_vma_policy) to do the right thing
    with a NULL VMA. We are currently falling back to the default mempolicy
    in that case, which is in line with what the THP path is doing here.

    Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Dan Carpenter
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
    pages. alloc_huge_page_noerr uses alloc_huge_page, which is a
    high-level allocation function that has to take care of reserves,
    overcommit and hugetlb cgroup accounting. None of that is really
    required for page migration because the new page is only temporary and
    will either replace the original page or be dropped. This is
    essentially the same as for other migration call paths, so there
    shouldn't be any reason to handle mbind in a special way.

    The current implementation is even suboptimal because the migration
    might fail just because the hugetlb cgroup limit is reached, or the
    overcommit is saturated.

    Fix this by making mbind like other hugetlb migration paths. Add a new
    migration helper alloc_huge_page_vma as a wrapper around
    alloc_huge_page_nodemask with additional mempolicy handling.

    alloc_huge_page_noerr has no more users and it can go.

    Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    As per the migrate_pages(2) man page, errno should be set to EINVAL
    when none of the node IDs specified by new_nodes are on-line and
    allowed by the process's current cpuset context, or none of the
    specified nodes contain memory. However, when testing the following
    case:

    new_nodes = 0;
    old_nodes = 0xf;
    ret = migrate_pages(pid, old_nodes, new_nodes, MAX);

    ret will be 0 and no errno is set. As new_nodes is empty, we should
    expect EINVAL as documented.

    To fix cases like the above, this patch checks whether the intersection
    (AND) of the target nodes and the current task_nodes is empty, and then
    whether the intersection with node_states[N_MEMORY] is empty, as
    sketched below.
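
    In kernel terms the added check is along these lines (a sketch based on
    the description above rather than the exact diff):

    /* in kernel_migrate_pages(), after copying in the node masks */
    task_nodes = cpuset_mems_allowed(current);
    nodes_and(*new, *new, task_nodes);
    if (nodes_empty(*new))
            goto out_put;           /* return -EINVAL */

    nodes_and(*new, *new, node_states[N_MEMORY]);
    if (nodes_empty(*new))
            goto out_put;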

    Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Chris Salls
    Cc: Christopher Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: Tan Xiaojun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
    As Xiaojun reported, the LTP migrate_pages01 test will fail on an arm64
    system which has 4 nodes [0...3] that all have memory, with
    CONFIG_NODES_SHIFT=2:

    migrate_pages01 0 TINFO : test_invalid_nodes
    migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
    migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly

    In this case the test_invalid_nodes subtest of migrate_pages01 calls
    SYSC_migrate_pages as:

    migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0

    The new nodes argument specifies one or more node IDs that are greater
    than the maximum supported node ID; however, errno is not set to EINVAL
    as expected.

    As the man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
    mention, when the nodemask specifies one or more node IDs that are
    greater than the maximum supported node ID, errno should be set to
    EINVAL. However, get_nodes only checks whether the bits in
    [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) are zero or not,
    and leaves [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES))
    unchecked.

    This patch checks the bits in [MAX_NUMNODES, maxnode) in get_nodes so
    that migrate_pages sets errno to EINVAL when the nodemask specifies one
    or more node IDs greater than the maximum supported node ID, as the man
    pages describe.
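
    Semantically the added check amounts to the following (a simplified
    sketch of the intent, not the upstream bit-twiddling in get_nodes()):

    /* any bit the user set at position MAX_NUMNODES or above, but below
     * maxnode, must make the call fail with -EINVAL */
    for (unsigned long k = MAX_NUMNODES; k < maxnode; k++) {
            unsigned long t;

            if (get_user(t, nmask + k / BITS_PER_LONG))
                    return -EFAULT;
            if (t & (1UL << (k % BITS_PER_LONG)))
                    return -EINVAL;
    }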

    [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
    [2] http://man7.org/linux/man-pages/man2/mbind.2.html
    [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html

    Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reported-by: Tan Xiaojun
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Chris Salls
    Cc: Christopher Lameter
    Cc: David Rientjes
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
    We have already checked whether maxnode exceeds a page worth of bits, by:

    maxnode > PAGE_SIZE*BITS_PER_BYTE

    So there is no need to check it once more.

    Link: http://lkml.kernel.org/r/1510882624-44342-2-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Ingo Molnar
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Chris Salls
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Tan Xiaojun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

16 Nov, 2017

2 commits

    This is the second step, which introduces a tunable interface that
    allows the numa stats to be made configurable, in order to optimize
    zone_statistics(), as suggested by Dave Hansen and Ying Huang.

    =========================================================================

    When page allocation performance becomes a bottleneck and you can
    tolerate some possible tool breakage and decreased numa counter
    precision, you can do:

    echo 0 > /proc/sys/vm/numa_stat

    In this case, numa counter update is ignored. We can see about
    *4.8%*(185->176) drop of cpu cycles per single page allocation and
    reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
    drop of cpu cycles per single page allocation and reclaim on Jesper's
    page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
    (88 threads, 126G memory).

    Benchmark link provided by Jesper D Brouer (increase loop times to
    10000000):

    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    =========================================================================

    When page allocation performance is not a bottleneck and you want all
    tooling to work, you can do:

    echo 1 > /proc/sys/vm/numa_stat

    This is the system default setting.

    Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
    for comments to help improve the original patch.

    [keescook@chromium.org: make sure mutex is a global static]
    Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
    Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Signed-off-by: Kees Cook
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Luis R . Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Sebastian Andrzej Siewior
    Cc: Andrey Ryabinin
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Commit 197e7e521384 ("Sanitize 'move_pages()' permission checks") fixed
    a security issue I reported in the move_pages syscall, and made it so
    that you can't act on set-uid processes unless you have the
    CAP_SYS_PTRACE capability.

    Unify the access check logic of migrate_pages to match the new behavior
    of move_pages. We discussed this a bit in the security@ list and
    thought it'd be good for consistency even though there's no evident
    security impact. The NUMA node access checks are left intact and
    require CAP_SYS_NICE as before.
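
    The unified check boils down to something like the following sketch
    (mirroring the move_pages() fix as described above; the exact placement
    in kernel_migrate_pages() may differ):

    /* instead of comparing uids of current and the target task */
    if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
            rcu_read_unlock();
            err = -EPERM;
            goto out;
    }
    rcu_read_unlock();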

    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1710011830320.6333@lakka.kapsi.fi
    Signed-off-by: Otto Ebeling
    Acked-by: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Willy Tarreau
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Otto Ebeling
     

14 Oct, 2017

1 commit

  • Commit 3a321d2a3dde ("mm: change the call sites of numa statistics
    items") separated NUMA counters from zone counters, but the
    NUMA_INTERLEAVE_HIT call site wasn't updated to use the new interface.
    So alloc_page_interleave() actually increments NR_ZONE_INACTIVE_FILE
    instead of NUMA_INTERLEAVE_HIT.

    Fix this by using __inc_numa_state() interface to increment
    NUMA_INTERLEAVE_HIT.
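
    The corrected accounting in alloc_page_interleave() looks roughly like
    this (a sketch based on the description; __inc_numa_state() is a per-cpu
    counter update and needs preemption disabled):

    if (page && page_to_nid(page) == nid) {
            preempt_disable();
            __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
            preempt_enable();
    }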

    Link: http://lkml.kernel.org/r/20171003191003.8573-1-aryabinin@virtuozzo.com
    Fixes: 3a321d2a3dde ("mm: change the call sites of numa statistics items")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Cc: Kemi Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

09 Sep, 2017

4 commits

  • VMA and its address bounds checks are too late in this function. They
    must have been verified earlier in the page fault sequence. Hence just
    remove them.

    Link: http://lkml.kernel.org/r/20170901130137.7617-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • While reading the code I found that offset_il_node() has a vm_area_struct
    pointer parameter which is unused.

    Link: http://lkml.kernel.org/r/1502899755-23146-1-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • This patch enables thp migration for mbind(2) and migrate_pages(2).

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Patch series "mm: page migration enhancement for thp", v9.

    Motivations:

    1. THP migration becomes important in the upcoming heterogeneous memory
    systems. As David Nellans from NVIDIA pointed out from other threads
    (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349227.html),
    future GPUs or other accelerators will have their memory managed by
    operating systems. Moving data into and out of these memory nodes
    efficiently is critical to applications that use GPUs or other
    accelerators. Existing page migration only supports base pages, which
    has a very low memory bandwidth utilization. My experiments (see
    below) show THP migration can migrate pages more efficiently.

    2. Base page migration vs THP migration throughput.

    Here are cross-socket page migration results from calling
    move_pages() syscall:

    In x86_64, a Intel two-socket E5-2640v3 box,
    - single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
    - single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
    - 512 4KB base page migration takes 1987.38 us, using 0.98 GB/s BW.

    In ppc64, a two-socket Power8 box,
    - single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
    - single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
    - 256 64KB base page migration takes 2543.65 us, using 6.14 GB/s BW.

    THP migration can give us 3x and 1.15x throughput over base page
    migration in x86_64 and ppc64 respectively.

    You can test it out by using the code here:
    https://github.com/x-y-z/thp-migration-bench

    3. Existing page migration splits THP before migration and cannot
    guarantee the migrated pages are still contiguous. Contiguity is
    always what GPUs and accelerators look for. Without THP migration,
    khugepaged needs to do extra work to reassemble the migrated pages
    back to THPs.

    This patch (of 10):

    Introduce a separate check routine related to MPOL_MF_INVERT flag. This
    patch just does cleanup, no behavioral change.

    Link: http://lkml.kernel.org/r/20170717193955.20207-2-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Aug, 2017

1 commit

    I hit a use-after-free issue when executing trinity and reproduced it
    with KASAN enabled. The related call trace is as follows.

    BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
    Read of size 2 by task syz-executor1/798

    INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
    __slab_alloc+0x768/0x970
    kmem_cache_alloc+0x2e7/0x450
    mpol_new.part.2+0x74/0x160
    mpol_new+0x66/0x80
    SyS_mbind+0x267/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
    __slab_free+0x495/0x8e0
    kmem_cache_free+0x2f3/0x4c0
    __mpol_put+0x2b/0x40
    SyS_mbind+0x383/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
    INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

    Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
    Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........
    Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    Memory state around the buggy address:
    ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

    A !shared memory policy is not protected against parallel removal by
    another thread, which is normally prevented by the mmap_sem.
    do_get_mempolicy, however, drops the lock midway while we can still
    access the policy later.

    The early, premature up_read is a historical artifact from the times
    when put_user was called in this path (see
    https://lwn.net/Articles/124754/), but that has been gone since
    8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory
    policy layer."). With the current mempolicy refcount model, the
    premature release is what makes the use-after-free above possible.

    Fix the issue by removing the premature release.

    Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: [2.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

13 Jul, 2017

1 commit

    Page migration (for memory hotplug, soft_offline_page or mbind) needs
    to allocate new memory. This can trigger the oom killer if the target
    memory is depleted. Although quite unlikely, it is still possible,
    especially for memory hotplug (offlining of memory).

    Up to now we didn't really have reasonable means to back off.
    __GFP_NORETRY can fail just too easily, and __GFP_THISNODE sticks to a
    single node, which is not suitable for all callers.

    But now that we have __GFP_RETRY_MAYFAIL we should use it. It is
    preferable to fail the migration than disrupt the system by killing some
    processes.
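
    For a migration target allocation this amounts to a gfp mask along these
    lines (an illustrative sketch; the exact base flags vary per caller):

    gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;

    newpage = __alloc_pages_nodemask(gfp_mask, order, preferred_nid, nodemask);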

    Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

4 commits

  • Two wrappers of __alloc_pages_nodemask() are checking
    task->mems_allowed_seq themselves to retry allocation that has raced
    with a cpuset update.

    This has been shown to be ineffective in preventing premature OOM's
    which can happen in __alloc_pages_slowpath() long before it returns back
    to the wrappers to detect the race at that level.

    Previous patches have made __alloc_pages_slowpath() more robust, so we
    can now simply remove the seqlock checking in the wrappers to prevent
    further wrong impression that it can actually help.

    Link: http://lkml.kernel.org/r/20170517081140.30654-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has introduced a two-step protocol when
    rebinding task's mempolicy due to cpuset update, in order to avoid a
    parallel allocation seeing an empty effective nodemask and failing.

    Later, commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory
    barrier related damage v3") introduced a seqlock protection and removed
    the synchronization point between the two update steps. At that point
    (or perhaps later), the two-step rebinding became unnecessary.

    Currently it only makes sure that the update first adds new nodes in
    step 1 and then removes nodes in step 2. Without memory barriers the
    effects are questionable, and even then this cannot prevent a parallel
    zonelist iteration checking the nodemask at each step to observe all
    nodes as unusable for allocation. We now fully rely on the seqlock to
    prevent premature OOMs and allocation failures.

    We can thus remove the two-step update parts and simplify the code.

    Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The main allocator function __alloc_pages_nodemask() takes a zonelist
    pointer as one of its parameters. All of its callers directly or
    indirectly obtain the zonelist via node_zonelist() using a preferred
    node id and gfp_mask. We can make the code a bit simpler by doing the
    zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
    id instead (gfp_mask is already another parameter).

    There are some code size benefits thanks to removal of inlined
    node_zonelist():

    bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

    This will also make things simpler if we proceed with converting cpusets
    to zonelists.
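
    The interface change itself, sketched as the prototypes before and after:

    /* before */
    struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                        struct zonelist *zonelist,
                                        nodemask_t *nodemask);

    /* after: callers pass a preferred node id, and the zonelist lookup via
     * node_zonelist(preferred_nid, gfp_mask) happens inside */
    struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                        int preferred_nid,
                                        nodemask_t *nodemask);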

    Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Dimitri Sivanich
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The task->il_next variable stores the next allocation node id for task's
    MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
    bind mempolicies due to changing cpuset mems. Currently it also tries
    to make sure that current->il_next is valid within the updated nodemask.
    This is bogus, because 1) we are updating potentially any task's
    mempolicy, not just current, and 2) we might be updating a per-vma
    mempolicy, not task one.

    The interleave_nodes() function that uses il_next can cope fine with the
    value not being within the currently allowed nodes, so this hasn't
    manifested as an actual issue.

    We can remove the need for updating il_next completely by changing it to
    il_prev and store the node id of the previous interleave allocation
    instead of the next id. Then interleave_nodes() can calculate the next
    id using the current nodemask and also store it as il_prev, except when
    querying the next node via do_get_mempolicy().
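
    A sketch of the resulting helper, close to the description above (the
    policy's nodemask field is written here simply as policy->nodes):

    static unsigned int interleave_nodes(struct mempolicy *policy)
    {
            unsigned int nid;

            nid = next_node_in(current->il_prev, policy->nodes);
            if (nid < MAX_NUMNODES)
                    current->il_prev = nid;
            return nid;
    }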

    Link: http://lkml.kernel.org/r/20170517081140.30654-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: David Rientjes
    Cc: Dimitri Sivanich
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

09 Apr, 2017

1 commit


02 Mar, 2017

1 commit