03 Nov, 2020

1 commit

  • When the flags passed to queue_pages_pte_range don't have the
    MPOL_MF_MOVE or MPOL_MF_MOVE_ALL bits set, the code breaks out of the
    pte loop early, and passing the original pte - 1 to pte_unmap_unlock
    is not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns with EIO when encountering such a
    page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return
    -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte
    in the range results in pte_unmap_unlock on an underflow pte. This can
    lead to lockups later on when somebody tries to take the pte lock or the
    page_table_lock again.

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
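
    A minimal sketch of the fix pattern described above (a hedged
    reconstruction, not the actual diff): remember the originally mapped
    pte and unlock that one, instead of computing "pte - 1" after a
    possible early break.

        pte_t *pte, *mapped_pte;

        mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                /* MPOL_MF_STRICT alone: no MOVE bits, so bail out early */
                if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)))
                        break;
                /* ... otherwise isolate the page for migration ... */
        }
        /* was pte_unmap_unlock(pte - 1, ptl): underflows on an early break */
        pte_unmap_unlock(mapped_pte, ptl);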
     

14 Oct, 2020

2 commits

  • No one uses this macro anymore.

    Also fix code style of policy_node().

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921021401.84508-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • It is not necessary to hold the lock on current when setting the
    nodemask of a new policy.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921040416.86185-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage, and we should be
    consistent across the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

6 commits

  • …ernel/git/abelloni/linux") into android-mainline

    Steps on the way to 5.9-rc1.

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Iceded779988ff472863b7e1c54e22a9fa6383a30

    Greg Kroah-Hartman
     
  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
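
    The callback in question is alloc_migration_target(), driven by a
    struct migration_target_control. A hedged sketch of how the mbind()
    migration path can use it (field and flag choices are illustrative):

        struct migration_target_control mtc = {
                .nid = dest,
                .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
        };

        err = migrate_pages(&pagelist, alloc_migration_target, NULL,
                            (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);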
     
  • There is no difference between the two migration callback functions,
    alloc_huge_page_node() and alloc_huge_page_nodemask(), except for the
    __GFP_THISNODE handling. It's redundant to have two almost identical
    functions just to handle this flag, so this patch removes one of them by
    introducing a new argument, gfp_mask, to alloc_huge_page_nodemask().

    After introducing the gfp_mask argument, it's the caller's job to
    provide the correct gfp_mask, so every call site of
    alloc_huge_page_nodemask() is changed to provide one.

    Note that it's safe to remove the node id check in alloc_huge_page_node()
    since no caller passes NUMA_NO_NODE as a node id.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
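
    A sketch of the resulting interface and a call site (hedged; the
    "thisnode" flag is illustrative). The caller derives gfp_mask from the
    hstate and adds __GFP_THISNODE only when a node-local allocation is
    wanted:

        struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
                                              nodemask_t *nmask, gfp_t gfp_mask);

        /* caller side */
        gfp_t gfp_mask = htlb_alloc_mask(h);

        if (thisnode)
                gfp_mask |= __GFP_THISNODE;
        page = alloc_huge_page_nodemask(h, nid, nmask, gfp_mask);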
     
  • The previous implementation called untagged_addr() before the error
    check; when the error check fails and returns EINVAL, the
    untagged_addr() call was just useless work.

    Signed-off-by: Wenchao Hao
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
    Signed-off-by: Linus Torvalds

    Wenchao Hao
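
    The reorder, roughly (a sketch of kernel_get_mempolicy(); the -EINVAL
    check now fails fast, before any untagging work is done):

        if (nmask != NULL && maxnode < nr_node_ids)
                return -EINVAL;             /* fail fast, no wasted work */

        addr = untagged_addr(addr);         /* only on the surviving path */
        err = do_get_mempolicy(&pval, &nodes, addr, flags);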
     
  • Fix W=1 compile warnings (invalid kerneldoc):

    mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
    mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
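
    The warnings mean the kerneldoc comment documented a parameter @nid
    while the function signature names it 'node'; renaming the documented
    parameter silences both. A sketch of the corrected comment:

        /**
         * numa_map_to_online_node - Find closest online node
         * @node: Node id to start the search
         *
         * Lookup node id for online node closest to the requested node id.
         */
        int numa_map_to_online_node(int node)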
     
  • In the reservation routine, we only check whether the cpuset meets the
    memory allocation requirements and ignore the mempolicy in the MPOL_BIND
    case. So an mmap of hugetlb memory may succeed while the subsequent
    memory allocation fails due to mempolicy restrictions, and the process
    receives a SIGBUS signal. This can be reproduced with the following
    steps.

    1) Compile the test case.
    cd tools/testing/selftests/vm/
    gcc map_hugetlb.c -o map_hugetlb

    2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
    system. Each node will pre-allocate one huge page.
    echo 2 > /proc/sys/vm/nr_hugepages

    3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
    numactl --membind=0 ./map_hugetlb 4

    With this patch applied, the mmap fails in step 3) with
    "mmap: Cannot allocate memory".

    [akpm@linux-foundation.org: include sched.h for `current']

    Reported-by: Jianchao Guo
    Suggested-by: Michal Hocko
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
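
    A hedged sketch of the idea: when counting huge pages available for a
    reservation, intersect the cpuset-allowed nodes with the task's
    MPOL_BIND nodemask (helper names follow this series, per my reading of
    it):

        static unsigned int allowed_mems_nr(struct hstate *h)
        {
                int node;
                unsigned int nr = 0;
                nodemask_t *mpol_allowed;
                unsigned int *array = h->free_huge_pages_node;
                gfp_t gfp_mask = htlb_alloc_mask(h);

                mpol_allowed = policy_nodemask_current(gfp_mask);

                for_each_node_mask(node, cpuset_current_mems_allowed) {
                        /* count a node only if the mempolicy allows it too */
                        if (!mpol_allowed || node_isset(node, *mpol_allowed))
                                nr += array[node];
                }
                return nr;
        }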
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
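
    What the script changes, in miniature (the macro expanded to a
    self-assignment, so removing it leaves a plain declaration):

        /* before: expands to "int ret = ret;" via
         * #define uninitialized_var(x) x = x */
        int uninitialized_var(ret);

        /* after the script */
        int ret;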
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
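
    A before/after example of the conversion (both APIs are real; this is
    the most common pair):

        /* before */
        down_read(&mm->mmap_sem);
        /* ... walk or fault ... */
        up_read(&mm->mmap_sem);

        /* after */
        mmap_read_lock(mm);
        /* ... walk or fault ... */
        mmap_read_unlock(mm);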
     

04 Jun, 2020

1 commit

  • ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
    added a special casing for the 0 return value because that was a possible
    gup return value when interrupted by a fatal signal. This has since been
    fixed by ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return
    -EINTR for fatal signal"), so ba841078cd05 can be reverted.

    This patch however doesn't go all the way to revert it, because the check
    for 0 is wrong and confusing here. Firstly, it is inherently unsafe to
    access the page when get_user_pages_locked returns 0 (aka no page
    returned).

    Fortunately, this will not happen because get_user_pages_locked will not
    return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified, which is not
    the case here. Document this potential error code in the gup code while
    we are at it.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Peter Xu
    Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
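
    A hedged sketch of lookup_node() after this cleanup: a positive return
    means exactly one page was pinned, and fatal-signal interruption now
    surfaces as -EINTR rather than 0, so no special 0 handling is needed:

        static int lookup_node(struct mm_struct *mm, unsigned long addr)
        {
                struct page *p = NULL;
                int err, locked = 1;

                err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
                if (err > 0) {                  /* one page pinned */
                        err = page_to_nid(p);
                        put_page(p);
                }
                if (locked)
                        mmap_read_unlock(mm);   /* caller expects the lock dropped */
                return err;                     /* may be -EINTR on fatal signal */
        }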
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit
    test compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

08 Apr, 2020

6 commits

  • lookup_node() uses gup to pin the page and get node information. It
    checks against ret>=0, assuming the page will be filled in. However, it's
    also possible that gup will return zero, for example when the thread is
    quickly killed with a fatal signal. Teach lookup_node() to gracefully
    return an error -EFAULT if that happens.

    Meanwhile, initialize "page" to NULL to avoid the potential risk of
    using an uninitialized pointer.

    Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
    Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
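
    The conversion, in miniature (fallthrough is the pseudo-keyword from
    include/linux/compiler_attributes.h):

        /* before */
        case MPOL_BIND:
                /* ... */
                /* fallthrough */
        case MPOL_INTERLEAVE:

        /* after */
        case MPOL_BIND:
                /* ... */
                fallthrough;
        case MPOL_INTERLEAVE: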
     
  • Sparse reports a warning at queue_pages_pmd():

    context imbalance in queue_pages_pmd() - unexpected unlock

    The root cause is a missing annotation on queue_pages_pmd(). Add the
    missing __releases(ptl) annotation.

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
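
    The annotation tells sparse that the function returns with ptl
    released; a sketch of the annotated signature:

        static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
                                   unsigned long end, struct mm_walk *walk)
                __releases(ptl)
        {
                /* ... every exit path ends with spin_unlock(ptl) ... */
        }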
     
  • change_protection() is used by both the NUMA and the mprotect() code,
    and there's one parameter for each of the two callers (dirty_accountable
    and prot_numa). Further, these parameters are passed along the call chain:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flags argument for change_protection() and all these
    helpers to replace those parameters, so we can avoid passing multiple
    parameters multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
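
    The shape of the change, sketched (these two flag bits are the ones
    introduced by this series; the userfaultfd write-protect bit comes
    later):

        #define MM_CP_DIRTY_ACCT        (1UL << 0)
        #define MM_CP_PROT_NUMA         (1UL << 1)

        /* before: one bool per caller */
        change_protection(vma, start, end, newprot, dirty_accountable, prot_numa);

        /* after: a single cp_flags word */
        change_protection(vma, start, end, newprot,
                          dirty_accountable ? MM_CP_DIRTY_ACCT : 0);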
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag PG_swapbacked.
    This makes page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All of this is put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
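
    The rename keeps the helper's body; roughly:

        /* was page_is_file_cache(); anon and shmem pages are swap backed */
        static inline int page_is_file_lru(struct page *page)
        {
                return !PageSwapBacked(page);
        }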
     
  • Let's move the vma_is_accessible() helper to include/linux/mm.h, which
    makes it available for general use. While at it, replace all remaining
    open codings of the VMA access check with vma_is_accessible().

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Acked-by: Guo Ren
    Acked-by: Vlastimil Babka
    Cc: Guo Ren
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: "Aneesh Kumar K.V"
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Nick Piggin
    Cc: Paul Mackerras
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
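
    The up-leveled helper, roughly as it lands in include/linux/mm.h, so
    open-coded tests of VM_READ/VM_WRITE/VM_EXEC collapse into one call:

        static inline bool vma_is_accessible(struct vm_area_struct *vma)
        {
                return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
        }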
     

03 Apr, 2020

4 commits

  • Using an empty (malformed) nodelist that is not caught during mount option
    parsing leads to a stack-out-of-bounds access.

    The option string that was used was: "mpol=prefer:,". However,
    MPOL_PREFERRED requires a single node number, which is not being provided
    here.

    Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
    nodeid.

    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Reported-by: Entropy Moe
    Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Cc: Lee Schermerhorn
    Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
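
    A hedged sketch of the added guard in mpol_parse_str(): after parsing
    the nodelist for MPOL_PREFERRED, an empty node mask is now rejected:

        case MPOL_PREFERRED:
                /* Insist on a nodelist of one node only */
                if (nodelist) {
                        char *rest = nodelist;

                        while (isdigit(*rest))
                                rest++;
                        if (*rest)
                                goto out;
                        if (nodes_empty(nodes))         /* the added check */
                                goto out;
                }
                break;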
     
  • VM_BUG_ON() is already used by queue_pages_test_walk(); it is better to
    dump more debug information by using VM_BUG_ON_VMA() to help debugging.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: "Li Xinhai"
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
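
    VM_BUG_ON_VMA() takes the vma and dumps it (via dump_vma()) before
    triggering the BUG; an illustrative use (the condition here is made up
    for the example):

        VM_BUG_ON_VMA(start < vma->vm_start || end > vma->vm_end, vma);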
     
  • vma_migratable() is called to check whether pages in a vma can be
    migrated before going ahead with further actions. Currently it is used
    in the following code paths:

    - task_numa_work
    - mbind
    - move_pages

    For a hugetlb mapping, whether the vma is migratable or not is
    determined by:
    - CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
    - arch_hugetlb_migration_supported

    Issue: the current code checks CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
    alone, even though no code should use it directly. (Note that the
    current code in vma_migratable doesn't cause a failure or bug, because
    unmap_and_move_huge_page() will catch an unsupported hugepage and handle
    it properly.)

    This patch checks both factors via hugepage_migration_supported,
    improving the code logic and robustness. It enables an early bail-out of
    the hugepage migration procedure, but because all architectures
    currently supporting hugepage migration support all page sizes, we will
    not see a performance gain with this patch applied.

    vma_migratable() is moved to mm/mempolicy.c because the circular
    dependency between mempolicy.h and hugetlb.h makes defining it inline
    infeasible.

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
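
    A hedged sketch of the strengthened check in vma_migratable():

        bool vma_migratable(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_IO | VM_PFNMAP))
                        return false;

                /* check runtime support, not just the config option */
                if (is_vm_hugetlb_page(vma) &&
                    !hugepage_migration_supported(hstate_vma(vma)))
                        return false;

                /* ... remaining checks unchanged ... */
                return true;
        }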
     
  • MPOL_MF_STRICT is used in mbind() for two purposes:

    (1) MPOL_MF_STRICT is set alone, without MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL, to check whether there is a misplaced page and return
    -EIO;

    (2) MPOL_MF_STRICT is set with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, to
    check whether there is a misplaced page that failed to be isolated, or a
    page that was isolated but failed to be moved, and return -EIO.

    For non-hugepage mappings, (1) and (2) are implemented as expected. For
    hugepage mappings, (1) is not implemented, and in (2) the part about
    failing to isolate and reporting -EIO is not implemented.

    This patch implements the missing parts for hugepage mappings. Benefits
    with it applied:

    - User space can apply the same code logic to handle mbind() on hugepage
    and non-hugepage mappings;

    - MPOL_MF_STRICT alone can reliably be used to check whether there is a
    misplaced page when binding a policy to an address range, especially for
    an address range that contains both hugepage and non-hugepage mappings.

    Analysis of the potential impact on existing users:

    - If MPOL_MF_STRICT alone was previously used, hugetlb pages not
    following the memory policy would not cause an EIO error. After this
    change, hugetlb pages are treated like all other pages: if
    MPOL_MF_STRICT alone is used and hugetlb pages do not follow the memory
    policy, an EIO error will be returned.

    - For users using MPOL_MF_STRICT with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL,
    the semantics that some pages could not be moved are not changed by this
    patch, because failing to isolate and failing to move have the same
    effect for users, so their existing code will not be impacted.

    In the mbind man page, the note that 'MPOL_MF_STRICT is ignored on huge
    page mappings' can be removed after this patch is applied.

    Mike:

    : The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
    : and does not match documentation (as described above). The special
    : behavior for hugetlb pages ideally should have been removed when hugetlb
    : page migration was introduced. It is unlikely that anyone relies on
    : today's inconsistent behavior, and removing one more case of special
    : handling for hugetlb pages is a good thing.

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: linux-man
    Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
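
    A hedged, simplified sketch of the hugetlb queueing logic after this
    patch: MPOL_MF_STRICT alone now reports a misplaced huge page, and an
    isolation failure under the MOVE flags is flagged for the caller:

        /* inside queue_pages_hugetlb(), simplified */
        if (!queue_pages_required(page, qp))
                goto unlock;            /* page is on an allowed node */

        if (flags == MPOL_MF_STRICT) {
                ret = -EIO;             /* (1): report misplaced page */
                goto unlock;
        }

        if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
                if (!isolate_huge_page(page, qp->pagelist) &&
                    (flags & MPOL_MF_STRICT))
                        ret = 1;        /* (2): caller turns this into -EIO */
        }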
     

18 Feb, 2020

2 commits

  • Update numa_map_to_online_node() to stop falling back to numa node 0
    when the input is NUMA_NO_NODE. Also, skip the lookup if @node is
    online. This makes the routine compatible with other arch node mapping
    routines.

    Reported-by: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157401275716.43284.13185549705765009174.stgit@dwillia2-desk3.amr.corp.intel.com
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Link: https://lore.kernel.org/r/158188325316.894464.15650888748083329531.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
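
    The routine after the update, roughly as it reads in mm/mempolicy.c:
    NUMA_NO_NODE and already-online nodes return immediately, otherwise the
    closest online node by distance wins:

        int numa_map_to_online_node(int node)
        {
                int min_dist = INT_MAX, dist, n, min_node;

                if (node == NUMA_NO_NODE || node_online(node))
                        return node;    /* no more fallback to node 0 */

                min_node = node;
                for_each_online_node(n) {
                        dist = node_distance(node, n);
                        if (dist < min_dist) {
                                min_dist = dist;
                                min_node = n;
                        }
                }
                return min_node;
        }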
     
  • The acpi_map_pxm_to_online_node() helper is used to find the closest
    online node to a given proximity domain. This is used to map devices in
    a proximity domain with no online memory or cpus to the closest online
    node and populate a device's 'numa_node' property. The numa_node
    property allows applications to be migrated "close" to a resource.

    In preparation for providing a generic facility to optionally map an
    address range to its closest online node, or the node the range would
    represent were it to be onlined (target_node), up-level the core of
    acpi_map_pxm_to_online_node() to a generic mm/numa helper.

    Cc: Michal Hocko
    Acked-by: Rafael J. Wysocki
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Link: https://lore.kernel.org/r/158188324802.894464.13128795207831894206.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

01 Feb, 2020

1 commit

  • What we are trying to do is change the '=' character to a NUL terminator
    and then, at the end of the function, restore it back to an '='. The
    problem is that there are two error paths where we jump to the end of
    the function before we have replaced the '=' with NUL.

    We end up putting the '=' in the wrong place (possibly one element
    before the start of the buffer).

    Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
    Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Signed-off-by: Dan Carpenter
    Acked-by: Vlastimil Babka
    Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
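
    The fix, sketched: detach the flags portion before any parsing error
    path can jump to the cleanup label, so the restore always targets the
    byte that was actually cleared:

        char *flags = strchr(str, '=');

        if (flags)
                *flags++ = '\0';        /* now done before the error paths */

        if (nodelist) {
                *nodelist++ = '\0';
                if (nodelist_parse(nodelist, nodes))
                        goto out;       /* previously reachable before the NUL write */
                /* ... */
        }
        /* ... */
    out:
        if (flags)
                *--flags = '=';         /* restores the '=' in the right place */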
     

14 Jan, 2020

1 commit

  • THP page faults now attempt a __GFP_THISNODE allocation first, which
    should only compact existing free memory, followed by another attempt
    that can allocate from any node using reclaim/compaction effort
    specified by global defrag setting and madvise.

    This patch makes the following changes to the scheme:

    - Before the patch, the first allocation relies on a check for
    pageblock order and __GFP_IO to prevent excessive reclaim. This,
    however, also affects the second attempt, which is not limited to a
    single node.

    Instead of that, reuse the existing check for costly-order
    __GFP_NORETRY allocations, and make sure the first THP attempt uses
    __GFP_NORETRY. As a side effect, all costly-order __GFP_NORETRY
    allocations will bail out if compaction needs reclaim, while
    previously they only bailed out when compaction was deferred due to
    previous failures.

    This should still be acceptable within the __GFP_NORETRY semantics.

    - Before the patch, the second allocation attempt (on all nodes) was
    passing __GFP_NORETRY. This is redundant, as the check for pageblock
    order (discussed above) was stronger. It's also contrary to
    madvise(MADV_HUGEPAGE), which requests some effort to allocate THP.

    After this patch, the second attempt passes neither __GFP_THISNODE nor
    __GFP_NORETRY.

    To sum up, THP page faults now try the following attempts:

    1. local node only THP allocation with no reclaim, just compaction.
    2. THP allocation from any node, for madvised VMAs or always when
    synchronous compaction is enabled, with effort determined by the global
    defrag setting and VMA madvise.
    3. fallback to base pages on any node.

    Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
    Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
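
    The resulting sequence, sketched as illustrative gfp logic (a
    simplification of the fault path, not the literal diff):

        /* 1. local node: compact only; __GFP_NORETRY bails out instead of
         * reclaiming when compaction wants more free memory */
        gfp = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE | __GFP_NORETRY;
        page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr,
                               numa_node_id(), true);
        if (page)
                return page;

        /* 2. any node: effort follows the defrag setting and madvise;
         * neither __GFP_THISNODE nor __GFP_NORETRY is passed anymore */
        gfp = alloc_hugepage_direct_gfpmask(vma);
        page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr,
                               numa_node_id(), true);

        /* 3. on failure, the fault handler falls back to base pages */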