03 Nov, 2020

1 commit

  • When flags in queue_pages_pte_range don't have MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL bits, the code breaks out of the loop, and passing the
    original pte - 1 to pte_unmap_unlock is not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns with EIO when encountering such a
    page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return
    -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte
    in the range results in pte_unmap_unlock on an underflow pte. This can
    lead to lockups later on when somebody tries to lock the pte resp. the
    page_table_lock again.
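
    A minimal sketch of the resulting pattern (the mapped_pte name is
    illustrative, not necessarily the exact upstream diff): remember the pte
    returned by pte_offset_map_lock() and unlock that one, regardless of
    where the loop stops:

    pte_t *pte, *mapped_pte;
    spinlock_t *ptl;

    mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
    for (; addr != end; pte++, addr += PAGE_SIZE) {
            /* ... examine the page behind *pte ... */
            if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)))
                    break;          /* breaking on the first pte is now safe */
    }
    pte_unmap_unlock(mapped_pte, ptl);      /* instead of pte - 1 */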

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Cc:
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
     

14 Oct, 2020

2 commits

  • No one uses this macro anymore.

    Also fix code style of policy_node().

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921021401.84508-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • It is not necessary to hold current's task lock when setting the
    nodemask of a new policy.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200921040416.86185-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

5 commits

  • There is a well-defined migration target allocation callback. Use it.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is no difference between two migration callback functions,
    alloc_huge_page_node() and alloc_huge_page_nodemask(), except
    __GFP_THISNODE handling. It's redundant to have two almost similar
    functions in order to handle this flag. So, this patch tries to remove
    one by introducing a new argument, gfp_mask, to
    alloc_huge_page_nodemask().

    After introducing the gfp_mask argument, it is the caller's job to provide
    the correct gfp_mask. So, every call site of alloc_huge_page_nodemask() is
    changed to provide a gfp_mask.

    Note that it's safe to remove a node id check in alloc_huge_page_node()
    since there is no caller passing NUMA_NO_NODE as a node id.
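
    A rough sketch of what a converted caller looks like (assuming
    htlb_alloc_mask() as the source of the base mask); a call site that
    previously relied on alloc_huge_page_node()'s implicit __GFP_THISNODE now
    passes it explicitly:

    struct hstate *h = page_hstate(compound_head(page));
    gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

    /* allocate only from 'nid'; no nodemask-based fallback */
    new_page = alloc_huge_page_nodemask(h, nid, NULL, gfp_mask);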

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The previous implementation calls untagged_addr() before the error check;
    if the error check fails and returns EINVAL, the untagged_addr() call is
    just useless work.
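
    The change is a simple reordering in the syscall wrapper, roughly (the
    exact validity check shown here is approximate):

    /* before: untag even when the arguments are about to be rejected */
    addr = untagged_addr(addr);
    if (nmask != NULL && maxnode < nr_node_ids)
            return -EINVAL;

    /* after: do the cheap validation first, untag only if we proceed */
    if (nmask != NULL && maxnode < nr_node_ids)
            return -EINVAL;
    addr = untagged_addr(addr);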

    Signed-off-by: Wenchao Hao
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
    Signed-off-by: Linus Torvalds

    Wenchao Hao
     
  • Fix W=1 compile warnings (invalid kerneldoc):

    mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
    mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'
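
    The fix is the usual kerneldoc repair: make the documented parameter name
    match the C parameter, roughly:

    /**
     * numa_map_to_online_node - Find closest online node
     * @node: Node id to start the search
     *
     * Lookup node id to find the closest online node.
     */
    int numa_map_to_online_node(int node)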

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • In the reservation routine, we only check whether the cpuset meets the
    memory allocation requirements, but we ignore the mempolicy of the
    MPOL_BIND case. If someone mmaps hugetlb memory successfully, the
    subsequent memory allocation may still fail due to mempolicy restrictions
    and the process receives the SIGBUS signal. This can be reproduced by the
    following steps.

    1) Compile the test case.
    cd tools/testing/selftests/vm/
    gcc map_hugetlb.c -o map_hugetlb

    2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
    system. Each node will pre-allocate one huge page.
    echo 2 > /proc/sys/vm/nr_hugepages

    3) Run the test case (mmap 4MB). We receive the SIGBUS signal.
    numactl --membind=0 ./map_hugetlb 4

    With this patch applied, the mmap will fail in the step 3) and throw
    "mmap: Cannot allocate memory".

    [akpm@linux-foundation.org: include sched.h for `current']

    Reported-by: Jianchao Guo
    Suggested-by: Michal Hocko
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Baoquan He
    Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds

    Muchun Song
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
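
    The macro was simply a self-assignment to silence the compiler
    (uninitialized_var(x) expanded to "x = x"), so the script's effect on a
    typical declaration is:

    -       unsigned long uninitialized_var(flags);
    +       unsigned long flags;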

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
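
    For example, a typical call site ends up as:

    -       down_read(&mm->mmap_sem);
    +       mmap_read_lock(mm);
            ...
    -       up_read(&mm->mmap_sem);
    +       mmap_read_unlock(mm);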

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

1 commit

  • ba841078cd05 ("mm/mempolicy: Allow lookup_node() to handle fatal signal")
    has added a special casing for 0 return value because that was a possible
    gup return value when interrupted by fatal signal. This has been fixed by
    ae46d2aa6a7f ("mm/gup: Let __get_user_pages_locked() return -EINTR for
    fatal signal") in the mean time so ba841078cd05 can be reverted.

    This patch however doesn't go all the way to revert it because the check
    for 0 is wrong and confusing here. Firstly it is inherently unsafe to
    access the page when get_user_pages_locked returns 0 (aka no page
    returned).

    Fortunately, this will not happen because get_user_pages_locked will not
    return 0 when nr_pages > 0 unless FOLL_NOWAIT is specified which is not
    the case here. Document this potential error code in gup code while we
    are at it.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Peter Xu
    Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes, that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit
    test compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

08 Apr, 2020

6 commits

  • lookup_node() uses gup to pin the page and get node information. It
    checks against ret>=0 assuming the page will be filled in. However it's
    also possible that gup will return zero, for example, when the thread is
    quickly killed with a fatal signal. Teach lookup_node() to gracefully
    return an error -EFAULT if it happens.

    Meanwhile, initialize "page" to NULL to avoid potential risk of
    exploiting the pointer.
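
    A minimal sketch of the resulting handling in lookup_node() (simplified;
    locking details elided):

    struct page *p = NULL;
    int locked = 1;
    int err;

    err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
    if (err == 0) {
            /* e.g. GUP interrupted by a fatal signal: no page returned */
            err = -EFAULT;
    } else if (err > 0) {
            err = page_to_nid(p);
            put_page(p);
    }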

    Fixes: 4426e945df58 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
    Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
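
    The conversion is mechanical; a typical switch statement changes like this
    (case labels here are hypothetical):

            switch (mode) {
            case FOO:
                    handle_foo();
    -               /* fallthrough */
    +               fallthrough;
            case BAR:
                    handle_bar();
                    break;
            }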

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Sparse reports a warning at queue_pages_pmd()

    context imbalance in queue_pages_pmd() - unexpected unlock

    The root cause is the missing annotation at queue_pages_pmd().
    Add the missing __releases(ptl).
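
    The annotation tells sparse that the function exits with the lock
    released; roughly (the exact parameter list may differ):

    static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
                               unsigned long end, struct mm_walk *walk)
            __releases(ptl)
    {
            ...
            spin_unlock(ptl);       /* the unlock sparse complained about */
            ...
    }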

    Signed-off-by: Jules Irenge
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com
    Signed-off-by: Linus Torvalds

    Jules Irenge
     
  • change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protection() and all these helpers to
    replace these parameters. Then we can avoid passing multiple parameters
    multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.
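
    A sketch of the resulting interface (flag names as later found in
    include/linux/mm.h; treat the exact list and the example call as
    illustrative):

    #define MM_CP_DIRTY_ACCT        (1UL << 0)
    #define MM_CP_PROT_NUMA         (1UL << 1)

    unsigned long change_protection(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    pgprot_t newprot, unsigned long cp_flags);

    /* e.g. the NUMA hinting path */
    change_protection(vma, start, end, PAGE_NONE, MM_CP_PROT_NUMA);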

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag PG_swapbacked.
    This makes page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All of this is put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Let's move the vma_is_accessible() helper to include/linux/mm.h, which
    makes it available for general use. While here, replace all remaining open
    encodings of the VMA access check with vma_is_accessible().
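
    The open-coded checks being replaced look roughly like this:

    -       if (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))
    +       if (vma_is_accessible(vma))
                    /* the task can actually touch memory in this VMA */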

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Acked-by: Guo Ren
    Acked-by: Vlastimil Babka
    Cc: Guo Ren
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: "Aneesh Kumar K.V"
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Nick Piggin
    Cc: Paul Mackerras
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Apr, 2020

4 commits

  • Using an empty (malformed) nodelist that is not caught during mount option
    parsing leads to a stack-out-of-bounds access.

    The option string that was used was: "mpol=prefer:,". However,
    MPOL_PREFERRED requires a single node number, which is not being provided
    here.

    Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
    nodeid.
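
    A sketch of the added check in the MPOL_PREFERRED branch of the mount
    option parser (simplified):

            case MPOL_PREFERRED:
                    /* insist on a non-empty nodelist of exactly one node */
                    if (nodelist) {
                            ...
                            if (nodes_empty(nodes))
                                    goto out;       /* "mpol=prefer:," rejected */
                    }
                    break;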

    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Reported-by: Entropy Moe
    Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Cc: Lee Schermerhorn
    Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • VM_BUG_ON() is already used by queue_pages_test_walk(); it sounds better
    to dump more debug information by using VM_BUG_ON_VMA() to help
    debugging.
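
    The difference is only in how much state gets dumped when the assertion
    trips; with a hypothetical condition:

    -       VM_BUG_ON(!vma_migratable(vma));
    +       VM_BUG_ON_VMA(!vma_migratable(vma), vma);       /* also dumps the vma */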

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: "Li Xinhai"
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • vma_migratable() is called to check whether pages in a vma can be
    migrated before going ahead with further actions. Currently it is used in
    the following code paths:

    - task_numa_work
    - mbind
    - move_pages

    For hugetlb mappings, whether a vma is migratable or not is determined by:
    - CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
    - arch_hugetlb_migration_supported

    Issue: the current code only checks CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
    alone, and no code should use it directly. (Note that the current code in
    vma_migratable() doesn't cause a failure or bug, because
    unmap_and_move_huge_page() will catch an unsupported hugepage and handle
    it properly.)

    This patch checks both factors via hugepage_migration_supported(),
    improving the code logic and robustness. It enables an early bail-out of
    the hugepage migration procedure, but because all architectures currently
    supporting hugepage migration support all page sizes, we would not see a
    performance gain with this patch applied.

    vma_migratable() is moved to mm/mempolicy.c, because the circular
    dependency between mempolicy.h and hugetlb.h makes defining it as inline
    infeasible.
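
    Roughly, the hugetlb case in vma_migratable() becomes (a sketch, not the
    verbatim diff):

    bool vma_migratable(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_IO | VM_PFNMAP))
                    return false;

            /* consult the hstate/arch, not just the Kconfig option */
            if (is_vm_hugetlb_page(vma) &&
                !hugepage_migration_supported(hstate_vma(vma)))
                    return false;
            ...
            return true;
    }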

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     
  • MPOL_MF_STRICT is used in mbind() for two purposes:

    (1) MPOL_MF_STRICT is set alone, without MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL, to check if there is a misplaced page and return -EIO;

    (2) MPOL_MF_STRICT is set with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, to
    check if there is a misplaced page which failed to be isolated, or a page
    which was isolated successfully but failed to be moved, and return -EIO.

    For non-hugepage mappings, (1) and (2) are implemented as expected. For
    hugepage mappings, (1) is not implemented, and in (2) the part about
    failing to isolate and reporting -EIO is not implemented.

    This patch implements the missing parts for hugepage mappings. Benefits
    with it applied:

    - User space can apply the same code logic to handle mbind() on hugepage
    and non-hugepage mappings;

    - MPOL_MF_STRICT alone can reliably be used to check whether there is a
    misplaced page or not when binding a policy to an address range,
    especially for address ranges which contain both hugepage and
    non-hugepage mappings.

    Analysis of potential impact to existing users:

    - If MPOL_MF_STRICT alone was previously used, hugetlb pages not
    following the memory policy would not cause an EIO error. After this
    change, hugetlb pages are treated like all other pages: if
    MPOL_MF_STRICT alone is used and hugetlb pages do not follow the memory
    policy, an EIO error will be returned.

    - For users using MPOL_MF_STRICT with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL,
    the semantics around pages which could not be moved are not changed by
    this patch, because failing to isolate and failing to move have the same
    effect for users, so their existing code will not be impacted.

    In the mbind man page, the note that 'MPOL_MF_STRICT is ignored on huge
    page mappings' can be removed after this patch is applied.

    Mike:

    : The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
    : and does not match documentation (as described above). The special
    : behavior for hugetlb pages ideally should have been removed when hugetlb
    : page migration was introduced. It is unlikely that anyone relies on
    : today's inconsistent behavior, and removing one more case of special
    : handling for hugetlb pages is a good thing.
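
    With the patch, user space can use MPOL_MF_STRICT alone as a pure
    placement check even on mappings that mix hugepages and normal pages. A
    minimal, hypothetical probe (userspace, link with -lnuma; note that it
    also installs MPOL_BIND on the range as a side effect):

    #include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_MF_STRICT */
    #include <errno.h>
    #include <stdio.h>

    /* Return 0 if every page in [addr, addr + len) sits on node 0. */
    static int all_on_node0(void *addr, unsigned long len)
    {
            unsigned long nodemask = 1UL << 0;      /* node 0 only */

            if (mbind(addr, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_STRICT) == 0)
                    return 0;
            if (errno == EIO)
                    fprintf(stderr, "misplaced page(s) in range\n");
            else
                    perror("mbind");
            return -1;
    }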

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: linux-man
    Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     

18 Feb, 2020

2 commits

  • Update numa_map_to_online_node() to stop falling back to numa node 0
    when the input is NUMA_NO_NODE. Also, skip the lookup if @node is
    online. This makes the routine compatible with other arch node mapping
    routines.
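
    After the change the helper has roughly this shape (a sketch based on the
    description above):

    int numa_map_to_online_node(int node)
    {
            int min_dist = INT_MAX, dist, n, min_node;

            /* pass through NUMA_NO_NODE and nodes that are already online */
            if (node == NUMA_NO_NODE || node_online(node))
                    return node;

            min_node = node;
            for_each_online_node(n) {
                    dist = node_distance(node, n);
                    if (dist < min_dist) {
                            min_dist = dist;
                            min_node = n;
                    }
            }
            return min_node;
    }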

    Reported-by: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157401275716.43284.13185549705765009174.stgit@dwillia2-desk3.amr.corp.intel.com
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Link: https://lore.kernel.org/r/158188325316.894464.15650888748083329531.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • The acpi_map_pxm_to_online_node() helper is used to find the closest
    online node to a given proximity domain. This is used to map devices in
    a proximity domain with no online memory or cpus to the closest online
    node and populate a device's 'numa_node' property. The numa_node
    property allows applications to be migrated "close" to a resource.

    In preparation for providing a generic facility to optionally map an
    address range to its closest online node, or the node the range would
    represent were it to be onlined (target_node), up-level the core of
    acpi_map_pxm_to_online_node() to a generic mm/numa helper.

    Cc: Michal Hocko
    Acked-by: Rafael J. Wysocki
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Link: https://lore.kernel.org/r/158188324802.894464.13128795207831894206.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

01 Feb, 2020

1 commit

  • What we are trying to do is change the '=' character to a NUL terminator
    and then at the end of the function we restore it back to an '='. The
    problem is there are two error paths where we jump to the end of the
    function before we have replaced the '=' with NUL.

    We end up putting the '=' in the wrong place (possibly one element
    before the start of the buffer).

    Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
    Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Signed-off-by: Dan Carpenter
    Acked-by: Vlastimil Babka
    Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

14 Jan, 2020

1 commit

  • THP page faults now attempt a __GFP_THISNODE allocation first, which
    should only compact existing free memory, followed by another attempt
    that can allocate from any node using reclaim/compaction effort
    specified by global defrag setting and madvise.

    This patch makes the following changes to the scheme:

    - Before the patch, the first allocation relies on a check for
    pageblock order and __GFP_IO to prevent excessive reclaim. This, however,
    also affects the second attempt, which is not limited to a single node.

    Instead of that, reuse the existing check for costly order
    __GFP_NORETRY allocations, and make sure the first THP attempt uses
    __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
    allocations will bail out if compaction needs reclaim, while
    previously they only bailed out when compaction was deferred due to
    previous failures.

    This should be still acceptable within the __GFP_NORETRY semantics.

    - Before the patch, the second allocation attempt (on all nodes) was
    passing __GFP_NORETRY. This is redundant as the check for pageblock
    order (discussed above) was stronger. It's also contrary to
    madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
    requested.

    After this patch, the second attempt doesn't pass __GFP_THISNODE nor
    __GFP_NORETRY.

    To sum up, THP page faults now try the following attempts:

    1. local node only THP allocation with no reclaim, just compaction.
    2. for madvised VMA's or when synchronous compaction is enabled always - THP
    allocation from any node with effort determined by global defrag setting
    and VMA madvise
    3. fallback to base pages on any node
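
    A condensed sketch of the resulting huge-page fault path in
    alloc_pages_vma() (simplified; names and conditions abbreviated):

    /* 1. local node only: compaction, but no reclaim */
    page = __alloc_pages_node(hpage_node,
                              gfp | __GFP_THISNODE | __GFP_NORETRY, order);

    /* 2. madvised / defrag==always: any node, full effort per defrag setting */
    if (!page && (gfp & __GFP_DIRECT_RECLAIM))
            page = __alloc_pages_node(hpage_node, gfp, order);

    /* 3. otherwise the caller falls back to base pages */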

    Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
    Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

02 Dec, 2019

2 commits

  • mbind() is required to report EFAULT if the range, specified by addr and
    len, contains unmapped holes. In the current implementation, the
    following rules are applied for this check:

    1: Unmapped holes at any part of the specified range should be reported
    as EFAULT if mbind() is called for non-MPOL_DEFAULT cases;

    2: Unmapped holes at any part of the specified range should be ignored
    (do not report EFAULT) if mbind() is called for the MPOL_DEFAULT case;

    3: A range lying entirely within an unmapped hole should be reported as
    EFAULT;

    Note that rule 2 does not fulfill the mbind() API definition, but since
    that behavior has existed for a long time (the internal flag
    MPOL_MF_DISCONTIG_OK exists for this purpose), this patch does not plan
    to change it.

    In the current code, applications observe inconsistent behavior for rule
    1 and rule 2 respectively. That inconsistency is fixed as detailed below.

    Cases of rule 1:

    - Hole at head side of range. Current code reports EFAULT, no change by
    this patch.

    [  vma  ][  hole  ][  vma  ]
                 [   range   ]

    - Hole at middle of range. Current code reports EFAULT, no change by
    this patch.

    [  vma  ][  hole  ][  vma  ]
        [      range      ]

    - Hole at tail side of range. Current code does not report EFAULT, this
    patch fixes it.

    [  vma  ][  hole  ][  vma  ]
    [   range   ]

    Cases of rule 2:

    - Hole at head side of range. Current code reports EFAULT, this patch
    fixes it.

    [  vma  ][  hole  ][  vma  ]
                 [   range   ]

    - Hole at middle of range. Current code does not report EFAULT, no
    change by this patch.

    [  vma  ][  hole  ][  vma  ]
        [      range      ]

    - Hole at tail side of range. Current code does not report EFAULT, no
    change by this patch.

    [  vma  ][  hole  ][  vma  ]
    [   range   ]

    This patch has no changes to rule 3.

    The unmapped hole checking could also be handled by using .pte_hole()
    instead of .test_walk(). But .pte_hole() is called for holes both inside
    and outside vmas, which costs more, so this patch keeps the original
    design with .test_walk().

    Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com
    Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
    Signed-off-by: Li Xinhai
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: linux-man
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Xinhai
     
  • Patch series "mm: Fix checking unmapped holes for mbind", v4.

    This patchset fix checking unmapped holes for mbind().

    First patch makes sure the vma been correctly tracked in .test_walk(),
    so each time when .test_walk() is called, the neighborhood of two vma
    is correct.

    Current problem is that the !vma_migratable() check could cause return
    immediately without update tracking to vma.

    Second patch fix the inconsistent report of EFAULT when mbind() is
    called for MPOL_DEFAULT and non MPOL_DEFAULT cases, so application do
    not need to have workaround code to handle this special behavior.
    Currently there are two problems, one is that the .test_walk() can not
    know there is hole at tail side of range, because .test_walk() only
    call for vma not for hole. The other one is that mbind_range() checks
    for hole at head side of range but do not consider the
    MPOL_MF_DISCONTIG_OK flag as done in .test_walk().

    This patch (of 2):

    Checking the unmapped hole and updating the previous vma must be handled
    first; otherwise the unmapped hole could be calculated from a wrong
    previous vma.

    Several commits were relevant to this error:

    - commit 6f4576e3687b ("mempolicy: apply page table walker on
    queue_pages_range()")

    This commit was correct; the VM_PFNMAP check was after updating the
    previous vma.

    - commit 48684a65b4e3 ("mm: pagewalk: fix misbehavior of
    walk_page_range for vma(VM_PFNMAP)")

    This commit added a VM_PFNMAP check before updating the previous vma, so
    there were then two VM_PFNMAP checks doing the same thing twice.

    - commit acda0c334028 ("mm/mempolicy.c: get rid of duplicated check for
    vma(VM_PFNMAP) in queue_pages_range()")

    This commit tried to fix the duplicated VM_PFNMAP check, but it wrongly
    removed the one which was after updating the vma.

    Link: http://lkml.kernel.org/r/1573218104-11021-2-git-send-email-lixinhai.lxh@gmail.com
    Fixes: acda0c334028 (mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range())
    Signed-off-by: Li Xinhai
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: linux-man
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Xinhai
     

16 Nov, 2019

1 commit

  • Commit d883544515aa ("mm: mempolicy: make the behavior consistent when
    MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
    of mbind() for a couple of corner cases. But it altered the errno for
    some other cases; for example, mbind() should return -EFAULT when part
    or all of the memory range specified by nodemask and maxnode points
    outside your accessible address space, or when there was an unmapped
    hole in the memory range specified by addr and len.

    Fix this by preserving the errno returned by queue_pages_range(). Also,
    the pagelist may not be empty even though queue_pages_range() returns an
    error; put those pages back on the LRU, since mbind_range() is not called
    to actually apply the policy, so those pages should not be migrated. This
    is also the old behavior before the problematic commit.

    Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
    Signed-off-by: Yang Shi
    Reported-by: Li Xinhai
    Reviewed-by: Li Xinhai
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: [4.19 and 5.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

29 Sep, 2019

4 commits

  • Merge hugepage allocation updates from David Rientjes:
    "We (mostly Linus, Andrea, and myself) have been discussing offlist how
    to implement a sane default allocation strategy for hugepages on NUMA
    platforms.

    With these reverts in place, the page allocator will happily allocate
    a remote hugepage immediately rather than try to make a local hugepage
    available. This incurs a substantial performance degradation when
    memory compaction would have otherwise made a local hugepage
    available.

    This series reverts those reverts and attempts to propose a more sane
    default allocation strategy specifically for hugepages. Andrea
    acknowledges this is likely to fix the swap storms that he originally
    reported that resulted in the patches that removed __GFP_THISNODE from
    hugepage allocations.

    The immediate goal is to return 5.3 to the behavior the kernel has
    implemented over the past several years so that remote hugepages are
    not immediately allocated when local hugepages could have been made
    available because the increased access latency is untenable.

    The next goal is to introduce a sane default allocation strategy for
    hugepages allocations in general regardless of the configuration of
    the system so that we prevent thrashing of local memory when
    compaction is unlikely to succeed and can prefer remote hugepages over
    remote native pages when the local node is low on memory."

    Note on timing: this reverts the hugepage VM behavior changes that got
    introduced fairly late in the 5.3 cycle, and that fixed a huge
    performance regression for certain loads that had been around since
    4.18.

    Andrea had this note:

    "The regression of 4.18 was that it was taking hours to start a VM
    where 3.10 was only taking a few seconds, I reported all the details
    on lkml when it was finally tracked down in August 2018.

    https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

    __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
    workload degrade like in the "current upstream" above. And it still
    would have been that bad as above until 5.3-rc5"

    where the bad behavior ends up happening as you fill up a local node,
    and without that change, you'd get into the nasty swap storm behavior
    due to compaction working overtime to make room for more memory on the
    nodes.

    As a result 5.3 got the two performance fix reverts in rc5.

    However, David Rientjes then noted that those performance fixes in turn
    regressed performance for other loads - although not quite to the same
    degree. He suggested reverting the reverts and instead replacing them
    with two small changes to how hugepage allocations are done (patch
    descriptions rephrased by me):

    - "avoid expensive reclaim when compaction may not succeed": just admit
    that the allocation failed when you're trying to allocate a huge-page
    and compaction wasn't successful.

    - "allow hugepage fallback to remote nodes when madvised": when that
    node-local huge-page allocation failed, retry without forcing the
    local node.

    but by then I judged it too late to replace the fixes for a 5.3 release.
    So 5.3 was released with behavior that harked back to the pre-4.18 logic.

    But now we're in the merge window for 5.4, and we can see if this
    alternate model fixes not just the horrendous swap storm behavior, but
    also restores the performance regression that the late reverts caused.

    Fingers crossed.

    * emailed patches from David Rientjes :
    mm, page_alloc: allow hugepage fallback to remote nodes when madvised
    mm, page_alloc: avoid expensive reclaim when compaction may not succeed
    Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
    Revert "Revert "mm, thp: restore node-local hugepage allocations""

    Linus Torvalds
     
  • For systems configured to always try hard to allocate transparent
    hugepages (thp defrag setting of "always") or for memory that has been
    explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
    remote memory to allocate the hugepage if the local allocation fails
    first.

    The point is to allow the initial call to __alloc_pages_node() to attempt
    to defragment local memory to make a hugepage available, if possible,
    rather than immediately fallback to remote memory. Local hugepages will
    always have a better access latency than remote (huge)pages, so an attempt
    to make a hugepage available locally is always preferred.

    If memory compaction cannot be successful locally, however, it is likely
    better to fallback to remote memory. This could take on two forms: either
    allow immediate fallback to remote memory or do per-zone watermark checks.
    It would be possible to fallback only when per-zone watermarks fail for
    order-0 memory, since that would require local reclaim for all subsequent
    faults so remote huge allocation is likely better than thrashing the local
    zone for large workloads.

    In this case, it is assumed that because the system is configured to try
    hard to allocate hugepages or the vma is advised to explicitly want to try
    hard for hugepages that remote allocation is better when local allocation
    and memory compaction have both failed.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

    Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
    allocations"") is reverted in this series, it is better to restore the
    previous 5.2 behavior between the thp allocation and the page allocator
    rather than to attempt any consolidation or cleanup for a policy that is
    now reverted. It's less risky during an rc cycle and subsequent patches
    in this series further modify the same policy that the pre-5.3 behavior
    implements.

    Consolidation and cleanup can be done subsequent to a sane default page
    allocation strategy, so this patch reverts a cleanup done on a strategy
    that is now reverted and thus is the least risky option.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

    The commit references the original intended semantic for MADV_HUGEPAGE
    which has subsequently taken on three unique purposes:

    - enables or disables thp for a range of memory depending on the system's
    config (is thp "enabled" set to "always" or "madvise"),

    - determines the synchronous compaction behavior for thp allocations at
    fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
    and

    - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
    clear previous hugepage advice).

    These are the three purposes that currently exist in 5.2 and over the
    past several years that userspace has been written around. Adding a
    NUMA locality preference adds a fourth dimension to an already conflated
    advice mode.

    Based on the semantic that MADV_HUGEPAGE has provided over the past
    several years, there exist workloads that use the tunable based on these
    principles: specifically that the allocation should attempt to
    defragment a local node before falling back. It is agreed that remote
    hugepages typically (but not always) have a better access latency than
    remote native pages, although on Naples this is at parity for
    intersocket.

    The revert commit that this patch reverts allows hugepage allocation to
    immediately allocate remotely when local memory is fragmented. This is
    contrary to the semantic of MADV_HUGEPAGE over the past several years:
    that is, memory compaction should be attempted locally before falling
    back.

    The performance degradation of remote hugepages over local hugepages on
    Rome, for example, is 53.5% increased access latency. For this reason,
    the goal is to revert back to the 5.2 and previous behavior that would
    attempt local defragmentation before falling back. With the patch that
    is reverted by this patch, we see performance degradations at the tail
    because the allocator happily allocates the remote hugepage rather than
    even attempting to make a local hugepage available.

    zone_reclaim_mode is not a solution to this problem since it does not
    only impact hugepage allocations but rather changes the memory
    allocation strategy for *all* page allocations.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

26 Sep, 2019

1 commit

  • This patch is a part of a series that extends kernel ABI to allow to pass
    tagged user pointers (with the top byte set to something other than 0x00)
    as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

25 Sep, 2019

1 commit

  • 1) task_nodes = cpuset_mems_allowed(current);
    -> cpuset_mems_allowed() is guaranteed to return some non-empty
    subset of node_states[N_MEMORY].

    2) nodes_and(*new, *new, task_nodes);
    -> after nodes_and(), 'new' should be empty or an appropriate
    nodemask (online nodes with memory).

    After 1) and 2), we can remove the unnecessary check of whether 'new'
    AND node_states[N_MEMORY] is empty.

    Link: http://lkml.kernel.org/r/20190806023634.55356-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
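
    After the split, a user of the page walker supplies a const operations
    table and no longer fills in a struct mm_walk itself; roughly (names as
    used by mm/mempolicy.c):

    static const struct mm_walk_ops queue_pages_walk_ops = {
            .hugetlb_entry  = queue_pages_hugetlb,
            .pmd_entry      = queue_pages_pte_range,
            .test_walk      = queue_pages_test_walk,
    };

    /* struct mm_walk is now built internally by walk_page_range() */
    err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);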

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig