05 Apr, 2020

1 commit

  • Pull drm hugepage support from Dave Airlie:
    "This adds support for hugepages to TTM and has been tested with the
    vmwgfx drivers, though I expect other drivers to start using it"

    * tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm:
    drm/vmwgfx: Hook up the helpers to align buffer objects
    drm/vmwgfx: Introduce a huge page aligning TTM range manager
    drm: Add a drm_get_unmapped_area() helper
    drm/vmwgfx: Support huge page faults
    drm/ttm, drm/vmwgfx: Support huge TTM pagefaults
    mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entries
    mm: Split huge pages on write-notify or COW
    mm: Introduce vma_is_special_huge
    fs: Constify vma argument to vma_is_dax

    Linus Torvalds
     

04 Apr, 2020

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Christian extended clone3 so that processes can be spawned into
    cgroups directly (a usage sketch follows this summary).

    This is not only neat in terms of semantics but also avoids grabbing
    the global cgroup_threadgroup_rwsem for migration.

    - Daniel added !root xattr support to cgroupfs.

    Userland already uses xattrs on cgroupfs for bookkeeping. This will
    allow delegated cgroups to support such usages.

    - Prateek tried to make cpuset hotplug handling synchronous but that
    led to possible deadlock scenarios. Reverted.

    - Other minor changes including release_agent_path handling cleanup.
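
    The clone3() change is user-visible; below is a minimal, hedged sketch of
    how it can be used, assuming a 5.7+ kernel, a cgroup2 hierarchy mounted at
    /sys/fs/cgroup, and a pre-created "demo" cgroup (both hypothetical here).
    The flag value, syscall number and struct layout are defined locally in
    case installed headers predate the feature.

        /* Sketch: spawn a child directly into a cgroup via clone3(). */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <signal.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #ifndef __NR_clone3
        #define __NR_clone3 435                    /* x86_64 */
        #endif
        #ifndef CLONE_INTO_CGROUP
        #define CLONE_INTO_CGROUP 0x200000000ULL
        #endif

        struct my_clone_args {                     /* matches struct clone_args (v2) */
            uint64_t flags, pidfd, child_tid, parent_tid, exit_signal;
            uint64_t stack, stack_size, tls, set_tid, set_tid_size, cgroup;
        };

        int main(void)
        {
            int cgfd = open("/sys/fs/cgroup/demo", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
            if (cgfd < 0) { perror("open cgroup"); return 1; }

            struct my_clone_args args = {
                .flags       = CLONE_INTO_CGROUP,
                .exit_signal = SIGCHLD,
                .cgroup      = (uint64_t)cgfd,
            };

            long pid = syscall(__NR_clone3, &args, sizeof(args));
            if (pid < 0) { perror("clone3"); return 1; }
            if (pid == 0) {                        /* child is already in the cgroup */
                execlp("cat", "cat", "/proc/self/cgroup", (char *)NULL);
                _exit(127);
            }
            waitpid(pid, NULL, 0);
            return 0;
        }

    Because the child starts life in the target cgroup, no follow-up write to
    cgroup.procs (and hence no cgroup_threadgroup_rwsem migration) is needed.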

    * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: Document the cpuset_v2_mode mount option
    Revert "cpuset: Make cpuset hotplug synchronous"
    cgroupfs: Support user xattrs
    kernfs: Add option to enable user xattrs
    kernfs: Add removed_size out param for simple_xattr_set
    kernfs: kvmalloc xattr value instead of kmalloc
    cgroup: Restructure release_agent_path handling
    selftests/cgroup: add tests for cloning into cgroups
    clone3: allow spawning processes into cgroups
    cgroup: add cgroup_may_write() helper
    cgroup: refactor fork helpers
    cgroup: add cgroup_get_from_file() helper
    cgroup: unify attach permission checking
    cpuset: Make cpuset hotplug synchronous
    cgroup.c: Use built-in RCU list checking
    kselftest/cgroup: add cgroup destruction test
    cgroup: Clean up css_set task traversal

    Linus Torvalds
     

03 Apr, 2020

38 commits

  • Huge page-table entries for TTM

    In order to reduce CPU usage [1] and, in theory, TLB misses, this patchset
    enables huge and giant page-table entries for TTM and TTM-enabled graphics
    drivers.

    Signed-off-by: Dave Airlie
    From: Thomas Hellstrom (VMware)
    Link: https://patchwork.freedesktop.org/patch/msgid/20200325073102.6129-1-thomas_os@shipmail.org

    Dave Airlie
     
  • Pull percpu updates from Dennis Zhou:
    "This is just a few documentation fixes for percpu refcount and bitmap
    helpers that went in v5.6, and moving my emails to all be at korg"

    * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: update copyright emails to dennis@kernel.org
    include/bitmap.h: add new functions to documentation
    include/bitmap.h: add missing parameter in docs
    percpu_ref: Fix comment regarding percpu_ref_init flags

    Linus Torvalds
     
  • Merge updates from Andrew Morton:
    "A large amount of MM, plenty more to come.

    Subsystems affected by this patch series:
    - tools
    - kthread
    - kbuild
    - scripts
    - ocfs2
    - vfs
    - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
    sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
    hugetlbfs, hugetlb"

    * emailed patches from Andrew Morton: (155 commits)
    include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
    mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
    selftests/vm: fix map_hugetlb length used for testing read and write
    mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
    mm/hugetlb.c: clean code by removing unnecessary initialization
    hugetlb_cgroup: add hugetlb_cgroup reservation docs
    hugetlb_cgroup: add hugetlb_cgroup reservation tests
    hugetlb: support file_region coalescing again
    hugetlb_cgroup: support noreserve mappings
    hugetlb_cgroup: add accounting for shared mappings
    hugetlb: disable region_add file_region coalescing
    hugetlb_cgroup: add reservation accounting for private mappings
    mm/hugetlb_cgroup: fix hugetlb_cgroup migration
    hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
    hugetlb_cgroup: add hugetlb_cgroup reservation counter
    hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    mm/memblock.c: remove redundant assignment to variable max_addr
    mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
    mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
    ...

    Linus Torvalds
     
  • Pull exec/proc updates from Eric Biederman:
    "This contains two significant pieces of work: the work to sort out
    proc_flush_task, and the work to solve a deadlock between strace and
    exec.

    Fixing proc_flush_task so that it no longer requires a persistent
    mount makes improvements to proc possible. The removal of the
    persistent mount solves an old regression that caused the hidepid
    mount option to only work on remount not on mount. The regression was
    found and reported by the Android folks. This further allows Alexey
    Gladkov's work making proc mount options specific to an individual
    mount of proc to move forward.

    The work on exec starts solving a long-standing issue with exec, namely
    that it takes mutexes of blocking userspace applications, which makes exec
    extremely deadlock prone. For the moment this adds a second mutex with
    a narrower scope that handles all of the easy cases. Which makes the
    tricky cases easy to spot. With a little luck the code to solve those
    deadlocks will be ready by next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
    signal: Extend exec_id to 64bits
    pidfd: Use new infrastructure to fix deadlocks in execve
    perf: Use new infrastructure to fix deadlocks in execve
    proc: io_accounting: Use new infrastructure to fix deadlocks in execve
    proc: Use new infrastructure to fix deadlocks in execve
    kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
    kernel: doc: remove outdated comment cred.c
    mm: docs: Fix a comment in process_vm_rw_core
    selftests/ptrace: add test cases for dead-locks
    exec: Fix a deadlock in strace
    exec: Add exec_update_mutex to replace cred_guard_mutex
    exec: Move exec_mmap right after de_thread in flush_old_exec
    exec: Move cleanup of posix timers on exec out of de_thread
    exec: Factor unshare_sighand out of de_thread and call it separately
    exec: Only compute current once in flush_old_exec
    pid: Improve the comment about waiting in zap_pid_ns_processes
    proc: Remove the now unnecessary internal mount of proc
    uml: Create a private mount of proc for mconsole
    uml: Don't consult current to find the proc_mnt in mconsole_proc
    proc: Use a list of inodes to flush from proc
    ...

    Linus Torvalds
     
  • Commit f1e61557f023 ("mm: pack compound_dtor and compound_order into one
    word in struct page") changed compound_dtor from a pointer to an array
    index in order to pack it. To check if a page has the hugetlbfs
    compound_dtor, we can just compare the index directly without fetching the
    function pointer. Said commit did that with PageHuge() and we can do the
    same with PageHeadHuge() to make the code a bit smaller and faster.
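
    A schematic illustration of the idea, with made-up names rather than the
    kernel's actual definitions: once the destructor is identified by a small
    array index, callers can compare the index itself instead of loading the
    function pointer from the table and comparing pointers.

        #include <stdbool.h>
        #include <stdio.h>

        enum dtor_id { NULL_DTOR, COMPOUND_DTOR, HUGETLB_DTOR };   /* hypothetical */

        static void free_compound(void) { }
        static void free_hugetlb(void)  { }

        static void (*const dtor_table[])(void) = {
            [COMPOUND_DTOR] = free_compound,
            [HUGETLB_DTOR]  = free_hugetlb,
        };

        struct fake_page { unsigned char dtor; };   /* stand-in for compound_dtor */

        /* Before: fetch the function pointer, then compare pointers. */
        static bool is_hugetlb_slow(const struct fake_page *p)
        {
            return dtor_table[p->dtor] == free_hugetlb;
        }

        /* After: compare the array index directly -- smaller and faster. */
        static bool is_hugetlb_fast(const struct fake_page *p)
        {
            return p->dtor == HUGETLB_DTOR;
        }

        int main(void)
        {
            struct fake_page pg = { .dtor = HUGETLB_DTOR };
            printf("%d %d\n", is_hugetlb_slow(&pg), is_hugetlb_fast(&pg));
            return 0;
        }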

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Acked-by: Kirill A. Shutemov
    Cc: Neha Agarwal
    Link: http://lkml.kernel.org/r/20200311172440.6988-1-vbabka@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Previously the variable 'check_addr' was initialized, but was not read
    before being reassigned, so the initialization can be removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • An earlier patch in this series disabled file_region coalescing in order
    to hang the hugetlb_cgroup uncharge info on the file_region entries.

    This patch re-adds support for coalescing of file_region entries.
    Essentially, every time we add an entry, we call a recursive function that
    tries to coalesce the added region with the regions next to it. The worst
    case call depth for this function is 3: one call to coalesce with the next
    region, one to coalesce with the previous region, and one to reach the
    base case.

    This is an important performance optimization as private mappings add
    their entries page by page, and we could incur big performance costs for
    large mappings with lots of file_region entries in their resv_map.
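
    An illustrative sketch of that coalescing step (simplified, not the
    kernel's file_region code): after inserting a [from, to) entry into a
    sorted array, merge it with the region after it and then recurse to merge
    with the region before it, bottoming out once no neighbour touches it.

        #include <stdio.h>

        struct region { long from, to; };

        /* Coalesce regions[idx] with its neighbours; returns the new count. */
        static int coalesce(struct region *r, int n, int idx)
        {
            if (idx + 1 < n && r[idx].to == r[idx + 1].from) {
                r[idx].to = r[idx + 1].to;
                for (int i = idx + 1; i + 1 < n; i++)   /* close the gap */
                    r[i] = r[i + 1];
                n--;
            }
            if (idx > 0 && r[idx - 1].to == r[idx].from)
                return coalesce(r, n, idx - 1);         /* try the previous region */
            return n;
        }

        int main(void)
        {
            struct region r[] = { {0, 1}, {1, 2}, {2, 3} };  /* [1,2) just added */
            int n = coalesce(r, 3, 1);
            for (int i = 0; i < n; i++)
                printf("[%ld->%ld] ", r[i].from, r[i].to);
            printf("\n");                                    /* prints: [0->3] */
            return 0;
        }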

    [almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs]
    Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com
    [almasrymina@google.com: remove check_coalesce_bug debug code]
    Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Cc: Randy Dunlap
    Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Support MAP_NORESERVE accounting as part of the new counter.

    For each hugepage allocation, at allocation time we check if there is a
    reservation for this allocation or not. If there is a reservation for
    this allocation, then this allocation was charged at reservation time, and
    we don't re-account it. If there is no reservation for this allocation,
    we charge the appropriate hugetlb_cgroup.

    The hugetlb_cgroup to uncharge for this allocation is stored in
    page[3].private. We use new APIs added in an earlier patch to set this
    pointer.
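
    For context, a hedged user-space sketch of the kind of mapping this
    covers, assuming 2MB hugepages are configured: a MAP_NORESERVE hugetlb
    mapping makes no reservation up front, so per the description above the
    hugetlb_cgroup charge happens when the page is actually faulted in.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 2 << 20;            /* one 2MB hugepage (assumed size) */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
                           -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            memset(p, 0, len);               /* the fault here is what gets charged */
            munmap(p, len);
            return 0;
        }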

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
    in the resv_map entries, in file_region->reservation_counter.

    After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
    if successful, we pass on the hugetlb_cgroup info to a follow up
    region_add call. When a file_region entry is added to the resv_map via
    region_add, we put the pointer to that cgroup in
    file_region->reservation_counter. If charging doesn't succeed, we report
    the error to the caller, so that the kernel fails the reservation.

    On region_del, which is when the hugetlb memory is unreserved, we also
    uncharge the file_region->reservation_counter.

    [akpm@linux-foundation.org: forward declare struct file_region]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
    A follow-up patch in this series adds hugetlb cgroup uncharge info to the
    file_region entries in resv->regions. The cgroup uncharge info may differ
    for different regions, so they can no longer be coalesced at region_add
    time. So, disable region coalescing in region_add in this patch.

    Behavior change:

    Say a resv_map exists like this [0->1], [2->3], and [5->6].

    Then a region_chg/add call comes in region_chg/add(f=0, t=5).

    Old code would generate resv->regions: [0->5], [5->6].
    New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
    [5->6].

    Special care needs to be taken to handle the resv->adds_in_progress
    variable correctly. In the past, only 1 region would be added for every
    region_chg and region_add call. But now, each call may add multiple
    regions, so we can no longer increment adds_in_progress by 1 in
    region_chg, or decrement adds_in_progress by 1 after region_add or
    region_abort. Instead, region_chg calls add_reservation_in_range() to
    count the number of regions needed and allocates those, and that info is
    passed to region_add and region_abort to decrement adds_in_progress
    correctly.

    We've also modified the assumption that region_add after region_chg never
    fails. region_chg now pre-allocates at least 1 region for region_add. If
    region_add needs more regions than region_chg has allocated for it, then
    it may fail.

    [almasrymina@google.com: fix file_region entry allocations]
    Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Cc: Greg Thelen
    Cc: Miguel Ojeda
    Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Normally the pointer to the cgroup to uncharge hangs off the struct page,
    and gets queried when it's time to free the page. With hugetlb_cgroup
    reservations, this is not possible, because a page may be reserved by one
    task and actually faulted in by another task.

    The best place to put the hugetlb_cgroup pointer to uncharge for
    reservations is in the resv_map. But, because the resv_map has different
    semantics for private and shared mappings, the code path to
    charge/uncharge shared and private mappings is different. This patch
    implements charging and uncharging for private mappings.

    For private mappings, the counter to uncharge is in
    resv_map->reservation_counter. On initializing the resv_map this is set
    to NULL. On reservation of a region in a private mapping, the task's
    hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
    resv_map->reservation_counter.

    On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

    [akpm@linux-foundation.org: forward declare struct resv_map]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
    hugetlb reservations") mistakenly doesn't handle the migration of *both*
    the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.

    What should happen is that both cgroups should be queried from the old
    page, then both set to NULL on the old page, then both inserted into the
    new page.

    The mistake also creates the following warning:

    mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
    mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
    [-Wunused-but-set-variable]
    struct hugetlb_cgroup *h_cg;
    ^~~~

    The solution is to add the missing steps, namely setting the reservation
    hugetlb_cgroup to NULL on the old page, and setting the fault
    hugetlb_cgroup on the new page.

    Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Qian Cai
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
    or hugetlb reservation counter.

    Adds a new interface to uncharge a hugetlb_cgroup counter via
    hugetlb_cgroup_uncharge_counter.

    Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
    hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • These counters will track hugetlb reservations rather than hugetlb memory
    faulted in. This patch only adds the counter, following patches add the
    charging and uncharging of the counter.

    This is patch 1 of a 9-patch series.

    Problem:

    Currently tasks attempting to reserve more hugetlb memory than is
    available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
    Reservations [1]. However, if a task attempts to reserve more hugetlb
    memory than its hugetlb_cgroup limit allows, the kernel will allow the
    mmap/shmget call, but will SIGBUS the task when it attempts to fault in
    the excess memory.

    We have users hitting their hugetlb_cgroup limits and thus we've been
    looking at this failure mode. We'd like to improve this behavior such
    that users violating the hugetlb_cgroup limits get an error at mmap/shmget
    time, rather than getting SIGBUS'd when they try to fault the excess
    memory in. This gives the user an opportunity to fall back more gracefully
    to non-hugetlbfs memory, for example.

    The underlying problem is that today's hugetlb_cgroup accounting happens
    at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
    enforcing the hugetlb_cgroup limit only happens at fault time, and the
    offending task gets SIGBUS'd.

    Proposed Solution:

    A new page counter named
    'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
    slightly different semantics than
    'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

    - While usage_in_bytes tracks all *faulted* hugetlb memory,
    rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
    memory faulted in without a prior reservation.

    - If a task attempts to reserve more memory than limit_in_bytes allows,
    the kernel will allow it to do so. But if a task attempts to reserve
    more memory than rsvd.limit_in_bytes, the kernel will fail this
    reservation.

    This proposal is implemented in this patch series, with tests to verify
    functionality and show the usage.

    Alternatives considered:

    1. A new cgroup, instead of only a new page_counter attached to the
    existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
    duplication with hugetlb_cgroup. Keeping hugetlb related page counters
    under hugetlb_cgroup seemed cleaner as well.

    2. Instead of adding a new counter, we considered adding a sysctl that
    modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
    accounting at reservation time rather than fault time. Adding a new
    page_counter seems better as userspace could, if it wants, choose to
    enforce different cgroups differently: one via limit_in_bytes, and
    another via rsvd.limit_in_bytes. This could be very useful if you're
    transitioning how hugetlb memory is partitioned on your system one
    cgroup at a time, for example. Also, someone may find usage for both
    limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
    gives them the option to do so.

    Testing:
    - Added tests passing.
    - Used libhugetlbfs for regression testing.

    [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
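
    A hedged user-space sketch of the intended behaviour, assuming a cgroup-v1
    hugetlb controller mounted at /sys/fs/cgroup/hugetlb, a pre-created "demo"
    group that the calling task is already attached to, and 2MB hugepages; the
    file name follows the hugetlb.<size>.rsvd.* pattern described above.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            /* Allow only one 2MB reservation in the (hypothetical) cgroup. */
            int fd = open("/sys/fs/cgroup/hugetlb/demo/hugetlb.2MB.rsvd.limit_in_bytes",
                          O_WRONLY);
            if (fd >= 0) {
                write(fd, "2097152", 7);
                close(fd);
            }

            /* The first 2MB reservation fits the limit... */
            void *ok = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            /* ...a second one exceeds rsvd.limit_in_bytes, so with this series
             * mmap() itself fails instead of SIGBUS arriving at fault time. */
            void *over = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            printf("first: %p, second: %p\n", ok, over);
            return 0;
        }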

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Shuah Khan
    Cc: Shakeel Butt
    Cc: Greg Thelen
    Cc: Sandipan Das
    Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • hugetlbfs page faults can race with truncate and hole punch operations.
    Current code in the page fault path attempts to handle this by 'backing
    out' operations if we encounter the race. One obvious omission in the
    current code is removing a page newly added to the page cache. This is
    pretty straightforward to address, but there is a more subtle and
    difficult issue of backing out hugetlb reservations. To handle this
    correctly, the 'reservation state' before page allocation needs to be
    noted so that it can be properly backed out. There are four distinct
    possibilities for reservation state: shared/reserved, shared/no-resv,
    private/reserved and private/no-resv. Backing out a reservation may
    require memory allocation which could fail so that needs to be taken
    into account as well.

    Instead of writing the required complicated code for this rare
    occurrence, just eliminate the race. i_mmap_rwsem is now held in read
    mode for the duration of page fault processing. Hold i_mmap_rwsem in
    write mode when modifying i_size. In this way, truncation can not
    proceed when page faults are being processed. In addition, i_size
    will not change during fault processing so a single check can be made
    to ensure faults are not beyond (proposed) end of file. Faults can
    still race with hole punch, but that race is handled by existing code
    and the use of hugetlb_fault_mutex.

    With this modification, checks for races with truncation in the page
    fault path can be simplified and removed. remove_inode_hugepages no
    longer needs to take hugetlb_fault_mutex in the case of truncation.
    Comments are expanded to explain reasoning behind locking.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K . V"
    Cc: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

    While discussing the issue with huge_pte_offset [1], I remembered that
    there were more outstanding hugetlb races. These issues are:

    1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
    invalid via a call to huge_pmd_unshare by another thread.
    2) hugetlbfs page faults can race with truncation causing invalid global
    reserve counts and state.

    A previous attempt was made to use i_mmap_rwsem in this manner as
    described at [2]. However, those patches were reverted starting with [3]
    due to locking issues.

    To effectively use i_mmap_rwsem to address the above issues it needs to be
    held (in read mode) during page fault processing. However, during fault
    processing we need to lock the page we will be adding. Lock ordering
    requires we take page lock before i_mmap_rwsem. Waiting until after
    taking the page lock is too late in the fault process for the
    synchronization we want to do.

    To address this lock ordering issue, the following patches change the lock
    ordering for hugetlb pages. This is not too invasive as hugetlbfs
    processing is done separately from core mm in many places. However, I don't
    really like this idea. Much ugliness is contained in the new routine
    hugetlb_page_mapping_lock_write() of patch 1.

    The only other way I can think of to address these issues is by catching
    all the races. After catching a race, cleanup, backout, retry ... etc,
    as needed. This can get really ugly, especially for huge page
    reservations. At one time, I started writing some of the reservation
    backout code for page faults and it got so ugly and complicated I went
    down the path of adding synchronization to avoid the races. Any other
    suggestions would be welcome.

    [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
    [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
    [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
    [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

    This patch (of 2):

    While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:
    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with
    the ptep.
    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

    One problem with this scheme is that it requires taking i_mmap_rwsem
    before taking the page lock during page faults. This is not the order
    specified in the rest of mm code. Handling of hugetlbfs pages is mostly
    isolated today. Therefore, we use this alternative locking order for
    PageHuge() pages.

    mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
    page->flags PG_locked (lock_page)

    To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
    introduced to write lock the i_mmap_rwsem associated with a page.

    In most cases it is easy to get address_space via vma->vm_file->f_mapping.
    However, in the case of migration or memory errors for anon pages we do
    not have an associated vma. A new routine _get_hugetlb_page_mapping()
    will use anon_vma to get address_space in these cases.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The variable max_addr is being initialized with a value that is never read
    and it is being updated later with a new value. The initialization is
    redundant and can be removed.

    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Link: http://lkml.kernel.org/r/20200228235003.112718-1-colin.king@canonical.com
    Addresses-Coverity: ("Unused value")
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Using an empty (malformed) nodelist that is not caught during mount option
    parsing leads to a stack-out-of-bounds access.

    The option string that was used was: "mpol=prefer:,". However,
    MPOL_PREFERRED requires a single node number, which is not being provided
    here.

    Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
    nodeid.
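
    A small sketch of the user-visible trigger, assuming CAP_SYS_ADMIN and two
    hypothetical mount points: MPOL_PREFERRED needs a node id, so
    "mpol=prefer:0" is accepted while the malformed "mpol=prefer:," is now
    rejected cleanly at mount time instead of causing the out-of-bounds access.

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* Valid: prefer node 0. */
            if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "mpol=prefer:0") != 0)
                perror("mount with valid nodelist");

            /* Malformed: empty nodelist; with this fix the mount fails. */
            if (mount("tmpfs", "/mnt/tmp2", "tmpfs", 0, "mpol=prefer:,") != 0)
                perror("mount with empty nodelist (expected to fail)");
            return 0;
        }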

    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Reported-by: Entropy Moe
    Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Cc: Lee Schermerhorn
    Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    VM_BUG_ON() is already used by queue_pages_test_walk(); it is better to
    dump more debug information by using VM_BUG_ON_VMA() to help debugging.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: "Li Xinhai"
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    vma_migratable() is called to check if pages in a vma can be migrated
    before going ahead with further actions. Currently it is used in the
    following code paths:

    - task_numa_work
    - mbind
    - move_pages

    For hugetlb mapping, whether vma is migratable or not is determined by:
    - CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
    - arch_hugetlb_migration_supported

    Issue: the current code checks only CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION,
    which should not be used directly. (Note that the current code in
    vma_migratable() does not cause a failure or bug, because
    unmap_and_move_huge_page() will catch an unsupported hugepage and handle
    it properly.)

    This patch checks both factors via hugepage_migration_supported() to
    improve the code logic and robustness. It enables an early bail-out of the
    hugepage migration procedure, but because all architectures currently
    supporting hugepage migration support all page sizes, we would not see a
    performance gain with this patch applied.

    vma_migratable() is moved to mm/mempolicy.c, because the circular
    dependency between mempolicy.h and hugetlb.h makes defining it as inline
    infeasible.

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     
    MPOL_MF_STRICT is used in mbind() for two purposes:

    (1) MPOL_MF_STRICT is set alone, without MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL, to check if there is a misplaced page and return -EIO;

    (2) MPOL_MF_STRICT is set together with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL,
    to check if there is a misplaced page which failed to be isolated, or was
    isolated successfully but failed to be moved, and return -EIO.

    For non-hugepage mappings, (1) and (2) are implemented as expected. For
    hugepage mappings, (1) is not implemented, and in (2) the part about
    reporting -EIO when isolation fails is not implemented.

    This patch implements the missing parts for hugepage mappings. Benefits
    with it applied:

    - User space can apply the same code logic to handle mbind() on hugepage
    and non-hugepage mappings;

    - MPOL_MF_STRICT alone can be used reliably to check whether there are
    misplaced pages when binding a policy to an address range, especially a
    range which contains both hugepage and non-hugepage mappings.

    Analysis of potential impact to existing users:

    - If MPOL_MF_STRICT alone was previously used, hugetlb pages not
    following the memory policy would not cause an EIO error. After this
    change, hugetlb pages are treated like all other pages. If
    MPOL_MF_STRICT alone is used and hugetlb pages do not follow the memory
    policy, an EIO error will be returned.

    - For users who use MPOL_MF_STRICT with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL,
    the semantics when some pages could not be moved are not changed by this
    patch, because failing to isolate and failing to move have the same
    effect for users, so their existing code will not be impacted.

    In the mbind man page, the note that 'MPOL_MF_STRICT is ignored on huge
    page mappings' can be removed after this patch is applied.
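
    A minimal user-space sketch of the resulting semantics, assuming a NUMA
    machine, 2MB hugepages, and libnuma's <numaif.h> wrapper for mbind() (link
    with -lnuma): a misplaced hugetlb page is now reported via EIO just like
    any other page.

        #define _GNU_SOURCE
        #include <errno.h>
        #include <numaif.h>            /* mbind(), MPOL_BIND, MPOL_MF_STRICT */
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 2 << 20;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            memset(p, 0, len);                   /* fault the page in somewhere */

            unsigned long nodemask = 1UL << 1;   /* only allow node 1 */
            if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                      MPOL_MF_STRICT) != 0 && errno == EIO)
                printf("misplaced page reported with EIO, hugetlb included\n");
            return 0;
        }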

    Mike:

    : The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
    : and does not match documentation (as described above). The special
    : behavior for hugetlb pages ideally should have been removed when hugetlb
    : page migration was introduced. It is unlikely that anyone relies on
    : today's inconsistent behavior, and removing one more case of special
    : handling for hugetlb pages is a good thing.

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: linux-man
    Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     
  • Previously 0 was assigned to variable 'last_migrated_pfn'. But the
    variable is not read after that, so the assignment can be removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200318174509.15021-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages")
    it is allowed to examine mlocked pages and compact them by default. On
    -RT even minor page faults are problematic because it may take a few
    hundred microseconds to resolve them, and until then the task is blocked.

    Make compact_unevictable_allowed = 0 default and issue a warning on RT if
    it is changed.

    [bigeasy@linutronix.de: v5]
    Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
    Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
    Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • Dan reports:

    The patch 5e1f0f098b46: "mm, compaction: capture a page under direct
    compaction" from Mar 5, 2019, leads to the following Smatch complaint:

    mm/compaction.c:2321 compact_zone_order()
    error: we previously assumed 'capture' could be null (see line 2313)

    mm/compaction.c
    2288 static enum compact_result compact_zone_order(struct zone *zone, int order,
    2289 gfp_t gfp_mask, enum compact_priority prio,
    2290 unsigned int alloc_flags, int classzone_idx,
    2291 struct page **capture)
    ^^^^^^^

    2313 if (capture)
    ^^^^^^^
    Check for NULL

    2314 current->capture_control = &capc;
    2315
    2316 ret = compact_zone(&cc, &capc);
    2317
    2318 VM_BUG_ON(!list_empty(&cc.freepages));
    2319 VM_BUG_ON(!list_empty(&cc.migratepages));
    2320
    2321 *capture = capc.page;
    ^^^^^^^^
    Unchecked dereference.

    2322 current->capture_control = NULL;
    2323

    In practice this is not an issue, as the only caller path passes non-NULL
    capture:

    __alloc_pages_direct_compact()
    struct page *page = NULL;
    try_to_compact_pages(capture = &page);
    compact_zone_order(capture = capture);

    So let's remove the unnecessary check, which should also make Smatch happy.

    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Reported-by: Dan Carpenter
    Suggested-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.cz
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The code to implement THP migrations already exists, and the code for CMA
    to clear out a region of memory already exists.

    Only a few small tweaks are needed to allow CMA to move THP memory when
    attempting an allocation from alloc_contig_range.

    With these changes, migrating THPs from a CMA area works when allocating a
    1GB hugepage from CMA memory.

    [riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil]
    Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Patch series "fix THP migration for CMA allocations", v2.

    Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
    CMA memory blocks. Transparent huge pages also have most of the
    infrastructure in place to allow migration.

    However, a few pieces were missing, causing THP migration to fail when
    attempting to use CMA to allocate 1GB hugepages.

    With these patches in place, THP migration from CMA blocks seems to work,
    both for anonymous THPs and for tmpfs/shmem THPs.

    This patch (of 2):

    Add information to struct compact_control to indicate that the allocator
    would really like to clear out this specific part of memory, as used for
    example by CMA.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    The sc->memcg_low_skipped path resets skipped_deactivate to 0, but this
    is not needed, as this code path is never reachable with
    skipped_deactivate != 0 due to the previous sc->skipped_deactivate branch.

    [mhocko@kernel.org: rewrite changelog]
    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200319165938.23354-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • This gives some size improvement:

    $size mm/vmscan.o (before)
    text data bss dec hex filename
    53670 24123 12 77805 12fed mm/vmscan.o

    $size mm/vmscan.o (after)
    text data bss dec hex filename
    53648 24123 12 77783 12fd7 mm/vmscan.o

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/Message-ID:
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Previously 0 was assigned to variable 'lruvec_size', but the variable was
    never read later. So the assignment can be removed.

    Fixes: f87bccde6a7d ("mm/vmscan: remove unused lru_pages argument")
    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200229214022.11853-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • pgdat->kswapd_classzone_idx could be accessed concurrently in
    wakeup_kswapd(). Plain writes and reads without any lock protection
    result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
    well as saving a branch (compilers might well optimize the original code
    in an unintentional way anyway). While at it, also take care of
    pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The
    data races were reported by KCSAN,

    BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd

    write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
    wakeup_kswapd+0xf1/0x400
    wakeup_kswapd at mm/vmscan.c:3967
    wake_all_kswapds+0x59/0xc0
    wake_all_kswapds at mm/page_alloc.c:4241
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_slowpath at mm/page_alloc.c:4512
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    1 lock held by mtest01/7454:
    #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
    do_page_fault+0x143/0x6f9
    do_user_addr_fault at arch/x86/mm/fault.c:1405
    (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
    irq event stamp: 6944085
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
    wakeup_kswapd+0xc8/0x400
    wake_all_kswapds+0x59/0xc0
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    1 lock held by mtest01/7472:
    #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
    do_page_fault+0x143/0x6f9
    irq event stamp: 6793561
    count_memcg_event_mm+0x1a6/0x270
    count_memcg_event_mm+0x119/0x270
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    BUG: KCSAN: data-race in kswapd / wakeup_kswapd

    write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
    kswapd+0x27c/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
    wakeup_kswapd+0xf3/0x450
    wake_all_kswapds+0x59/0xc0
    __alloc_pages_slowpath+0xdcc/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x170/0x700
    __handle_mm_fault+0xc9f/0xd00
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40
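
    A simplified user-space illustration of the READ_ONCE()/WRITE_ONCE()
    pattern used in the fix; the macros below are stand-ins built on volatile
    accesses, not the kernel's full definitions, and the shared int merely
    mimics pgdat->kswapd_classzone_idx. Compile with -pthread.

        #include <pthread.h>
        #include <stdio.h>

        /* Stand-ins: force a single, non-cached access; no ordering implied. */
        #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

        static int shared_idx;                   /* mimics kswapd_classzone_idx */

        static void *writer(void *arg)
        {
            for (int i = 0; i < 1000000; i++)
                WRITE_ONCE(shared_idx, i);       /* a plain store is what KCSAN flags */
            return NULL;
        }

        static void *reader(void *arg)
        {
            int last = 0;
            for (int i = 0; i < 1000000; i++)
                last = READ_ONCE(shared_idx);    /* a plain load is what KCSAN flags */
            printf("last observed: %d\n", last);
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;
            pthread_create(&a, NULL, writer, NULL);
            pthread_create(&b, NULL, reader, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }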

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Marco Elver
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • kswapd kernel thread starts either with a CPU affinity set to the full cpu
    mask of its target node or without any affinity at all if the node is
    CPUless. There is a cpu hotplug callback (kswapd_cpu_online) that
    implements an elaborate way to update this mask when a cpu is onlined.

    It is not really clear whether there is any actual benefit from this
    scheme. Completely CPU-less NUMA nodes rarely gain a new CPU during
    runtime. Drop the code for that reason. If there is a real usecase then
    we can resurrect and simplify the code.

    [mhocko@suse.com rewrite changelog]

    Suggested-by: Michal Hocko
    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/20200218224422.3407-1-richardw.yang@linux.intel.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The commit 98fa15f34cb3 ("mm: replace all open encodings for
    NUMA_NO_NODE") did the replacement across the kernel tree, but we got
    some more in vmscan.c since then.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Acked-by: Minchan Kim
    Acked-by: David Rientjes
    Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Use mem_cgroup_is_root() API to check if memcg is root memcg instead of
    open coding.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Link: http://lkml.kernel.org/r/1581398649-125989-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    When kstrndup() fails, no memory has been allocated and we can exit
    directly.

    [david@redhat.com: reword changelog]
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: David Rientjes
    Link: http://lkml.kernel.org/r/1581398649-125989-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Simplify page_is_buddy() to reduce the redundant code for better code
    readability.

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Vlastimil Babka
    Acked-by: Pankaj Gupta
    Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
    Previously, if the branch condition was false, the assignment was not
    executed. The assignment can safely be executed even when the condition is
    false, as it assigns the value of 'nodemask' to 'ac.nodemask', which
    already holds the same value.

    Since the assignment can be executed unconditionally, the branch can be
    removed.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Use free_area_empty() API to replace list_empty() for better code
    readability.

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Link: http://lkml.kernel.org/r/1583674354-7713-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
  • This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int).

    Thanks to that, code like:

    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
    alloc_flags |= ALLOC_KSWAPD;

    can be changed to:

    alloc_flags |= (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);

    Thanks to this, one less branch is generated in the assembly.

    In the case of the ALLOC_KSWAPD flag, two branches are saved: the first in
    code that always executes at the beginning of page allocation, and the
    second in a loop in the page allocator slowpath.
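
    A generic, self-contained illustration of the trick with hypothetical flag
    names and values (not the kernel's): when the external and internal flags
    share the same bit value, the translation collapses from a conditional
    branch into a simple mask.

        #include <stdio.h>

        #define GFP_WANT_KSWAPD   0x400u            /* "external" flag (made up) */
        #define ALLOC_WANT_KSWAPD GFP_WANT_KSWAPD   /* "internal" flag, aliased  */

        static unsigned int translate_branchy(unsigned int gfp)
        {
            unsigned int alloc_flags = 0;
            if (gfp & GFP_WANT_KSWAPD)              /* costs a conditional branch */
                alloc_flags |= ALLOC_WANT_KSWAPD;
            return alloc_flags;
        }

        static unsigned int translate_branchless(unsigned int gfp)
        {
            return gfp & GFP_WANT_KSWAPD;           /* same result, no branch */
        }

        int main(void)
        {
            printf("%x %x\n", translate_branchy(0x400u),
                              translate_branchless(0x400u));
            return 0;
        }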

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek