Eric Lee / smarc-fsl-linux-kernel

25 Sep, 2019

3 commits

28eb3c808 shmem: fix obsolete comment in shmem_getpage_gfp() ... Browse Code »

Replace "fault_mm" with "vmf" in code comment because commit cfda05267f7b
("userfaultfd: shmem: add userfaultfd hook for shared memory faults") has
changed the prototpye of shmem_getpage_gfp() - pass vmf instead of
fault_mm to the function.

Before:
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);
After:
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp,
gfp_t gfp, struct vm_area_struct *vma,
struct vm_fault *vmf, vm_fault_t *fault_type);

Link: http://lkml.kernel.org/r/20190816100204.9781-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miles Chen
2019-09-25 06:54:12 +0800
4101196b1 mm: page cache: store only head pages in i_pages ... Browse Code »

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jan Kara
Reviewed-by: Kirill Shutemov
Reviewed-by: Song Liu
Tested-by: Song Liu
Tested-by: William Kucharski
Reviewed-by: William Kucharski
Tested-by: Qian Cai
Tested-by: Mikhail Gavrilov
Cc: Hugh Dickins
Cc: Chris Wilson
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox (Oracle)
2019-09-25 06:54:08 +0800
d8c6546b1 mm: introduce compound_nr() ... Browse Code »

Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: Andrew Morton
Reviewed-by: Ira Weiny
Acked-by: Kirill A. Shutemov
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox (Oracle)
2019-09-25 06:54:08 +0800

13 Sep, 2019

5 commits

f32356261 vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API ... Browse Code »

Convert the ramfs, shmem, tmpfs, devtmpfs and rootfs filesystems to the new
internal mount API as the old one will be obsoleted and removed. This
allows greater flexibility in communication of mount parameters between
userspace, the VFS and the filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Note that tmpfs is slightly tricky as it can contain embedded commas, so it
can't be trivially split up using strsep() to break on commas in
generic_parse_monolithic(). Instead, tmpfs has to supply its own generic
parser.

However, if tmpfs changes, then devtmpfs and rootfs, which are wrappers
around tmpfs or ramfs, must change too - and thus so must ramfs, so these
had to be converted also.

[AV: rewritten]

Signed-off-by: David Howells
cc: Hugh Dickins
cc: linux-mm@kvack.org
Signed-off-by: Al Viro

David Howells
2019-09-13 09:05:34 +0800
626c3920a shmem_parse_one(): switch to use of fs_parse() ... Browse Code »

This thing will eventually become our ->parse_param(), while
shmem_parse_options() - ->parse_monolithic(). At that point
shmem_parse_options() will start calling vfs_parse_fs_string(),
rather than calling shmem_parse_one() directly.

Signed-off-by: Al Viro

Al Viro
2019-09-13 09:01:32 +0800
e04dc423a shmem_parse_options(): take handling a single option into a helper ... Browse Code »

mechanical move.

Signed-off-by: Al Viro

Al Viro
2019-09-13 09:00:32 +0800
f6490b7fb shmem_parse_options(): don't bother with mpol in separate variable ... Browse Code »

just use ctx->mpol (note that callers always set ctx->mpol to NULL when
calling that).

Signed-off-by: Al Viro

Al Viro
2019-09-13 09:00:26 +0800
0b5071dd3 shmem_parse_options(): use a separate structure to keep the results ... Browse Code »

... and copy the data from it into sbinfo in the callers.
For use by remount we need to keep track whether there'd
been options setting max_inodes, max_blocks and huge resp.
and do the sanity checks (and copying) only if such options
had been seen. uid/gid/mode is ignored by remount and
NULL mpol is already explicitly treated as "ignore it",
so we don't need to keep track of those.

Note: theoretically, mpol_parse_string() may return NULL
not in case of error (for default policy), so the assumption
that NULL mpol means "change nothing" is incorrect. However,
that's the mainline behaviour and any changes belong in
a separate patch. If we go for that, we'll need to keep
track of having encountered mpol= option too.

[changes in remount logics from Hugh Dickins folded]

Signed-off-by: Al Viro

Al Viro
2019-09-13 08:59:14 +0800

06 Sep, 2019

1 commit

7e30d2a5e make shmem_fill_super() static ... Browse Code »

... have callers use shmem_mount()

Signed-off-by: Al Viro

Al Viro
2019-09-06 02:34:28 +0800

14 Aug, 2019

1 commit

92717d429 Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" ... Browse Code »

Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot. We can safely re-apply them now that we had time
to analyze the problem.

The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.

The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further. The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.

This patch (of 2):

This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").

Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden. There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.

This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.

[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Rientjes
Cc: Zi Yan
Cc: Stefan Priebe - Profihost AG
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2019-08-14 07:06:52 +0800

20 Jul, 2019

1 commit

933a90bf4 Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs mount updates from Al Viro:
"The first part of mount updates.

Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...

Linus Torvalds
2019-07-20 01:42:02 +0800

19 Jul, 2019

1 commit

c06306696 mm: thp: fix false negative of shmem vma's THP eligibility ... Browse Code »

Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:

7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size: 4096 kB
...
[snip]
...
ShmemPmdMapped: 4096 kB
...
[snip]
...
THPeligible: 0

And, /proc/meminfo does show THP allocated and PMD mapped too:

ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB

This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.

Also check if vma is suitable for THP by calling
transhuge_vma_suitable().

And minor fix to smaps output format and documentation.

Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
Signed-off-by: Yang Shi
Acked-by: Hugh Dickins
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: David Rientjes
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yang Shi
2019-07-19 08:08:06 +0800

17 Jul, 2019

1 commit

e5f2249ab mm/shmem.c: fix unused shmem_parse_huge() function warning ... Browse Code »

When CONFIG_SYSFS is disabled but CONFIG_TMPFS is enabled, we get a
warning about shmem_parse_huge() never being called:

mm/shmem.c:417:12: error: unused function 'shmem_parse_huge' [-Werror,-Wunused-function]
static int shmem_parse_huge(const char *str)

Change the #ifdef so we no longer build this function in that configuration.

Link: http://lkml.kernel.org/r/20190712091141.673355-1-arnd@arndb.de
Fixes: 144df3b288c4 ("vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API")
Signed-off-by: Arnd Bergmann
Cc: Hugh Dickins
Cc: Arnd Bergmann
Cc: David Howells
Cc: Al Viro
Cc: Matthew Wilcox
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Vineeth Remanan Pillai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2019-07-17 10:23:21 +0800

06 Jul, 2019

1 commit

69bf4b6b5 Revert "mm: page cache: store only head pages in i_pages" ... Browse Code »

This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
anon
flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
page dumped because: VM_BUG_ON_PAGE(entry != page)
page->mem_cgroup:ffff978433ace000
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:170!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
RIP: 0010:__delete_from_swap_cache+0x20d/0x240
Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
Call Trace:
delete_from_swap_cache+0x46/0xa0
try_to_free_swap+0xbc/0x110
swap_writepage+0x13/0x70
pageout.isra.0+0x13c/0x350
shrink_page_list+0xc14/0xdf0
shrink_inactive_list+0x1e5/0x3c0
shrink_node_memcg+0x202/0x760
shrink_node+0xe0/0x470
balance_pgdat+0x2d1/0x510
kswapd+0x220/0x420
kthread+0xfb/0x130
ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens. It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov
Suggested-by: Jan Kara
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Matthew Wilcox
Cc: Kirill Shutemov
Cc: William Kucharski
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Torvalds
2019-07-06 10:55:18 +0800

05 Jul, 2019

1 commit

037f11b47 mnt_init(): call shmem_init() unconditionally ... Browse Code »

No point having two call sites (earlier in init_rootfs() from
mnt_init() in case we are going to use shmem-style rootfs,
later from do_basic_setup() unconditionally), along with the
logics in shmem_init() itself to make the second call a no-op...

Signed-off-by: Al Viro

Al Viro
2019-07-05 10:01:59 +0800

15 May, 2019

1 commit

5fd4ca2d8 mm: page cache: store only head pages in i_pages ... Browse Code »

Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

[willy@infradead.org: fix swapcache pages]
Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
[kirill@shutemov.name: hugetlb stores pages in page cache differently]
Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jan Kara
Reviewed-by: Kirill Shutemov
Reviewed-and-tested-by: Song Liu
Tested-by: William Kucharski
Reviewed-by: William Kucharski
Tested-by: Qian Cai
Cc: Hugh Dickins
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox
2019-05-15 00:47:45 +0800

08 May, 2019

1 commit

168e153d5 Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs inode freeing updates from Al Viro:
"Introduction of separate method for RCU-delayed part of
->destroy_inode() (if any).

Pretty much as posted, except that destroy_inode() stashes
->free_inode into the victim (anon-unioned with ->i_fops) before
scheduling i_callback() and the last two patches (sockfs conversion
and folding struct socket_wq into struct socket) are excluded - that
pair should go through netdev once davem reopens his tree"

* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
orangefs: make use of ->free_inode()
shmem: make use of ->free_inode()
hugetlb: make use of ->free_inode()
overlayfs: make use of ->free_inode()
jfs: switch to ->free_inode()
fuse: switch to ->free_inode()
ext4: make use of ->free_inode()
ecryptfs: make use of ->free_inode()
ceph: use ->free_inode()
btrfs: use ->free_inode()
afs: switch to use of ->free_inode()
dax: make use of ->free_inode()
ntfs: switch to ->free_inode()
securityfs: switch to ->free_inode()
apparmor: switch to ->free_inode()
rpcpipe: switch to ->free_inode()
bpf: switch to ->free_inode()
mqueue: switch to ->free_inode()
ufs: switch to ->free_inode()
coda: switch to ->free_inode()
...

Linus Torvalds
2019-05-08 01:57:05 +0800

02 May, 2019

1 commit

74b1da564 shmem: make use of ->free_inode() ... Browse Code »

same situation as for hugetlbfs

Signed-off-by: Al Viro

Al Viro
2019-05-02 10:43:27 +0800

20 Apr, 2019

2 commits

af53d3e9e mm: swapoff: shmem_unuse() stop eviction without igrab() ... Browse Code »

The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().

Instead, shmem_unuse() just raise a "stop_eviction" count in the shmem-
specific inode, and shmem_evict_inode() wait for that to go down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to use for others later (huge tmpfs patches expect to use it).

That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.

[hughd@google.com: remove incorrect list_del()]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins
Cc: "Alex Xu (Hello71)"
Cc: Huang Ying
Cc: Kelley Nielsen
Cc: Konstantin Khlebnikov
Cc: Rik van Riel
Cc: Vineeth Pillai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2019-04-20 00:46:04 +0800
870395465 mm: swapoff: shmem_find_swap_entries() filter out other types ... Browse Code »

Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
then forgotten from shmem_find_swap_entries(): with the result that
removing one swapfile would try to free up all the swap from shmem - no
problem when only one swapfile anyway, but counter-productive when more,
causing swapoff to be unnecessarily OOM-killed when it should succeed.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins
Cc: Konstantin Khlebnikov
Cc: "Alex Xu (Hello71)"
Cc: Vineeth Pillai
Cc: Kelley Nielsen
Cc: Rik van Riel
Cc: Huang Ying
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2019-04-20 00:46:04 +0800

06 Mar, 2019

3 commits

ab3948f58 mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd ... Browse Code »

Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging
drivers are also not ABI and generally can be removed at anytime.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active. This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer. See CursorWindow
documentation in Android for more details:

https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.

A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same
behavior of the seal. This solves several side-effects pointed by Andy.
self-tests are provided in later patch to verify the expected semantics.

[1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/

Thanks a lot to Andy for suggestions to improve code.

Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google)
Acked-by: John Stultz
Cc: Andy Lutomirski
Cc: Minchan Kim
Cc: Jann Horn
Cc: Al Viro
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: J. Bruce Fields
Cc: Jeff Layton
Cc: Marc-Andr Lureau
Cc: Matthew Wilcox
Cc: Mike Kravetz
Cc: Shuah Khan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joel Fernandes (Google)
2019-03-06 13:07:19 +0800
b56a2d8af mm: rid swapoff of quadratic complexity ... Browse Code »

This patch was initially posted by Kelley Nielsen. Reposting the patch
with all review comments addressed and with minor modifications and
optimizations. Also, folding in the fixes offered by Hugh Dickins and
Huang Ying. Tests were rerun and commit message updated with new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear. It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables. It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about 9.0K calls(3min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.

Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai
Signed-off-by: Kelley Nielsen
Signed-off-by: Huang Ying
Acked-by: Hugh Dickins
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vineeth Remanan Pillai
2019-03-06 13:07:18 +0800
c5bf121e4 mm: refactor swap-in logic out of shmem_getpage_gfp ... Browse Code »

swapin logic can be reused independently without rest of the logic in
shmem_getpage_gfp. So lets refactor it out as an independent function.

Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai
Reviewed-by: Andrew Morton
Cc: Huang Ying
Cc: Hugh Dickins
Cc: Kelley Nielsen
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vineeth Remanan Pillai
2019-03-06 13:07:18 +0800

26 Feb, 2019

1 commit

29b00e609 tmpfs: fix uninitialized return value in shmem_link ... Browse Code »

When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value. Dan Carpenter pointed out this deficiency in the original patch.

Fixes: 1062af920c07 ("tmpfs: fix link accounting when a tmpfile is linked in")
Reported-by: Dan Carpenter
Signed-off-by: Darrick J. Wong
Signed-off-by: Hugh Dickins
Cc: Matej Kupljen
Cc: Al Viro
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Darrick J. Wong
2019-02-26 03:49:22 +0800

22 Feb, 2019

1 commit

1062af920 tmpfs: fix link accounting when a tmpfile is linked in ... Browse Code »

tmpfs has a peculiarity of accounting hard links as if they were
separate inodes: so that when the number of inodes is limited, as it is
by default, a user cannot soak up an unlimited amount of unreclaimable
dcache memory just by repeatedly linking a file.

But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
fd, we missed accommodating this new case in tmpfs: "df -i" shows that
an extra "inode" remains accounted after the file is unlinked and the fd
closed and the actual inode evicted. If a user repeatedly links
tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
are deleted.

Just skip the extra reservation from shmem_link() in this case: there's
a sense in which this first link of a tmpfile is then cheaper than a
hard link of another file, but the accounting works out, and there's
still good limiting, so no need to do anything more complicated.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
Fixes: f4e0c30c191 ("allow the temp files created by open() to be linked to")
Signed-off-by: Darrick J. Wong
Signed-off-by: Hugh Dickins
Reported-by: Matej Kupljen
Acked-by: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Darrick J. Wong
2019-02-22 01:01:00 +0800

29 Dec, 2018

2 commits

ca79b0c21 mm: convert totalram_pages and totalhigh_pages variables to atomic ... Browse Code »

totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS
Suggested-by: Michal Hocko
Suggested-by: Vlastimil Babka
Reviewed-by: Konstantin Khlebnikov
Reviewed-by: Pavel Tatashin
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: David Hildenbrand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun KS
2018-12-29 04:11:47 +0800
3d6357de8 mm: reference totalram_pages and managed_pages once per function ... Browse Code »

Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things. It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic. With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables. Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS
Reviewed-by: Konstantin Khlebnikov
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Reviewed-by: Pavel Tatashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun KS
2018-12-29 04:11:47 +0800

26 Dec, 2018

1 commit

4971f090a Merge tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm ... Browse Code »

Pull drm updates from Dave Airlie:
"Core:
- shared fencing staging removal
- drop transactional atomic helpers and move helpers to new location
- DP/MST atomic cleanup
- Leasing cleanups and drop EXPORT_SYMBOL
- Convert drivers to atomic helpers and generic fbdev.
- removed deprecated obj_ref/unref in favour of get/put
- Improve dumb callback documentation
- MODESET_LOCK_BEGIN/END helpers

panels:
- CDTech panels, Banana Pi Panel, DLC1010GIG,
- Olimex LCD-O-LinuXino, Samsung S6D16D0, Truly NT35597 WQXGA,
- Himax HX8357D, simulated RTSM AEMv8.
- GPD Win2 panel
- AUO G101EVN010

vgem:
- render node support

ttm:
- move global init out of drivers
- fix LRU handling for ghost objects
- Support for simultaneous submissions to multiple engines

scheduler:
- timeout/fault handling changes to help GPU recovery
- helpers for hw with preemption support

i915:
- Scaler/Watermark fixes
- DP MST + powerwell fixes
- PSR fixes
- Break long get/put shmemfs pages
- Icelake fixes
- Icelake DSI video mode enablement
- Engine workaround improvements

amdgpu:
- freesync support
- GPU reset enabled on CI, VI, SOC15 dGPUs
- ABM support in DC
- KFD support for vega12/polaris12
- SDMA paging queue on vega
- More amdkfd code sharing
- DCC scanout on GFX9
- DC kerneldoc
- Updated SMU firmware for GFX8 chips
- XGMI PSP + hive reset support
- GPU reset
- DC trace support
- Powerplay updates for newer Polaris
- Cursor plane update fast path
- kfd dma-buf support

virtio-gpu:
- add EDID support

vmwgfx:
- pageflip with damage support

nouveau:
- Initial Turing TU104/TU106 modesetting support

msm:
- a2xx gpu support for apq8060 and imx5
- a2xx gpummu support
- mdp4 display support for apq8060
- DPU fixes and cleanups
- enhanced profiling support
- debug object naming interface
- get_iova/page pinning decoupling

tegra:
- Tegra194 host1x, VIC and display support enabled
- Audio over HDMI for Tegra186 and Tegra194

exynos:
- DMA/IOMMU refactoring
- plane alpha + blend mode support
- Color format fixes for mixer driver

rcar-du:
- R8A7744 and R8A77470 support
- R8A77965 LVDS support

imx:
- fbdev emulation fix
- multi-tiled scalling fixes
- SPDX identifiers

rockchip
- dw_hdmi support
- dw-mipi-dsi + dual dsi support
- mailbox read size fix

qxl:
- fix cursor pinning

vc4:
- YUV support (scaling + cursor)

v3d:
- enable TFU (Texture Formatting Unit)

mali-dp:
- add support for linear tiled formats

sun4i:
- Display Engine 3 support
- H6 DE3 mixer 0 support
- H6 display engine support
- dw-hdmi support
- H6 HDMI phy support
- implicit fence waiting
- BGRX8888 support

meson:
- Overlay plane support
- implicit fence waiting
- HDMI 1.4 4k modes

bridge:
- i2c fixes for sii902x"

* tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm: (1403 commits)
drm/amd/display: Add fast path for cursor plane updates
drm/amdgpu: Enable GPU recovery by default for CI
drm/amd/display: Fix duplicating scaling/underscan connector state
drm/amd/display: Fix unintialized max_bpc state values
Revert "drm/amd/display: Set RMX_ASPECT as default"
drm/amdgpu: Fix stub function name
drm/msm/dpu: Fix clock issue after bind failure
drm/msm/dpu: Clean up dpu_media_info.h static inline functions
drm/msm/dpu: Further cleanups for static inline functions
drm/msm/dpu: Cleanup the debugfs functions
drm/msm/dpu: Remove dpu_irq and unused functions
drm/msm: Make irq_postinstall optional
drm/msm/dpu: Cleanup callers of dpu_hw_blk_init
drm/msm/dpu: Remove unused functions
drm/msm/dpu: Remove dpu_crtc_is_enabled()
drm/msm/dpu: Remove dpu_crtc_get_mixer_height
drm/msm/dpu: Remove dpu_dbg
drm/msm: dpu: Remove crtc_lock
drm/msm: dpu: Remove vblank_requested flag from dpu_crtc
drm/msm: dpu: Separate crtc assignment from vblank enable
...

Linus Torvalds
2018-12-26 03:48:26 +0800

14 Dec, 2018

1 commit

880b9df1b Merge tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax ... Browse Code »

Pull XArray fixes from Matthew Wilcox:
"Two bugfixes, each with test-suite updates, two improvements to the
test-suite without associated bugs, and one patch adding a missing
API"

* tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax:
XArray: Fix xa_alloc when id exceeds max
XArray tests: Check iterating over multiorder entries
XArray tests: Handle larger indices more elegantly
XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh
radix tree: Don't return retry entries from lookup

Linus Torvalds
2018-12-14 08:35:58 +0800

09 Dec, 2018

1 commit

356ff8a9a Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" ... Browse Code »

This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
node-local hugepage allocations"). The movement of the thp allocation
policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
intended to only set __GFP_THISNODE for mempolicies that are not
MPOL_BIND whereas the revert could set this regardless of mempolicy.

While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
and alloc_pages_vma() was racy, that has since been removed since the
revert. What is left is the possibility to use __GFP_THISNODE in
policy_node() when it is unexpected because the special handling for
hugepages in alloc_pages_vma() was removed as part of the consolidation.

Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
different policy for hugepage allocations, which were allocated through
alloc_hugepage_vma(). For hugepage allocations, if the allocating
process's node is in the set of allowed nodes, allocate with
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
__GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to
allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in
mm/mempolicy.c which is functionally different behavior and removes the
requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb539f9 instead of the partial
revert that was done in 2f0799a0ffc0. The result is the same thp
allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2018-12-09 02:26:20 +0800

06 Dec, 2018

1 commit

55f3f7eab XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh ... Browse Code »

These convenience wrappers match the other _irq and _bh wrappers we
already have. It turns out I'd already open-coded xa_cmpxchg_irq()
in the shmem code, so convert that.

Signed-off-by: Matthew Wilcox

Matthew Wilcox
2018-12-06 21:26:17 +0800

01 Dec, 2018

5 commits

aaa52e340 mm/khugepaged: fix crashes due to misaccounted holes ... Browse Code »

Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
closed and unlinked.

khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
on the rollback path, after it had added but then needs to undo some
holes.

There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting
blocks, and shmem_uncharge(), when __delete_from_page_cache() already
decremented nrpages itself: oh well, just add a comment on that to them
both.

And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages
earlier (though it's rare for the difference to matter at all).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2018-12-01 06:56:15 +0800
dcf7fe9d8 userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set ... Browse Code »

Set the page dirty if VM_WRITE is not set because in such case the pte
won't be marked dirty and the page would be reclaimed without writepage
(i.e. swapout in the shmem case).

This was found by source review. Most apps (certainly including QEMU)
only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't
modify the memory in the first place. This is for correctness and it
could help the non cooperative use case to avoid unexpected data loss.

Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
Reviewed-by: Hugh Dickins
Cc: stable@vger.kernel.org
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Reported-by: Hugh Dickins
Signed-off-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Jann Horn
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Peter Xu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2018-12-01 06:56:14 +0800
e2a50c1f6 userfaultfd: shmem: add i_size checks ... Browse Code »

With MAP_SHARED: recheck the i_size after taking the PT lock, to
serialize against truncate with the PT lock. Delete the page from the
pagecache if the i_size_read check fails.

With MAP_PRIVATE: check the i_size after the PT lock before mapping
anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

A mostly irrelevant cleanup: like we do the delete_from_page_cache()
pagecache removal after dropping the PT lock, the PT lock is a spinlock
so drop it before the sleepable page lock.

Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Rapoport
Reviewed-by: Hugh Dickins
Reported-by: Jann Horn
Cc:
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Peter Xu
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2018-12-01 06:56:14 +0800
9e368259a userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails ... Browse Code »

Patch series "userfaultfd shmem updates".

Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
lack of the VM_MAYWRITE check and the lack of i_size checks.

Then looking into the above we also fixed the MAP_PRIVATE case.

Hugh by source review also found a data loss source if UFFDIO_COPY is
used on shmem MAP_SHARED PROT_READ mappings (the production usages
incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
happen in those production usages like with QEMU).

The whole patchset is marked for stable.

We verified QEMU postcopy live migration with guest running on shmem
MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE.
Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
unconditionally invokes a punch hole if the guest mapping is filebacked
and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
for the anon backend).

This patch (of 5):

We internally used EFAULT to communicate with the caller, switch to
ENOENT, so EFAULT can be used as a non internal retval.

Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Rapoport
Reviewed-by: Hugh Dickins
Cc: Mike Kravetz
Cc: Jann Horn
Cc: Peter Xu
Cc: "Dr. David Alan Gilbert"
Cc:
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2018-12-01 06:56:14 +0800
c1cb20d43 mm: use swp_offset as key in shmem_replace_page() ... Browse Code »

We changed the key of swap cache tree from swp_entry_t.val to
swp_offset. We need to do so in shmem_replace_page() as well.

Hugh said:
"shmem_replace_page() has been wrong since the day I wrote it: good
enough to work on swap "type" 0, which is all most people ever use
(especially those few who need shmem_replace_page() at all), but
broken once there are any non-0 swp_type bits set in the higher order
bits"

Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache")
Signed-off-by: Yu Zhao
Reviewed-by: Matthew Wilcox
Acked-by: Hugh Dickins
Cc: [4.9+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yu Zhao
2018-12-01 06:56:14 +0800

20 Nov, 2018

1 commit

2ac5e38ea Merge drm/drm-next into drm-intel-next-queued ... Browse Code »

Pull in v4.20-rc3 via drm-next.

Signed-off-by: Jani Nikula

Jani Nikula
2018-11-20 19:14:08 +0800

19 Nov, 2018

1 commit

1a4136469 tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset ... Browse Code »

Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

man 2 lseek says

: EINVAL whence is not valid. Or: the resulting file offset would be
: negative, or beyond the end of a seekable device.
:
: ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
: the end of the file.

Make tmpfs return ENXIO under these circumstances as well. After this,
tmpfs also passes xfstests's generic/448.

[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu
Reviewed-by: Andrew Morton
Cc: Al Viro
Cc: Hugh Dickins
Cc: William Kucharski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yufen Yu
2018-11-19 02:15:10 +0800

07 Nov, 2018

1 commit

64e3d12f7 mm, drm/i915: mark pinned shmemfs pages as unevictable ... Browse Code »

The i915 driver uses shmemfs to allocate backing storage for gem
objects. These shmemfs pages can be pinned (increased ref count) by
shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan
wastes a lot of time scanning these pinned pages. In some extreme case,
all pages in the inactive anon lru are pinned, and only the inactive
anon lru is scanned due to inactive_ratio, the system cannot swap and
invokes the oom-killer. Mark these pinned pages as unevictable to speed
up vmscan.

Export pagevec API check_move_unevictable_pages().

This patch was inspired by Chris Wilson's change [1].

[1]: https://patchwork.kernel.org/patch/9768741/

Cc: Chris Wilson
Cc: Joonas Lahtinen
Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Dave Hansen
Signed-off-by: Kuo-Hsin Yang
Acked-by: Michal Hocko # mm part
Reviewed-by: Chris Wilson
Acked-by: Dave Hansen
Acked-by: Andrew Morton
Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson

Kuo-Hsin Yang
2018-11-07 23:28:32 +0800

04 Nov, 2018

1 commit

89c83fb53 mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask ... Browse Code »

THP allocation mode is quite complex and it depends on the defrag mode.
This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
part currently. The NUMA special casing (namely __GFP_THISNODE) is
however independent and placed in alloc_pages_vma currently. This both
adds an unnecessary branch to all vma based page allocation requests and
it makes the code more complex unnecessarily as well. Not to mention
that e.g. shmem THP used to do the node reclaiming unconditionally
regardless of the defrag mode until recently. This was not only
unexpected behavior but it was also hardly a good default behavior and I
strongly suspect it was just a side effect of the code sharing more than
a deliberate decision which suggests that such a layering is wrong.

Get rid of the thp special casing from alloc_pages_vma and move the
logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
resulting gfp mask only when the direct reclaim is not requested and
when there is no explicit numa binding to preserve the current logic.

Please note that there's also a slight difference wrt MPOL_BIND now. The
previous code would avoid using __GFP_THISNODE if the local node was
outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
for all MPOL_BIND policies. So there's a difference that if local node
is actually allowed by the bind policy's nodemask, previously
__GFP_THISNODE would be added, but now it won't be. From the behavior
POV this is still correct because the policy nodemask is used.

Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Alex Williamson
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: Stefan Priebe - Profihost AG
Cc: Zi Yan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2018-11-04 01:09:37 +0800