23 Oct, 2015
1 commit
-
commit 3aaa76e125c1dd58c9b599baa8c6021896874c12 upstream.
Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
each hugetlb page maintains its active flag to avoid a race condition
between multiple calls of isolate_huge_page(), but the current kernel
doesn't set the flag on a hugepage allocated by migration because the
proper putback routine isn't called. This means that users could
still encounter the race referred to by bcc54222309c in this special
case, so this patch fixes it.
Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Naoya Horiguchi
Cc: Michal Hocko
Cc: Andi Kleen
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
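[A sketch of the shape of the fix in unmap_and_move_huge_page(): on success the destination hugepage is handed to the hugetlb putback routine, which sets the active flag. Helper and variable names follow the hugetlb code; treat this as an illustration, not the verbatim hunk.]
	if (rc == MIGRATEPAGE_SUCCESS)
		/* sets page_huge_active and puts the new hugepage on the
		   active list, so a later isolate_huge_page() can succeed */
		putback_active_hugepage(new_hpage);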
16 Apr, 2015
1 commit
-
With the page flag sanitization patchset, an invalid usage of
ClearPageSwapCache() is detected in migrate_page_copy().
migrate_page_copy() is shared by both normal and hugepage (both thp and
hugetlb) code paths, so let's check PageSwapCache() and clear it only if it's
set, to avoid misuse of the invalid clear operation.
Signed-off-by: Naoya Horiguchi
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
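[The check-before-clear pattern described above amounts to the following sketch:]
	/* the clear operation is only valid if the flag is actually set */
	if (PageSwapCache(page))
		ClearPageSwapCache(page);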
15 Apr, 2015
2 commits
-
This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
scanning even if migrate rate limited") so remove it.
Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1):
mm/migrate.c: In function `migrate_pages':
mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
make[1]: *** [mm/migrate.o] Error 1
make: *** [mm/migrate.o] Error 2
Mark unmap_and_move() (which is used in a single place only) "noinline"
to work around this compiler bug.
[akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
[khilman@kernel.org: fine-tune compiler versions]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Geert Uytterhoeven
Reported-by: Kevin Hilman
Cc: Marc Zyngier
Tested-by: Kevin Hilman
Tested-by: Lina Iyer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
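[The workaround keys the attribute off compiler version and architecture so other configurations keep the inlining; roughly (macro name per the patch, function signature abridged to the era's migrate.c):]
#if (GCC_VERSION >= 40700 && GCC_VERSION < 40900) && defined(CONFIG_ARM)
#define ICE_noinline noinline	/* dodge the gcc 4.7/4.8 ARM ICE */
#else
#define ICE_noinline
#endif

static ICE_noinline int unmap_and_move(new_page_t get_new_page,
				       free_page_t put_new_page,
				       unsigned long private, struct page *page,
				       int force, enum migrate_mode mode)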
13 Feb, 2015
2 commits
-
With PROT_NONE, the traditional page table manipulation functions are
sufficient.
[andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
[akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
Signed-off-by: Mel Gorman
Acked-by: Linus Torvalds
Acked-by: Aneesh Kumar
Tested-by: Sasha Levin
Cc: Benjamin Herrenschmidt
Cc: Dave Jones
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: Kirill Shutemov
Cc: Paul Mackerras
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults. It was
universally loved by everybody and caused no problems whatsoever. That
last sentence might be a lie.
This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them
relatively minor contributions. At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.
AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that. I tested trinity under kvm-tool and passed and ran a few
other basic tests. At the time of writing, only the short-lived tests
have completed but testing of V2 indicated that long-term testing had no
surprises. In most cases I'm leaving out detail as it's not that
interesting.
specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.
specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher. Again, interrupts are a major factor.
autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.
autonumabench
3.18.0-rc5 3.18.0-rc5
mmotm-20141119 protnone-v3r3
User NUMA01 32380.24 ( 0.00%) 21642.92 ( 33.16%)
User NUMA01_THEADLOCAL 22481.02 ( 0.00%) 22283.22 ( 0.88%)
User NUMA02 3137.00 ( 0.00%) 3116.54 ( 0.65%)
User NUMA02_SMT 1614.03 ( 0.00%) 1543.53 ( 4.37%)
System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%)
System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%)
System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%)
System NUMA02_SMT 7.36 ( 0.00%) 7.45 ( -1.22%)
Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%)
Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%)
Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%)
Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%)
CPU NUMA01 4563.00 ( 0.00%) 3855.00 ( 15.52%)
CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%)
CPU NUMA02 3785.00 ( 0.00%) 3770.00 ( 0.40%)
CPU NUMA02_SMT 1872.00 ( 0.00%) 1959.00 ( -4.65%)
System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage is
great. Overall time to complete the benchmark is comparable:
3.18.0-rc5 3.18.0-rc5
mmotm-20141119 protnone-v3r3
User 59612.50 48586.44
System 460.22 1537.45
Elapsed 1442.20 1304.29
NUMA alloc hit 5075182 5743353
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 5075174 5743339
NUMA base PTE updates 637061448 443106883
NUMA huge PMD updates 1243434 864747
NUMA page range updates 1273699656 885857347
NUMA hint faults 1658116 1214277
NUMA hint local faults 959487 754113
NUMA hint local percent 57 62
NUMA pages migrated 5467056 61676398
The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism. For all other workloads, migration activity was comparable.
Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts. To some extent, higher overhead from this
approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.
This patch (of 10):
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock, and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting faults.
This patch closes the race.
Signed-off-by: Mel Gorman
Cc: Aneesh Kumar K.V
Cc: Benjamin Herrenschmidt
Cc: Dave Jones
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: Kirill Shutemov
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Rik van Riel
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
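[A sketch of the safer ordering this first patch establishes: take the page table lock, confirm the pmd has not changed, and only then test for a THP under migration. pmd_trans_migrating() and the wait are illustrative of the helpers in use around this series, not the exact diff.]
	ptl = pmd_lock(mm, pmdp);
	if (unlikely(!pmd_same(pmd, *pmdp))) {
		spin_unlock(ptl);	/* lost a race; let the fault retry */
		return 0;
	}
	if (unlikely(pmd_trans_migrating(*pmdp))) {
		page = pmd_page(*pmdp);
		spin_unlock(ptl);
		wait_on_page_locked(page);	/* migration in progress */
		return 0;
	}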
12 Feb, 2015
1 commit
-
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd(), taking the page table lock.
This patch intentionally removes the page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages so
that tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could add similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <numaif.h>
/* header names were stripped from this digest; the above are the obvious
   candidates for strtol/malloc, numa_move_pages() and MPOL_MF_MOVE_ALL */

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000

int main(int argc, char *argv[]) {
	int i;
	int nr_hp = strtol(argv[1], NULL, 0);
	int nr_p = nr_hp * HPS / PS;
	int ret;
	void **addrs;
	int *status;
	int *nodes;
	pid_t pid;

	pid = strtol(argv[2], NULL, 0);
	addrs = malloc(sizeof(char *) * nr_p + 1);
	status = malloc(sizeof(char *) * nr_p + 1);
	nodes = malloc(sizeof(char *) * nr_p + 1);

	while (1) {
		/* bounce every small page of the range to node 1 ... */
		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 1;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			perror("move_pages");

		/* ... and back to node 0, forever */
		for (i = 0; i < nr_p; i++) {
			addrs[i] = (void *)ADDR_INPUT + i * PS;
			nodes[i] = 0;
			status[i] = 0;
		}
		ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
				      MPOL_MF_MOVE_ALL);
		if (ret == -1)
			perror("move_pages");
	}
	return 0;
}

$ cat hugepage.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
/* header names were stripped here as well */

#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000

int main(int argc, char *argv[]) {
	int nr_hp = strtol(argv[1], NULL, 0);
	char *p;

	while (1) {
		/* map, touch and unmap the hugepages, forever */
		p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p != (void *)ADDR_INPUT) {
			perror("mmap");
			break;
		}
		memset(p, 0, nr_hp * HPS);
		munmap(p, nr_hp * HPS);
	}
	return 0;
}

$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi
Reported-by: Hugh Dickins
Cc: James Hogan
Cc: David Rientjes
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Rik van Riel
Cc: Andrea Arcangeli
Cc: Luiz Capitulino
Cc: Nishanth Aravamudan
Cc: Lee Schermerhorn
Cc: Steve Capper
Cc: [3.12+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
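[The essence of the locked FOLL_GET path, simplified (the real function also deals with migration entries); the page table lock pins the pmd while the reference is taken, so the hugepage cannot be freed in between:]
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
			     pmd_t *pmd, int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;

	ptl = pmd_lockptr(mm, pmd);
	spin_lock(ptl);
	if (!pmd_huge(*pmd))
		goto out;
	page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
	if (flags & FOLL_GET)
		get_page(page);		/* safe: freeing is excluded by ptl */
out:
	spin_unlock(ptl);
	return page;
}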
11 Feb, 2015
1 commit
-
We don't create non-linear mappings anymore. Let's drop code which
handles them in rmap.
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Dec, 2014
1 commit
-
The only instance this method has ever grown was one in kernfs -
one that calls ->migrate() of another vm_ops if it exists.
Signed-off-by: Al Viro
16 Dec, 2014
1 commit
-
Pull drm updates from Dave Airlie:
"Highlights:- AMD KFD driver merge
This is the AMD HSA interface for exposing a lowlevel interface for
GPGPU use. They have an open source userspace built on top of this
interface, and the code looks as good as it was going to get out of
tree.- Initial atomic modesetting work
The need for an atomic modesetting interface to allow userspace to
try and send a complete set of modesetting state to the driver has
arisen, and been suffering from neglect this past year. No more,
the start of the common code and changes for msm driver to use it
are in this tree. Ongoing work to get the userspace ioctl finished
and the code clean will probably wait until next kernel.- DisplayID 1.3 and tiled monitor exposed to userspace.
Tiled monitor property is now exposed for userspace to make use of.
- Rockchip drm driver merged.
- imx gpu driver moved out of staging
Other stuff:
- core:
panel - MIPI DSI + new panels.
expose suggested x/y properties for virtual GPUs- i915:
Initial Skylake (SKL) support
gen3/4 reset work
start of dri1/ums removal
infoframe tracking
fixes for lots of things.- nouveau:
tegra k1 voltage support
GM204 modesetting support
GT21x memory reclocking work- radeon:
CI dpm fixes
GPUVM improvements
Initial DPM fan control- rcar-du:
HDMI support added
removed some support for old boards
slave encoder driver for Analog Devices adv7511- exynos:
Exynos4415 SoC support- msm:
a4xx gpu support
atomic helper conversion- tegra:
iommu support
universal plane support
ganged-mode DSI support- sti:
HDMI i2c improvements- vmwgfx:
some late fixes.- qxl:
use suggested x/y properties"* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
drm: sti: fix module compilation issue
drm/i915: save/restore GMBUS freq across suspend/resume on gen4
drm: sti: correctly cleanup CRTC and planes
drm: sti: add HQVDP plane
drm: sti: add cursor plane
drm: sti: enable auxiliary CRTC
drm: sti: fix delay in VTG programming
drm: sti: prepare sti_tvout to support auxiliary crtc
drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
drm: sti: fix hdmi avi infoframe
drm: sti: remove event lock while disabling vblank
drm: sti: simplify gdp code
drm: sti: clear all mixer control
drm: sti: remove gpio for HDMI hot plug detection
drm: sti: allow to change hdmi ddc i2c adapter
drm/doc: Document drm_add_modes_noedid() usage
drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
drm: Zero out DRM object memory upon cleanup
drm/i915/bdw: Fix the write setting up the WIZ hashing mode
...
14 Dec, 2014
1 commit
-
Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
created for use on pages almost certainly mapped into userspace. But
nowadays compaction often applies them to unmapped page cache pages: which
may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
try_to_unmap_file() makes no preliminary page_mapped() check.
Now check page_mapped() in __unmap_and_move(); and avoid repeating the
same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
never inserted any.
(The PageAnon(page) comment blocks now look even sillier than before, but
clean that up on some other occasion. And note in passing that
try_to_unmap_one() does not use a migration entry when PageSwapCache, so
remove_migration_ptes() will then not update that swap entry to newpage
pte: not a big deal, but something else to clean up later.)
Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
conversion to i_mmap_rwsem, that "The biggest winner of these changes is
migration": a part of the reason might be all of that unnecessary taking
of i_mmap_mutex in page migration; and it's rather a shame that I didn't
get around to sending this patch in before his - this one is much less
useful after Davidlohr's conversion to rwsem, but still good.
Signed-off-by: Hugh Dickins
Cc: Davidlohr Bueso
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
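[Schematically, both the unmap and the later remap are now skipped when the page was never mapped (a sketch; the TTU flags are those of migration mode):]
	if (page_mapped(page)) {
		try_to_unmap(page,
			TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
		page_was_mapped = 1;
	}
	...
	/* only undo migration ptes if we actually installed some */
	if (rc && page_was_mapped)
		remove_migration_ptes(page, page);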
13 Nov, 2014
1 commit
-
Add calls to the new mmu_notifier_invalidate_range() function to all
places in the VMM that need it.
Signed-off-by: Joerg Roedel
Reviewed-by: Andrea Arcangeli
Reviewed-by: Jérôme Glisse
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Jay Cornwall
Cc: Oded Gabbay
Cc: Suravee Suthikulpanit
Cc: Jesse Barnes
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Oded Gabbay
10 Oct, 2014
1 commit
-
Sasha Levin reported KASAN splash inside isolate_migratepages_range().
Problem is in the function __is_movable_balloon_page() which tests
AS_BALLOON_MAP in page->mapping->flags. This function has no protection
against anonymous pages. As a result it tried to check address space flags
inside struct anon_vma.
Further investigation shows more problems in the current implementation:
* Special branch in __unmap_and_move() never works:
balloon_page_movable() checks page flags and page_count. In
__unmap_and_move() page is locked, reference counter is elevated, thus
balloon_page_movable() always fails. As a result execution goes to the
normal migration path. virtballoon_migratepage() returns
MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
move_to_new_page() thinks this is an error code and assigns
newpage->mapping to NULL. Newly migrated page lose connectivity with
balloon an all ability for further migration.* lru_lock erroneously required in isolate_migratepages_range() for
isolation ballooned page. This function releases lru_lock periodically,
this makes migration mostly impossible for some pages.* balloon_page_dequeue have a tight race with balloon_page_isolate:
balloon_page_isolate could be executed in parallel with dequeue between
picking page from list and locking page_lock. Race is rare because they
use trylock_page() for locking.
This patch fixes all of them.
Instead of fake mapping with special flag this patch uses special state of
page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. Buddy allocator uses
PAGE_BUDDY_MAPCOUNT_VALUE = -128 for similar purpose. Storing mark
directly in struct page makes everything safer and easier.
PagePrivate is used to mark pages present in page list (i.e. not
isolated, like PageLRU for normal pages). It replaces special rules for
reference counter and makes balloon migration similar to migration of
normal pages. This flag is protected by page_lock together with link to
the balloon device.
Signed-off-by: Konstantin Khlebnikov
Reported-by: Sasha Levin
Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
Cc: Rafael Aquini
Cc: Andrey Ryabinin
Cc: [3.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
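[The _mapcount marker works like the buddy allocator's; a sketch of the helpers described above:]
#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

static inline int PageBalloon(struct page *page)
{
	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
}

static inline void __SetPageBalloon(struct page *page)
{
	/* _mapcount of a freshly allocated page is -1 */
	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
}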
03 Oct, 2014
1 commit
-
A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again, but that is straightforward. However, there is a race between write
migration entries being marked read and migrations finishing. This potentially
allows a PTE to be made writable that should have been read-only. Close this
race by double checking the VMA permissions using maybe_mkwrite when migration
completes.
[torvalds@linux-foundation.org: use maybe_mkwrite]
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Signed-off-by: Linus Torvalds
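[The double check lands where the migration entry is turned back into a real pte; roughly:]
	/* in remove_migration_pte(): rebuild the pte from the entry */
	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
	if (is_write_migration_entry(entry))
		pte = maybe_mkwrite(pte, vma);	/* re-validate VMA permissions */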
09 Aug, 2014
1 commit
-
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.
Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages. However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:
- Charging, uncharging, page migration, and charge migration all need
to take a per-page bit spinlock as they could race with uncharging.
- Swap cache truncation happens during both swap-in and swap-out, and
possibly repeatedly before the page is actually freed. This means
that the memcg swapout code is called from many contexts that make
no sense and it has to figure out the direction from page state to
make sure memory and memory+swap are always correctly charged.
- On page migration, the old page might be unmapped but then reused,
so memcg code has to prevent untimely uncharging in that case.
Because this code - which should be a simple charge transfer - is so
special-cased, it is not reusable for replace_page_cache().
But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.
For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped. Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge. The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.
mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache(). However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration. Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.
Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.
Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration. Remove the very costly page_cgroup
lock and set pc->flags non-atomically.[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Cc: Tejun Heo
Cc: Vladimir Davydov
Tested-by: Jet Chen
Acked-by: Michal Hocko
Tested-by: Felipe Balbi
Signed-off-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
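[The resulting lifecycle hooks are small at the call sites; schematically (signatures as introduced by this series):]
	/* migration: after newpage is fully rmapped and has replaced page */
	mem_cgroup_migrate(page, newpage, false);

	/* reclaim: transfer the memory+swap charge to the swap entry */
	mem_cgroup_swapout(page, entry);

	/* teardown: after the final put_page(), no references remain */
	mem_cgroup_uncharge(page);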
27 Jul, 2014
1 commit
-
Shortly before 3.16-rc1, Dave Jones reported:
WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
xfs_vm_writepage+0x5ce/0x630 [xfs]()
CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
etc.
970  if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
971                   PF_MEMALLOC))
I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.
However, my own /var/log/messages now shows similar complaints
WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()
from stressing some mmotm trees during July.
Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
page freeing callback") has provided such a way to compaction: if
migrating a SwapBacked page fails, its newpage may be put back on the
list for later use with PageSwapBacked still set, and nothing will clear
it.
Whether that can do anything worse than issue WARN_ON_ONCEs, and get
some statistics wrong, is unclear: easier to fix than to think through
the consequences.
Fixing it here, before the put_new_page(), addresses the bug directly,
but is probably the worst place to fix it. Page migration is doing too
many parts of the job on too many levels: fixing it in
move_to_new_page() to complement its SetPageSwapBacked would be
preferable, except why is it (and newpage->mapping and newpage->index)
done there, rather than down in migrate_page_move_mapping(), once we are
sure of success? Not a cleanup to get into right now, especially not
with memcg cleanups coming in 3.17.
Reported-by: Dave Jones
Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds
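[As described, the immediate fix clears the stray flag before the unused newpage is handed back (a sketch of the shape, not the verbatim hunk):]
	if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
		/* don't leak PageSwapBacked into the compaction freelist */
		ClearPageSwapBacked(newpage);
		put_new_page(newpage, private);
	}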
24 Jun, 2014
1 commit
-
Trinity has reported:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
lock_acquire (arch/x86/include/asm/current.h:14
kernel/locking/lockdep.c:3602)
_raw_spin_lock (include/linux/spinlock_api_smp.h:143
kernel/locking/spinlock.c:151)
remove_migration_pte (mm/migrate.c:137)
rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
remove_migration_ptes (mm/migrate.c:224)
migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
migrate_misplaced_page (mm/migrate.c:1733)
__handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
handle_mm_fault (mm/memory.c:3948)
__get_user_pages (mm/memory.c:1851)
__mlock_vma_pages_range (mm/mlock.c:255)
__mm_populate (mm/mlock.c:711)
SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
I believe this comes about because, whereas collapsing and splitting THP
functions take anon_vma lock in write mode (which excludes concurrent
rmap walks), faulting THP functions (write protection and misplaced
NUMA) do not - and mostly they do not need to.
But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
an instant (indeed, for a long instant, given the inter-CPU TLB flush in
there), leaves *pmd neither present nor trans_huge.
Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace. Although that rmap walk has a 4k page
to insert, anon_vmas containing THPs are in no way segregated from
4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
that instant when a trans_huge pmd is temporarily absent.
I don't think we need to strengthen the locking at the THP end: it's easily
handled with an ACCESS_ONCE() before testing both conditions.
And since mm_find_pmd() had only one caller who wanted a THP rather than
a pmd, let's slightly repurpose it to fail when it hits a THP or
non-present pmd, and open code split_huge_page_address() again.
Signed-off-by: Hugh Dickins
Reported-by: Sasha Levin
Acked-by: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Mel Gorman
Cc: Bob Liu
Cc: Christoph Lameter
Cc: Dave Jones
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
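[The repurposed mm_find_pmd() therefore reads the pmd once and rejects both the non-present and the trans_huge cases, close to the description above:]
	pmd = pmd_offset(pud, address);
	/*
	 * A THP fault may be running pmdp_clear_flush(), set_pmd_at()
	 * concurrently, so read *pmd just once and insist on a present,
	 * non-huge value before handing it back for pte lookup.
	 */
	pmde = ACCESS_ONCE(*pmd);
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		pmd = NULL;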
05 Jun, 2014
3 commits
-
We already have a function named hugepages_supported(), and the similar
name hugepage_migration_support() is a bit uncomfortable, so let's rename
it hugepage_migration_supported().
Signed-off-by: Naoya Horiguchi
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.
This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.
If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.
Signed-off-by: David Rientjes
Reviewed-by: Naoya Horiguchi
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Greg Thelen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
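[The callback pair then travels together through the migration API; schematically (free_page_t mirrors new_page_t, per the description):]
typedef struct page *new_page_t(struct page *page, unsigned long private,
				int **result);
typedef void free_page_t(struct page *page, unsigned long private);

int migrate_pages(struct list_head *from, new_page_t get_new_page,
		  free_page_t put_new_page, unsigned long private,
		  enum migrate_mode mode, int reason);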
-
Migration of misplaced transhuge pages uses page_add_new_anon_rmap() when
putting the page back, as it avoided atomic operations and added the new
page to the correct LRU. A side-effect is that the page gets marked
activated as part of the migration meaning that transhuge and base pages
are treated differently from an aging perspective than base page
migration.
This patch uses page_add_anon_rmap() and putback_lru_page() on completion
of a transhuge migration similar to base page migration. It would require
fewer atomic operations to use lru_cache_add without taking an additional
reference to the page. The downside would be that it's still different to
base page migration and unevictable pages may be added to the wrong LRU
for cleaning up later. Testing of the usual workloads did not show any
adverse impact from the change.
Signed-off-by: Mel Gorman
Cc: Rik van Riel
Cc: Sasha Levin
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Mar, 2014
1 commit
-
Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
indicating that remove_migration_ptes() failed to find one of the
migration entries that was temporarily inserted.
The problem comes from remap_file_pages()'s switch from vma_interval_tree
(good for inserting the migration entry) to i_mmap_nonlinear list (no good
for locating it again); but can only be a problem if the remap_file_pages()
range does not cover the whole of the vma (zap_pte() clears the range).
remove_migration_ptes() needs a file_nonlinear method to go down the
i_mmap_nonlinear list, applying linear location to look for migration
entries in those vmas too, just in case there was this race.
The file_nonlinear method does need rmap_walk_control.arg to do this;
but it never needed vma passed in - vma comes from its own iteration.
Reported-and-tested-by: Dave Jones
Reported-and-tested-by: Sasha Levin
Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds
11 Mar, 2014
1 commit
-
GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes. It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g. through a subsequent allocation request
without GFP_THISNODE set.
However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary. This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.
Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE. This restricts the allocation to a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.
Signed-off-by: Johannes Weiner
Acked-by: Rik van Riel
Cc: Jan Stancek
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
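[The conversion itself is a one-flag change at each call site; for a node-bound migration allocation it looks roughly like this:]
	/* before: node-exclusive AND no reclaim, so easy premature failure */
	newpage = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);

	/* after: still node-exclusive, but the slowpath may wake kswapd
	   and reclaim on that node to satisfy the request */
	newpage = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);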
28 Jan, 2014
1 commit
-
Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
over the cpupid at page migration time. It is unnecessary to set it
again in alloc_misplaced_dst_page().
Signed-off-by: Wanpeng Li
Reviewed-by: Naoya Horiguchi
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jan, 2014
2 commits
-
Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
over the cpupid at page migration time. It is unnecessary to set it
again in migrate_misplaced_transhuge_page().
Signed-off-by: Wanpeng Li
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.
I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.
This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.
[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
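[A minimal sketch of the macro; dump_page()'s exact prototype has varied, so treat this as the idea rather than the final form:]
#define VM_BUG_ON_PAGE(cond, page)					\
	do {								\
		if (unlikely(cond)) {					\
			dump_page(page);	/* flags, mapping, counts */ \
			BUG();						\
		}							\
	} while (0)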
22 Jan, 2014
8 commits
-
fail_migrate_page() isn't used anywhere, so remove it.
Signed-off-by: Joonsoo Kim
Acked-by: Christoph Lameter
Reviewed-by: Naoya Horiguchi
Reviewed-by: Wanpeng Li
Cc: Rafael Aquini
Cc: Vlastimil Babka
Cc: Wanpeng Li
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Some part of putback_lru_pages() and putback_movable_pages() is
duplicated, which can confuse us about which one we should use. We can remove
putback_lru_pages() since it is not really needed now. This makes the code
easier to understand and maintain.
The comment on putback_movable_pages() is stale now, so fix it.
Signed-off-by: Joonsoo Kim
Reviewed-by: Wanpeng Li
Cc: Christoph Lameter
Cc: Naoya Horiguchi
Cc: Rafael Aquini
Cc: Vlastimil Babka
Cc: Wanpeng Li
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We should remove the page from the list if we fail with ENOSYS, since
migrate_pages() considers error cases other than -ENOMEM and -EAGAIN as
permanent failure and assumes that the page would be removed from the
list. Without this patch, we could overcount the number of failures.
In addition, we should put back the new hugepage if
!hugepage_migration_support(). If not, we would leak hugepage memory.
Signed-off-by: Joonsoo Kim
Acked-by: Christoph Lameter
Reviewed-by: Wanpeng Li
Reviewed-by: Naoya Horiguchi
Cc: Rafael Aquini
Cc: Vlastimil Babka
Cc: Wanpeng Li
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Let's add a comment about where the failed page goes to, which makes
code more readable.
Signed-off-by: Naoya Horiguchi
Signed-off-by: Joonsoo Kim
Acked-by: Christoph Lameter
Reviewed-by: Wanpeng Li
Acked-by: Rafael Aquini
Cc: Vlastimil Babka
Cc: Wanpeng Li
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Vlastimil Babka
Cc: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A low local/remote numa hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when
migration fails due to migration rate limitation.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
NUMA migrate rate limiting protects a migration counter and window using
a lock but in some cases this can be a contended lock. It is not
critical that the number of pages be perfect, lost updates are
acceptable. Reduce the importance of this lock.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
numamigrate_update_ratelimit and numamigrate_isolate_page only have
callers in mm/migrate.c. This patch makes them static.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In each rmap traverse case, there is some difference, so we need
function pointers and arguments to them in order to handle these differences.
For this purpose, struct rmap_walk_control is introduced in this patch,
and will be extended in a following patch. Introducing and extending are
kept separate because it clarifies the changes.
Signed-off-by: Joonsoo Kim
Reviewed-by: Naoya Horiguchi
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
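[In its initial form the control structure only needs the per-vma callback and an opaque argument, with the following patches adding more hooks; a sketch:]
struct rmap_walk_control {
	void *arg;	/* passed through to rmap_one */
	/* invoked for each vma that may map the page */
	int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
			unsigned long addr, void *arg);
};

int rmap_walk(struct page *page, struct rmap_walk_control *rwc);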
22 Dec, 2013
1 commit
-
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.
While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.
Signed-off-by: Benjamin LaHaise
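[With the parameter added, a caller holding its own reference, like aio's ring page, simply declares it; schematically:]
int migrate_page_move_mapping(struct address_space *mapping,
			      struct page *newpage, struct page *page,
			      struct buffer_head *head, enum migrate_mode mode,
			      int extra_count);

	/* aio: account for the one extra reference held by the ring */
	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);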
19 Dec, 2013
5 commits
-
THP migration can fail for a variety of reasons. Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
do_huge_pmd_numa_page() handles the case where there is parallel THP
migration. However, by the time it is checked the NUMA hinting
information has already been disrupted. This patch adds an earlier
check with some helpers.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If a PMD changes during a THP migration then migration aborts but the
failure path is doing more work than is necessary.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
MMU notifiers must be called on THP page migration or secondary MMUs
will get very confused.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.
THP does not unmap pages due to a lack of support for migration entries
at a PMD level. This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().
This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side the page table lock is taken for each PTE update.
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Alex Thorlton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
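[On the fast-gup side the serialisation is just an early bail-out to the slow path, e.g. in the x86 gup_pmd_range() walker (sketch):]
	/* force pmd_numa pages through the slow path: it takes the
	   page table lock and so serialises against THP migration */
	if (pmd_numa(pmd))
		return 0;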
22 Nov, 2013
1 commit
-
Right now, the migration code in migrate_page_copy() uses copy_huge_page()
for hugetlbfs and thp pages:
	if (PageHuge(page) || PageTransHuge(page))
		copy_huge_page(newpage, page);
So, yay for code reuse. But:
void copy_huge_page(struct page *dst, struct page *src)
{
	struct hstate *h = page_hstate(src);
and a non-hugetlbfs page has no page_hstate(). This works 99% of the
time because page_hstate() determines the hstate from the page order
alone. Since the page order of a THP page matches the default hugetlbfs
page order, it works.
But, if you change the default huge page size on the boot command-line
(say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
so page_hstate() returns null and copy_huge_page() oopses pretty fast
since copy_huge_page() dereferences the hstate:
void copy_huge_page(struct page *dst, struct page *src)
{
struct hstate *h = page_hstate(src);
if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
	...
Mel noticed that the migration code is really the only user of these
functions. This moves all the copy code over to migrate.c and makes
copy_huge_page() work for THP by checking for it explicitly.
I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
THP migration for the NUMA working set scanning fault case")
[akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen
Acked-by: Mel Gorman
Reviewed-by: Naoya Horiguchi
Cc: Hillf Danton
Cc: Andrea Arcangeli
Tested-by: Dave Jiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
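[After the move, copy_huge_page() distinguishes the two cases explicitly; a condensed sketch of the shape described above:]
static void copy_huge_page(struct page *dst, struct page *src)
{
	int i, nr_pages;

	if (PageHuge(src)) {
		/* hugetlbfs page: page_hstate() is valid here */
		struct hstate *h = page_hstate(src);
		nr_pages = pages_per_huge_page(h);
		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
			copy_gigantic_page(dst, src);	/* per-page kmap walk */
			return;
		}
	} else {
		/* thp page: no hstate, use the compound order */
		nr_pages = hpage_nr_pages(src);
	}

	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		copy_highpage(dst + i, src + i);
	}
}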