07 Sep, 2017
40 commits
-
Patch series "Ranged pagevec lookup", v2.
In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide ranged variant of pagevec_lookup() and use it
in places where it makes sense. This not only removes some common code
but is also a measurable performance win for some use cases (see patch
4/10) where radix tree is sparse and searching & grabing of a page after
the end of the range has measurable overhead.This patch (of 10):
The callback doesn't ever get called. Remove it.
Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Tetsuo Handa has reported[1][2][3] that direct reclaimers might get
stuck in too_many_isolated loop basically for ever because the last few
pages on the LRU lists are isolated by the kswapd which is stuck on fs
locks when doing the pageout or slab reclaim. This in turn means that
there is nobody to actually trigger the oom killer and the system is
basically unusable.too_many_isolated has been introduced by commit 35cd78156c49 ("vmscan:
throttle direct reclaim when too many pages are isolated already") to
prevent from pre-mature oom killer invocations because back then no
reclaim progress could indeed trigger the OOM killer too early.But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom:
rework oom detection") the allocation/reclaim retry loop considers all
the reclaimable pages and throttles the allocation at that layer so we
can loosen the direct reclaim throttling.Make shrink_inactive_list loop over too_many_isolated bounded and
returns immediately when the situation hasn't resolved after the first
sleep.Replace congestion_wait by a simple schedule_timeout_interruptible
because we are not really waiting on the IO congestion in this path.Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the
oom report as the number of isolated pages are printed there. If we
ever hit this should_reclaim_retry should consider those numbers in the
evaluation in one way or another.[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
[2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp[mhocko@suse.com: switch to uninterruptible sleep]
Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Tetsuo Handa
Tested-by: Tetsuo Handa
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Acked-by: Rik van Riel
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Getting -EBUSY from zs_page_migrate will make migration slow (retry) or
fail (zs_page_putback will schedule_work free_work, but it cannot ensure
the success).I noticed this issue because my Kernel patched
(https://lkml.org/lkml/2014/5/28/113) that will remove retry in
__alloc_contig_migrate_range.This retry will handle the -EBUSY because it will re-isolate the page
and re-call migrate_pages. Without it will make cma_alloc fail at once
with -EBUSY.According to the review from Minchan Kim in
https://lkml.org/lkml/2014/5/28/113, I update the patch to skip
unnecessary loops but not return -EBUSY if zspage is not inuse.Following is what I got with highalloc-performance in a vbox with 2 cpu
1G memory 512 zram as swap. And the swappiness is set to 100.ori ne
orig new
Minor Faults 50805113 50830235
Major Faults 43918 56530
Swap Ins 42087 55680
Swap Outs 89718 104700
Allocation stalls 0 0
DMA allocs 57787 52364
DMA32 allocs 47964599 48043563
Normal allocs 0 0
Movable allocs 0 0
Direct pages scanned 45493 23167
Kswapd pages scanned 1565222 1725078
Kswapd pages reclaimed 1342222 1503037
Direct pages reclaimed 45615 25186
Kswapd efficiency 85% 87%
Kswapd velocity 1897.101 1949.042
Direct efficiency 100% 108%
Direct velocity 55.139 26.175
Percentage direct scans 2% 1%
Zone normal velocity 1952.240 1975.217
Zone dma32 velocity 0.000 0.000
Zone dma velocity 0.000 0.000
Page writes by reclaim 89764.000 105233.000
Page writes file 46 533
Page writes anon 89718 104700
Page reclaim immediate 21457 3699
Sector Reads 3259688 3441368
Sector Writes 3667252 3754836
Page rescued immediate 0 0
Slabs scanned 1042872 1160855
Direct inode steals 8042 10089
Kswapd inode steals 54295 29170
Kswapd skipped wait 0 0
THP fault alloc 175 154
THP collapse alloc 226 289
THP splits 0 0
THP fault fallback 11 14
THP collapse fail 3 2
Compaction stalls 536 646
Compaction success 322 358
Compaction failures 214 288
Page migrate success 119608 111063
Page migrate failure 2723 2593
Compaction pages isolated 250179 232652
Compaction migrate scanned 9131832 9942306
Compaction free scanned 2093272 2613998
Compaction cost 192 189
NUMA alloc hit 47124555 47193990
NUMA alloc miss 0 0
NUMA interleave hit 0 0
NUMA alloc local 47124555 47193990
NUMA base PTE updates 0 0
NUMA huge PMD updates 0 0
NUMA page range updates 0 0
NUMA hint faults 0 0
NUMA hint local faults 0 0
NUMA hint local percent 100 100
NUMA pages migrated 0 0
AutoNUMA cost 0% 0%[akpm@linux-foundation.org: remove newline, per Minchan]
Link: http://lkml.kernel.org/r/1500889535-19648-1-git-send-email-zhuhui@xiaomi.com
Signed-off-by: Hui Zhu
Acked-by: Minchan Kim
Reviewed-by: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nadav Amit report zap_page_range only specifies that the caller protect
the VMA list but does not specify whether it is held for read or write
with callers using either. madvise holds mmap_sem for read meaning that
a parallel zap operation can unmap PTEs which are then potentially
skipped by madvise which potentially returns with stale TLB entries
present. While the API could be extended, it would be a difficult API
to use. This patch causes zap_page_range() to always consider flushing
the full affected range. For small ranges or sparsely populated
mappings, this may result in one additional spurious TLB flush. For
larger ranges, it is possible that the TLB has already been flushed and
the overhead is negligible. Either way, this approach is safer overall
and avoids stale entries being present when madvise returns.This can be illustrated with the following program provided by Nadav
Amit and slightly modified. With the patch applied, it has an exit code
of 0 indicating a stale TLB entry did not leak to userspace.---8<< 32);
}static inline void wait_rdtsc(unsigned long cycles)
{
unsigned long tsc = rdtsc();while (rdtsc() - tsc < cycles);
}void *big_madvise_thread(void *ign)
{
sync_step = 1;
while (sync_step != 2);
madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_DONTNEED);
}int main(void)
{
pthread_t aux_thread;p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);memset((void*)p, 8, PAGE_SIZE * N_PAGES);
pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
while (sync_step != 1);*p = 8; // Cache in TLB
sync_step = 2;
wait_rdtsc(100000);
madvise((void*)p, PAGE_SIZE, MADV_DONTNEED);
printf("data: %d (%s)\n", *p, (*p == 8 ? "stale, broken" : "cleared, fine"));
return *p == 8 ? -1 : 0;
}
---8
Reported-by: Nadav Amit
Cc: Andy Lutomirski
Signed-off-by: Andrew MortonSigned-off-by: Linus Torvalds
-
When walking the page tables to resolve an address that points to
!p*d_present() entry, huge_pte_offset() returns inconsistent values
depending on the level of page table (PUD or PMD).It returns NULL in the case of a PUD entry while in the case of a PMD
entry, it returns a pointer to the page table entry.A similar inconsitency exists when handling swap entries - returns NULL
for a PUD entry while a pointer to the pte_t is retured for the PMD
entry.Update huge_pte_offset() to make the behaviour consistent - return a
pointer to the pte_t for hugepage or swap entries. Only return NULL in
instances where we have a p*d_none() entry and the size parameter
doesn't match the hugepage size at this level of the page table.Document the behaviour to clarify the expected behaviour of this
function. This is to set clear semantics for architecture specific
implementations of huge_pte_offset().Discussions on the arm64 implementation of huge_pte_offset()
(http://www.spinics.net/lists/linux-mm/msg133699.html) showed that there
is benefit from returning a pte_t* in the case of p*d_none().The fault handling code in hugetlb_fault() can handle p*d_none() entries
and saves an extra round trip to huge_pte_alloc(). Other callers of
huge_pte_offset() should be ok as well.[punit.agrawal@arm.com: v2]
Link: http://lkml.kernel.org/r/20170725154114.24131-2-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal
Reviewed-by: Catalin Marinas
Reviewed-by: Mike Kravetz
Reviewed-by: Catalin Marinas
Acked-by: Michal Hocko
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Will Deacon
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
These functions are the only bits of generic code that use
{pud,pmd}_pfn() without checking for CONFIG_TRANSPARENT_HUGEPAGE. This
works fine on x86, the only arch with devmap support, since the *_pfn()
functions are always defined there, but this isn't true for every
architecture.Link: http://lkml.kernel.org/r/20170626063833.11094-1-oohall@gmail.com
Signed-off-by: Oliver O'Halloran
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mremap will attempt to create a 'duplicate' mapping if old_size == 0 is
specified. In the case of private mappings, mremap will actually create
a fresh separate private mapping unrelated to the original. This does
not fit with the design semantics of mremap as the intention is to
create a new mapping based on the original.Therefore, return EINVAL in the case where an attempt is made to
duplicate a private mapping. Also, print a warning message (once) if
such an attempt is made.Link: http://lkml.kernel.org/r/cb9d9f6a-7095-582f-15a5-62643d65c736@oracle.com
Signed-off-by: Mike Kravetz
Acked-by: Michal Hocko
Cc: Andrea Arcangeli
Cc: Aaron Lu
Cc: "Kirill A . Shutemov"
Cc: Vlastimil Babka
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
init_pages_in_zone() is run under zone->lock, which means a long lock
time and disabled interrupts on large machines. This is currently not
an issue since it runs early in boot, but a later patch will change
that.However, like other pfn scanners, we don't actually need zone->lock even
when other cpus are running. The only potentially dangerous operation
here is reading bogus buddy page owner due to race, and we already know
how to handle that. The worst that can happen is that we skip some
early allocated pages, which should not affect the debugging power of
page_owner noticeably.Link: http://lkml.kernel.org/r/20170720134029.25268-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Yang Shi
Cc: Laura Abbott
Cc: Vinayak Menon
Cc: zhong jiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
page_ext_init() can take long on large machines, so add a cond_resched()
point after each section is processed. This will allow moving the init
to a later point at boot without triggering lockup reports.Link: http://lkml.kernel.org/r/20170720134029.25268-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Yang Shi
Cc: Laura Abbott
Cc: Vinayak Menon
Cc: zhong jiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In init_pages_in_zone() we currently use the generic set_page_owner()
function to initialize page_owner info for early allocated pages. This
means we needlessly do lookup_page_ext() twice for each page, and more
importantly save_stack(), which has to unwind the stack and find the
corresponding stack depot handle. Because the stack is always the same
for the initialization, unwind it once in init_pages_in_zone() and reuse
the handle. Also avoid the repeated lookup_page_ext().This can significantly reduce boot times with page_owner=on on large
machines, especially for kernels built without frame pointer, where the
stack unwinding is noticeably slower.[vbabka@suse.cz: don't duplicate code of __set_page_owner(), per Michal Hocko]
[akpm@linux-foundation.org: coding-style fixes]
[vbabka@suse.cz: create statically allocated fake stack trace for early allocated pages, per Michal]
Link: http://lkml.kernel.org/r/45813564-2342-fc8d-d31a-f4b68a724325@suse.cz
Link: http://lkml.kernel.org/r/20170720134029.25268-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Yang Shi
Cc: Laura Abbott
Cc: Vinayak Menon
Cc: zhong jiang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit f52407ce2dea ("memory hotplug: alloc page from other node in
memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
aware allocations when there is some memory present because the
respective node might not have any memory yet at the time and so it
could fail or even OOM.Things have changed since then though. Zonelists are now always
initialized before we do any allocations even for hotplug (see
959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist")).Therefore these checks are not really needed. In fact caller of the
allocator should never care about whether the node is populated because
that might change at any time.Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Shaohua Li
Cc: Joonsoo Kim
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed though
because both memory online and offline are fully serialized. New users
have grown since then.Notably setup_per_zone_wmarks wants to prevent from races between memory
hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes"). Let's
add a private lock for that purpose. This will not prevent from seeing
halfway through memory hotplug operation but that shouldn't be a big
deal becuse memory hotplug will update watermarks explicitly so we will
eventually get a full picture. The lock just makes sure we won't race
when updating watermarks leading to weird results.Also __build_all_zonelists manipulates global data so add a private lock
for it as well. This doesn't seem to be necessary today but it is more
robust to have a lock there.While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Toshi Kani
Cc: Vlastimil Babka
Cc: Haicheng Li
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
build_all_zonelists has been (ab)using stop_machine to make sure that
zonelists do not change while somebody is looking at them. This is is
just a gross hack because a) it complicates the context from which we
can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
switch locking to a percpu rwsem")) and b) is is not really necessary
especially after "mm, page_alloc: simplify zonelist initialization" and
c) it doesn't really provide the protection it claims (see below).Updates of the zonelists happen very seldom, basically only when a zone
becomes populated during memory online or when it loses all the memory
during offline. A racing iteration over zonelists could either miss a
zone or try to work on one zone twice. Both of these are something we
can live with occasionally because there will always be at least one
zone visible so we are not likely to fail allocation too easily for
example.Please note that the original stop_machine approach doesn't really
provide a better exclusion because the iteration might be interrupted
half way (unless the whole iteration is preempt disabled which is not
the case in most cases) so the some zones could still be seen twice or a
zone missed.I have run the pathological online/offline of the single memblock in the
movable zone while stressing the same small node with some memory
pressure.Node 1, zone DMA
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 943, 943, 943)
Node 1, zone DMA32
pages free 227310
min 8294
low 10367
high 12440
spanned 262112
present 262112
managed 241436
protection: (0, 0, 0, 0)
Node 1, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
protection: (0, 0, 0, 1024)
Node 1, zone Movable
pages free 32722
min 85
low 117
high 149
spanned 32768
present 32768
managed 32768
protection: (0, 0, 0, 0)root@test1:/sys/devices/system/node/node1# while true
do
echo offline > memory34/state
echo online_movable > memory34/state
doneroot@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
and it survived without any unexpected behavior. While this is not
really a great testing coverage it should exercise the allocation path
quite a lot.Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Toshi Kani
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have
in each node we rely on the _zoneref to terminate initialized part of
the zonelist by a NULL zone. While this is functionally correct it is
quite suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns the
updated index for the next iteration. Let's rename the function to
build_zonerefs_node to better reflect its purpose and give it zoneref
array to update. The function doesn't the index anymore. It just
returns the number of added zones so that the caller can advance the
zonered array start for the next update.This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists. So the additional call is redundant. Even though
hotadd_new_pgdat will only initialize zonelists of the new node this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Toshi Kani
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
build_all_zonelists gets a zone parameter to initialize zone's pagesets.
There is only a single user which gives a non-NULL zone parameter and
that one doesn't really need the rest of the build_all_zonelists (see
commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
onlining pages")).Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly. This will also remove a pointless zonlists
rebuilding which is always good.Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Toshi Kani
Cc: Wen Congyang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__build_all_zonelists reinitializes each online cpu local node for
CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
memory less nodes could gain some memory during memory hotplug and so
the local node should be changed for CPUs close to such a node. It
makes less sense to do that unconditionally for a newly creaded NUMA
node which is still offline and without any memory.Let's also simplify the cpu loop and use for_each_online_cpu instead of
an explicit cpu_online check for all possible cpus.Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Shaohua Li
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
boot_pageset is a boot time hack which gets superseded by normal
pagesets later in the boot process. It makes zero sense to reinitialize
it again and again during memory hotplug.Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Shaohua Li
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "cleanup zonelists initialization", v1.
This is aimed at cleaning up the zonelists initialization code we have
but the primary motivation was bug report [2] which got resolved but the
usage of stop_machine is just too ugly to live. Most patches are
straightforward but 3 of them need a special consideration.Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept sysctl in place and warn into the log if somebody tries to
configure zone lists ordering. If somebody has a real usecase for it we
can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and updater which I _believe_
is acceptable as explained in the changelog. I hope I am not missing
anything.Patch 8 then removes zonelists_mutex which is kind of ugly as well and
not really needed AFAICS but a care should be taken when double checking
my thinking.This patch (of 9):
Supporting zone ordered zonelists costs us just a lot of code while the
usefulness is arguable if existent at all. Mel has already made node
ordering default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fallback to a
different NUMA node rather than consume precious lowmem zones.This argument is, however, weaken by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means that
lowmem requests have to skip over all highmem pages on LRUs already and
so zone ordering doesn't save the reclaim time much. So the only
advantage of the zone ordering is under a light memory pressure when
highmem requests do not ever hit into lowmem zones and the lowmem
pressure doesn't need to reclaim.Considering that 32b NUMA systems are rather suboptimal already and it
is generally advisable to use 64b kernel on such a HW I believe we
should rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if
somebody tries to set zone ordering either from kernel command line or
the sysctl.[mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Joonsoo Kim
Cc: Shaohua Li
Cc: Toshi Kani
Cc: Abdul Haleem
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch adds document and kconfig for using of writeback feature.
Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch enables read IO from backing device. For the feature, it
implements two IO read functions to transfer data from backing storage.One is asynchronous IO function and other is synchronous one.
A reason I need synchrnous IO is due to partial write which need to
complete read IO before the overwriting partial data.We can make the partial IO's case asynchronous, too but at the moment, I
don't feel adding more complexity to support such rare use cases so want
to go with simple.[xieyisheng1@huawei.com: read_from_bdev_async(): return 1 to avoid call page_endio() in zram_rw_page()]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Signed-off-by: Yisheng Xie
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This patch enables write IO to transfer data to backing device. For
that, it implements write_to_bdev function which creates new bio and
chaining with parent bio to make the parent bio asynchrnous.For rw_page which don't have parent bio, it submit owned bio and handle
IO completion by zram_page_end_io.Also, this patch defines new flag ZRAM_WB to mark written page for later
read IO.[xieyisheng1@huawei.com: fix typo in comment]
Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Signed-off-by: Yisheng Xie
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For upcoming asynchronous IO like writeback, zram_rw_page should be
aware of that whether requested IO was completed or submitted
successfully, otherwise error.For the goal, zram_bvec_rw has three return values.
-errno: returns error number
0: IO request is done synchronously
1: IO request is issued successfully.Link: http://lkml.kernel.org/r/1498459987-24562-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
With backing device, zram needs management of free space of backing
device.This patch adds bitmap logic to manage free space which is very naive.
However, it would be simple enough as considering uncompressible pages's
frequenty in zram.Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For writeback feature, user should set up backing device before the zram
working.This patch enables the interface via /sys/block/zramX/backing_dev.
Currently, it supports block device only but it could be enhanced for
file as well.Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zram_decompress_page naming is not proper because it doesn't decompress
if page was dedup hit or stored with compression.Use more abstract term and consistent with write path function
__zram_bvec_write.Link: http://lkml.kernel.org/r/1498459987-24562-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zram_compress does several things, compress, entry alloc and check
limitation. I did for just readbility but it hurts modulization.:(So this patch removes zram_compress functions and inline it in
__zram_bvec_write for upcoming patches.Link: http://lkml.kernel.org/r/1498459987-24562-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Juneho Choi
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "writeback incompressible pages to storage", v1.
zRam is useful for memory saving with compressible pages but sometime,
workload can be changed and system has lots of incompressible pages
which is very harmful for zram.This patch supports writeback feature of zram so admin can set up a
block device and with it, zram can save the memory via writing out the
incompressile pages once it found it's incompressible pages (1/4 comp
ratio) instead of keeping the page in memory.[1-3] is just clean up and [4-8] is step by step feature enablement.
[4-8] is logically not bisectable(ie, logical unit separation)
although I tried to compiled out without breaking but I think it would
be better to review.This patch (of 9):
__zram_bvec_write has some of duplicated logic for zram meta data
handling of same_page|compressed_page. This patch aims to clean it up
without behavior change.[xieyisheng1@huawei.com: fix compr_data_size stat]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/1498459987-24562-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Signed-off-by: Yisheng Xie
Reviewed-by: Sergey Senozhatsky
Cc: Juneho Choi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Historically we have enforced that any kernel zone (e.g ZONE_NORMAL) has
to precede the Movable zone in the physical memory range. The purpose
of the movable zone is, however, not bound to any physical memory
restriction. It merely defines a class of migrateable and reclaimable
memory.There are users (e.g. CMA) who might want to reserve specific physical
memory ranges for their own purpose. Moreover our pfn walkers have to
be prepared for zones overlapping in the physical range already because
we do support interleaving NUMA nodes and therefore zones can interleave
as well. This means we can allow each memory block to be associated
with a different zone.Loosen the current onlining semantic and allow explicit onlining type on
any memblock. That means that online_{kernel,movable} will be allowed
regardless of the physical address of the memblock as long as it is
offline of course. This might result in moveble zone overlapping with
other kernel zones. Default onlining then becomes a bit tricky but
still sensible. echo online > memoryXY/state will online the given
block to1) the default zone if the given range is outside of any zone
2) the enclosing zone if such a zone doesn't interleave with
any other zone
3) the default zone if more zones interleave for this rangewhere default zone is movable zone only if movable_node is enabled
otherwise it is a kernel zone.Here is an example of the semantic with (movable_node is not present but
it work in an analogous way). We start with following memblocks, all of
them offline:memory34/valid_zones:Normal Movable
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Normal Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal Movable
memory40/valid_zones:Normal Movable
memory41/valid_zones:Normal MovableNow, we online block 34 in default mode and block 37 as movable
root@test1:/sys/devices/system/node/node1# echo online > memory34/state
root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal Movable
memory40/valid_zones:Normal Movable
memory41/valid_zones:Normal MovableAs we can see all other blocks can still be onlined both into Normal and
Movable zones and the Normal is default because the Movable zone spans
only block37 now.root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Movable Normal
memory39/valid_zones:Movable Normal
memory40/valid_zones:Movable Normal
memory41/valid_zones:MovableNow the default zone for blocks 37-41 has changed because movable zone
spans that range.root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
memory34/valid_zones:Normal
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Movable
memory38/valid_zones:Normal Movable
memory39/valid_zones:Normal
memory40/valid_zones:Movable Normal
memory41/valid_zones:MovableNote that the block 39 now belongs to the zone Normal and so block38
falls into Normal by default as well.For completness
root@test1:/sys/devices/system/node/node1# for i in memory[34]?
do
echo online > $i/state 2>/dev/null
donememory34/valid_zones:Normal
memory35/valid_zones:Normal
memory36/valid_zones:Normal
memory37/valid_zones:Movable
memory38/valid_zones:Normal
memory39/valid_zones:Normal
memory40/valid_zones:Movable
memory41/valid_zones:MovableImplementation wise the change is quite straightforward. We can get rid
of allow_online_pfn_range altogether. online_pages allows only offline
nodes already. The original default_zone_for_pfn will become
default_kernel_zone_for_pfn. New default_zone_for_pfn implements the
above semantic. zone_for_pfn_range is slightly reorganized to implement
kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
catch all default behavior.Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Acked-by: Reza Arbab
Cc: Mel Gorman
Cc: Andrea Arcangeli
Cc: Yasuaki Ishimatsu
Cc: Xishi Qiu
Cc: Kani Toshimitsu
Cc:
Cc: Daniel Kiper
Cc: Igor Mammedov
Cc: Vitaly Kuznetsov
Cc: Wei Yang
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") we used to allow to change the
valid zone types of a memory block if it is adjacent to a different zone
type.This fact was reflected in memoryNN/valid_zones by the ordering of
printed zones. The first one was default (echo online > memoryNN/state)
and the other one could be onlined explicitly by online_{movable,kernel}.This behavior was removed by the said patch and as such the ordering was
not all that important. In most cases a kernel zone would be default
anyway. The only exception is movable_node handled by "mm,
memory_hotplug: support movable_node for hotpluggable nodes".Let's reintroduce this behavior again because later patch will remove
the zone overlap restriction and so user will be allowed to online
kernel resp. movable block regardless of its placement. Original
behavior will then become significant again because it would be
non-trivial for users to see what is the default zone to online into.Implementation is really simple. Pull out zone selection out of
move_pfn_range into zone_for_pfn_range helper and use it in
show_valid_zones to display the zone for default onlining and then both
kernel and movable if they are allowed. Default online zone is not
duplicated.Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: Andrea Arcangeli
Cc: Reza Arbab
Cc: Yasuaki Ishimatsu
Cc: Xishi Qiu
Cc: Kani Toshimitsu
Cc:
Cc: Daniel Kiper
Cc: Igor Mammedov
Cc: Vitaly Kuznetsov
Cc: Wei Yang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 9adb62a5df9c ("mm/hotplug: correctly setup fallback zonelists
when creating new pgdat") tries to build the correct zonelist for a
newly added node, while it is not necessary to rebuild it for already
exist nodes.In build_zonelists(), it will iterate on nodes with memory. For a newly
added node, it will have memory until node_states_set_node() is called
in online_pages().This patch avoids rebuilding the zonelists for already existing nodes.
build_zonelists_node() uses managed_zone(zone) checks, so it should not
include empty zones anyway. So effectively we avoid some pointless work
under stop_machine().[akpm@linux-foundation.org: tweak comment text]
[akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Jiang Liu
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
shrink_slab() allows us to report back the number of objects we
successfully scanned (out of the target shrinkctl->nr_to_scan). As
report the number of pages owned by each GEM object as a separate item
to the shrinker, we cannot precisely control the number of shrinker
objects we scan on each pass; and indeed may free more than requested.
If we fail to tell the shrinker about the number of objects we process,
it will continue to hold a grudge against us as any objects left
unscanned are added to the next reclaim -- and so we will keep on
"unfairly" shrinking our own slab in comparison to other slabs.Link: http://lkml.kernel.org/r/20170822135325.9191-2-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson
Cc: Joonas Lahtinen
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Hillf Danton
Cc: Minchan Kim
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Some shrinkers may only be able to free a bunch of objects at a time,
and so free more than the requested nr_to_scan in one pass.Whilst other shrinkers may find themselves even unable to scan as many
objects as they counted, and so underreport. Account for the extra
freed/scanned objects against the total number of objects we intend to
scan, otherwise we may end up penalising the slab far more than
intended. Similarly, we want to add the underperforming scan to the
deferred pass so that we try harder and harder in future passes.Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Hillf Danton
Cc: Minchan Kim
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Shaohua Li
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Joonas Lahtinen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add an assertion similar to "fasttop" check in GNU C Library allocator
as a part of SLAB_FREELIST_HARDENED feature. An object added to a
singly linked freelist should not point to itself. That helps to detect
some double free errors (e.g. CVE-2017-2636) without slub_debug and
KASAN.Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com
Signed-off-by: Alexander Popov
Acked-by: Christoph Lameter
Cc: Kees Cook
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Paul E McKenney
Cc: Ingo Molnar
Cc: Tejun Heo
Cc: Andy Lutomirski
Cc: Nicolas Pitre
Cc: Rik van Riel
Cc: Tycho Andersen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This SLUB free list pointer obfuscation code is modified from Brad
Spengler/PaX Team's code in the last public patch of grsecurity/PaX
based on my understanding of the code. Changes or omissions from the
original code are mine and don't reflect the original grsecurity/PaX
code.This adds a per-cache random value to SLUB caches that is XORed with
their freelist pointer address and value. This adds nearly zero
overhead and frustrates the very common heap overflow exploitation
method of overwriting freelist pointers.A recent example of the attack is written up here:
http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
and there is a section dedicated to the technique the book "A Guide to
Kernel Exploitation: Attacking the Core".This is based on patches by Daniel Micay, and refactored to minimize the
use of #ifdef.With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
run times:before:
mean 10.11882499999999999995
variance .03320378329145728642
stdev .18221905304181911048after:
mean 10.12654000000000000014
variance .04700556623115577889
stdev .21680767106160192064The difference gets lost in the noise, but if the above is to be taken
literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
Signed-off-by: Kees Cook
Suggested-by: Daniel Micay
Cc: Rik van Riel
Cc: Tycho Andersen
Cc: Alexander Popov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
- free_kmem_cache_nodes() frees the cache node before nulling out a
reference to it- init_kmem_cache_nodes() publishes the cache node before initializing
itNeither of these matter at runtime because the cache nodes cannot be
looked up by any other thread. But it's neater and more consistent to
reorder these.Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com
Signed-off-by: Alexander Potapenko
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
clean up some unused functions and parameters.
Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao
Reviewed-by: Alex Chen
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The function is never called outside of fs/ocfs2/acl.c.
Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
Signed-off-by: Jan Kara
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is code duplication between sec_name() and sech_name(). Simplify
sec_name() by re-using sech_name(). Also, move them up to remove the
forward declaration of sec_name().Link: http://lkml.kernel.org/r/1502248721-22009-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada
Reviewed-by: Kees Cook
Cc: Nicholas Piggin
Cc: Jessica Yu
Cc: Chris Metcalf
Cc: Heinrich Schuchardt
Cc: Ingo Molnar
Cc: Ard Biesheuvel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dax_pmd_insert_mapping() contains the following code:
pfn_t pfn;
if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
goto fallback;
/* ... */
fallback:
trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);When the condition in the if statement fails, the function calls
trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.This issue has been found while building the kernel with clang. The
compiler reported:fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/dax.c:1310:60: note: uninitialized use occurs here
trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
^~~Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss
Reviewed-by: Ross Zwisler
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds