20 Jan, 2017
1 commit
-
commit f05714293a591038304ddae7cb0dd747bb3786cc upstream.
During development of zram-swap asynchronous writeback, I found strange
corruption of a compressed page, resulting in:

Modules linked in: zram(E)
CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G E 4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88007620b840 task.stack: ffff880078090000
RIP: set_freeobj.part.43+0x1c/0x1f
RSP: 0018:ffff880078093ca8 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
Call Trace:
obj_malloc+0x22b/0x260
zs_malloc+0x1e4/0x580
zram_bvec_rw+0x4cd/0x830 [zram]
page_requests_rw+0x9c/0x130 [zram]
zram_thread+0xe6/0x173 [zram]
kthread+0xca/0xe0
ret_from_fork+0x25/0x30

Investigation revealed that stable pages currently do not cover anonymous
pages. IOW, reuse_swap_page can reuse the page without waiting for
writeback completion, so it can overwrite the page zram is compressing.

Unfortunately, zram has used the per-cpu stream feature since v4.7.
It aims to increase the cache hit ratio of the scratch buffer used for
compression. The downside of that approach is that zram must ask for
memory space for the compressed page in per-cpu context, which requires a
strict gfp flag and can therefore fail. If it does, zram retries the
allocation out of per-cpu context, where it may get memory this time,
compresses the data again, and copies it into the new memory space.

In this scenario, zram assumes the data never changes, but that is not
true unless stable pages are supported. So, if the data is changed under
us, zram can cause a buffer overrun: the second compression size can be
bigger than the one we got in the previous trial, and blindly copying the
bigger object into the smaller buffer overruns it. The overrun breaks
zsmalloc free-object chaining, so the system crashes as above.

I think below is the same problem:
https://bugzilla.suse.com/show_bug.cgi?id=997574

Unfortunately, reuse_swap_page should be atomic, so we cannot wait on
writeback there; the approach in this patch is to simply return false if
we find the page needs a stable page. Although this increases the memory
footprint temporarily, it happens rarely and the memory should be
reclaimed easily even when it does happen. It is also better than waiting
for IO completion, which is a critical path for application latency.

Fixes: da9556a2367c ("zram: user per-cpu compression streams")
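For reference, a minimal sketch of the kind of check described above, assuming a SWP_STABLE_WRITES-style flag set at swapon time for backends that read the page during writeback (names and placement are illustrative, not the exact patch):

    /* in reuse_swap_page(): do not reuse while a stable page is required */
    if (count == 1 && PageWriteback(page)) {
            struct swap_info_struct *sis = page_swap_info(page);

            if (sis->flags & SWP_STABLE_WRITES)
                    return false;   /* keep the old copy until writeback finishes */
    }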
Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Cc: Sergey Senozhatsky
Cc: Darrick J. Wong
Cc: Takashi Iwai
Cc: Hyeoncheol Lee
Cc:
Cc: Sangseok Lee
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
12 Nov, 2016
1 commit
-
When root activates a swap partition whose header has the wrong
endianness, nr_badpages elements of badpages are swabbed before
nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

This normally is not a security issue because it can only be exploited
by root (more specifically, a process with CAP_SYS_ADMIN or the ability
to modify a swap file/partition), and such a process can already e.g.
modify swapped-out memory of any other userspace process on the system.

Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
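A hedged sketch of the shape of the fix (bounding nr_badpages before the swab loop that it sizes; exact placement is assumed, not quoted from the patch):

    if (swab32(swap_header->info.version) == 1) {
            swab32s(&swap_header->info.version);
            swab32s(&swap_header->info.last_page);
            swab32s(&swap_header->info.nr_badpages);
            /* validate before trusting nr_badpages as a loop bound */
            if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
                    return 0;
            for (i = 0; i < swap_header->info.nr_badpages; i++)
                    swab32s(&swap_header->info.badpages[i]);
    }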
Signed-off-by: Jann Horn
Acked-by: Kees Cook
Acked-by: Jerome Marchand
Acked-by: Johannes Weiner
Cc: "Kirill A. Shutemov"
Cc: Vlastimil Babka
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Oct, 2016
2 commits
-
This patch is to improve the performance of swap cache operations when
the type of the swap device is not 0. Originally, the whole swap entry
value is used as the key of the swap cache, even though there is one
radix tree for each swap device. If the type of the swap device is not
0, the height of the radix tree of the swap cache will be increased
unnecessarily, especially on 64-bit architectures. For example, for a 1GB
swap device on the x86_64 architecture, the height of the radix tree of
the swap cache is 11. But if the offset of the swap entry is used as
the key of the swap cache, the height of the radix tree of the swap
cache is 4. The increased height causes unnecessary radix tree
descending and increased cache footprint.

This patch reduces the height of the radix tree of the swap cache via
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache. In 32 processes sequential swap out test
case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
when the type of the swap device is 1.

Use the whole swap entry as key,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

Use the swap offset as key,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
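A hedged sketch of the keying change (per-type radix tree indexed by the swap offset instead of the whole entry value; simplified from an add_to_swap_cache-style path):

    struct address_space *address_space = swap_address_space(entry);
    pgoff_t idx = swp_offset(entry);        /* previously entry.val was used as the key */

    error = radix_tree_insert(&address_space->page_tree, idx, page);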
Signed-off-by: "Huang, Ying"
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: "Kirill A. Shutemov"
Cc: Dave Hansen
Cc: Dan Williams
Cc: Joonsoo Kim
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Aaron Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is a code clean up patch without functionality changes. The
swap_cluster_list data structure and its operations are introduced to
provide some better encapsulation for the free cluster and discard
cluster list operations. This avoids some code duplication, improves
code readability, and reduces the total line count.

[akpm@linux-foundation.org: coding-style fixes]
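For reference, a sketch of the encapsulation this introduces (field and helper names assumed):

    struct swap_cluster_list {
            struct swap_cluster_info head;
            struct swap_cluster_info tail;
    };

    /* shared by si->free_clusters and si->discard_clusters, manipulated via
     * small helpers such as cluster_list_add_tail() / cluster_list_del_first() */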
Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Acked-by: Minchan Kim
Acked-by: Rik van Riel
Cc: Tim Chen
Cc: Hugh Dickins
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Sep, 2016
1 commit
-
Commit 62c230bc1790 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages") replaced the
swap_aops dirty hook from __set_page_dirty_no_writeback() with
swap_set_page_dirty().

For normal cases without these special SWP flags, the code path falls back
to __set_page_dirty_no_writeback(), so the behaviour is expected to be the
same as before.

But swap_set_page_dirty() makes use of the page_swap_info() helper to
get the swap_info_struct to check for the flags like SWP_FILE,
SWP_BLKDEV etc as desired for those features. This helper has
BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
set_page_dirty_lock() path.

For the set_page_dirty() path, which is often called from irq context,
kswapd can toggle the flag behind our back while the call is executing,
when the system is low on memory and heavy swapping is ongoing. This ends
up in an undesired kernel panic.
This patch just moves the check out of the helper and into its users as
appropriate, to fix the kernel panic for the described path. A couple of
users of the helper already take care of the SwapCache condition, so I
skipped them.

Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
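A hedged sketch of the caller-side check described above (the PageSwapCache assertion moves out of page_swap_info() and into a path that holds the page locked; details assumed):

    int swap_set_page_dirty(struct page *page)
    {
            struct swap_info_struct *sis = page_swap_info(page);

            if (sis->flags & SWP_FILE) {
                    struct address_space *mapping = sis->swap_file->f_mapping;

                    VM_BUG_ON_PAGE(!PageSwapCache(page), page);
                    return mapping->a_ops->set_page_dirty(page);
            }
            return __set_page_dirty_no_writeback(page);
    }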
Signed-off-by: Santosh Shilimkar
Cc: Mel Gorman
Cc: Joe Perches
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: David S. Miller
Cc: Jens Axboe
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Al Viro
Cc: [4.7.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jul, 2016
1 commit
-
I have noticed that frontswap.h first declares "frontswap_enabled" as
extern bool variable, and then overrides it with "#define
frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled. The
bool variable isn't actually instantiated anywhere.

This all looks like an unfinished attempt to make frontswap_enabled
reflect whether a backend is instantiated. But in the current state,
all frontswap hooks call unconditionally into frontswap.c just to check
if frontswap_ops is non-NULL. This should at least be checked inline,
but we can further eliminate the overhead when CONFIG_FRONTSWAP is
enabled and no backend registered, using a static key that is initially
disabled, and gets enabled only upon first backend registration.

Thus, checks for "frontswap_enabled" are replaced with
"frontswap_enabled()" wrapping the static key check. There are two
exceptions:

- xen's selfballoon_process() was testing frontswap_enabled in code guarded
  by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
  The patch just removes this check. Using frontswap_enabled() does not sound
  correct here, as this can be true even without xen's own backend being
  registered.

- in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
  as it seems the bitmap allocation cannot currently be postponed until a
  backend is registered. This means that frontswap will still have some
  memory overhead by being configured, but without a backend.

After the patch, we can expect that some functions in frontswap.c are
called only when frontswap_ops is non-NULL. Change the checks there to
VM_BUG_ONs. While at it, convert other BUG_ONs to VM_BUG_ONs as
frontswap has been stable for some time.

[akpm@linux-foundation.org: coding-style fixes]
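The static-key arrangement sketched from the description above (declaration plus the registration-time enable; a sketch, not necessarily the exact patch):

    /* include/linux/frontswap.h */
    extern struct static_key_false frontswap_enabled_key;

    static inline bool frontswap_enabled(void)
    {
            return static_branch_unlikely(&frontswap_enabled_key);
    }

    /* mm/frontswap.c, on first backend registration */
    static_branch_inc(&frontswap_enabled_key);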
Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Cc: David Vrabel
Cc: Juergen Gross
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 May, 2016
1 commit
-
This will provide full accuracy to the mapcount calculation in
write-protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults, where it is strictly
needed, due to its higher runtime cost.

This also provides, at practically zero cost, the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.

Dean Luick noticed an uninitialized variable that could result in an
rmap inefficiency for the non-THP case in a previous version.

Mike Marciniszyn said:
: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc97 ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data. The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (
Reviewed-by: "Kirill A. Shutemov"
Reviewed-by: Dean Luick
Tested-by: Alex Williamson
Tested-by: Mike Marciniszyn
Tested-by: Josh Collier
Cc: Marc Haber
Cc: [4.5]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Apr, 2016
1 commit
-
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized, and unlikely ever will.

We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straightforward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed in a separate patch.

virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
22 Mar, 2016
1 commit
-
Pull drm updates from Dave Airlie:
"This is the main drm pull request for 4.6 kernel.

Overall the coolest thing here for me is the nouveau maxwell signed
firmware support from NVidia, it's taken a long while to extract this
from them.

I also wish the ARM vendors just designed one set of display IP, ARM
display block proliferation is definitely increasing.

Core:
- drm_event cleanups
- Internal API cleanup making mode_fixup optional.
- Apple GMUX vga switcheroo support.
- DP AUX testing interface

Panel:
- Refactoring of DSI core for use over more transports.

New driver:
- ARM hdlcd driver

i915:
- FBC/PSR (framebuffer compression, panel self refresh) enabled by default.
- Ongoing atomic display support work
- Ongoing runtime PM work
- Pixel clock limit checks
- VBT DSI description support
- GEM fixes
- GuC firmware scheduler enhancements

amdkfd:
- Deferred probing fixes to avoid make file or link ordering.

amdgpu/radeon:
- ACP support for i2s audio support.
- Command Submission/GPU scheduler/GPUVM optimisations
- Initial GPU reset support for amdgpu

vmwgfx:
- Support for DX10 gen mipmaps
- Pageflipping and other fixes.

exynos:
- Exynos5420 SoC support for FIMD
- Exynos5422 SoC support for MIPI-DSI

nouveau:
- GM20x secure boot support - adds acceleration for Maxwell GPUs.
- GM200 support
- GM20B clock driver support
- Power sensors work

etnaviv:
- Correctness fixes for GPU cache flushing
- Better support for i.MX6 systems.

imx-drm:
- VBlank IRQ support
- Fence support
- OF endpoint support

msm:
- HDMI support for 8996 (snapdragon 820)
- Adreno 430 support
- Timestamp queries support

virtio-gpu:
- Fixes for Android support.

rockchip:
- Add support for Innosilicon HDMI

rcar-du:
- Support for 4 crtcs
- R8A7795 support
- RCar Gen 3 support

omapdrm:
- HDMI interlace output support
- dma-buf import support
- Refactoring to remove a lot of legacy code.

tilcdc:
- Rewrite of pageflipping code
- dma-buf support
- pinctrl support

vc4:
- HDMI modesetting bug fixes
- Significant 3D performance improvement.

fsl-dcu (FreeScale):
- Lots of fixes

tegra:
- Two small fixes

sti:
- Atomic support for planes
- Improved HDMI support"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits)
drm/amdgpu: release_pages requires linux/pagemap.h
drm/sti: restore mode_fixup callback
drm/amdgpu/gfx7: add MTYPE definition
drm/amdgpu: removing BO_VAs shouldn't be interruptible
drm/amd/powerplay: show uvd/vce power gate enablement for tonga.
drm/amd/powerplay: show uvd/vce power gate info for fiji
drm/amdgpu: use sched fence if possible
drm/amdgpu: move ib.fence to job.fence
drm/amdgpu: give a fence param to ib_free
drm/amdgpu: include the right version of gmc header files for iceland
drm/radeon: fix indentation.
drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ
drm/amdgpu: switch back to 32bit hw fences v2
drm/amdgpu: remove amdgpu_fence_is_signaled
drm/amdgpu: drop the extra fence range check v2
drm/amdgpu: signal fences directly in amdgpu_fence_process
drm/amdgpu: cleanup amdgpu_fence_wait_empty v2
drm/amdgpu: keep all fences in an RCU protected array v2
drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring
drm/amdgpu: RCU protected amd_sched_fence_release
...
18 Mar, 2016
1 commit
-
Kernel style prefers a single string over split strings when the string is
'user-visible'.

Miscellanea:
- Add a missing newline
- Realign arguments

Signed-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Feb, 2016
1 commit
-
- support for v3 vbt dsi blocks (Jani)
- improve mmio debug checks (Mika Kuoppala)
- reorg the ddi port translation table entries and related code (Ville)
- reorg gen8 interrupt handling for future platforms (Tvrtko)
- refactor tile width/height computations for framebuffers (Ville)
- kerneldoc integration for intel_pm.c (Jani)
- move default context from engines to device-global dev_priv (Dave Gordon)
- make seqno/irq ordering coherent with execlist (Chris)
- decouple internal engine number from UABI (Chris&Tvrtko)
- tons of small fixes all over, as usual

* tag 'drm-intel-next-2016-01-24' of git://anongit.freedesktop.org/drm-intel: (148 commits)
drm/i915: Update DRIVER_DATE to 20160124
drm/i915: Seal busy-ioctl uABI and prevent leaking of internal ids
drm/i915: Decouple execbuf uAPI from internal implementation
drm/i915: Use ordered seqno write interrupt generation on gen8+ execlists
drm/i915: Limit the auto arming of mmio debugs on vlv/chv
drm/i915: Tune down "GT register while GT waking disabled" message
drm/i915: tidy up a few leftovers
drm/i915: abolish separate per-ring default_context pointers
drm/i915: simplify allocation of driver-internal requests
drm/i915: Fix NULL plane->fb oops on SKL
drm/i915: Do not put big intel_crtc_state on the stack
Revert "drm/i915: Add two-stage ILK-style watermark programming (v10)"
drm/i915: add DOC: headline to RC6 kernel-doc
drm/i915: turn some bogus kernel-doc comments to normal comments
drm/i915/sdvo: revert bogus kernel-doc comments to normal comments
drm/i915/gen9: Correct max save/restore register count during gpu reset with GuC
drm/i915: Demote user facing DMC firmware load failure message
drm/i915: use hlist_for_each_entry
drm/i915: skl_update_scaler() wants a rotation bitmask instead of bit number
drm/i915: Don't reject primary plane windowing with color keying enabled on SKL+
...
23 Jan, 2016
1 commit
-
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro
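The wrappers are of the following shape (a sketch based on the description above; the real set also covers the _nested and is_locked variants):

    static inline void inode_lock(struct inode *inode)
    {
            mutex_lock(&inode->i_mutex);
    }

    static inline void inode_unlock(struct inode *inode)
    {
            mutex_unlock(&inode->i_mutex);
    }

    static inline int inode_trylock(struct inode *inode)
    {
            return mutex_trylock(&inode->i_mutex);
    }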
21 Jan, 2016
2 commits
-
Swap cache pages are freed aggressively if swap is nearly full (>50%
currently), because otherwise we are likely to stop scanning anonymous
pages when we near the swap limit, even if there are plenty of freeable
swap cache pages. We should follow the same trend in the case of the
memory cgroup, which has its own swap limit.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patchset introduces swap accounting to cgroup2.
This patch (of 7):
In the legacy hierarchy we charge memsw, which is dubious, because:
- memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

- memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.

This patch adds a mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left intact.

The swap usage can be monitored using the new memory.swap.current file and
limited using memory.swap.max.

Note, to charge swap resource properly in the unified hierarchy, we have
to make swap_entry_free uncharge swap only when ->usage reaches zero, not
just ->count, i.e. when all references to a swap entry, including the one
taken by swap cache, are gone. This is necessary, because otherwise
swap-in could result in uncharging swap even if the page is still in swap
cache and hence still occupies a swap entry. At the same time, this
shouldn't break memsw counter logic, where a page is never charged twice
for using both memory and swap, because in case of legacy hierarchy we
uncharge swap on commit (see mem_cgroup_commit_charge).

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jan, 2016
4 commits
-
Both s390 and powerpc have hit the issue of swapoff hanging, when
CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
quite as x86_64 had them. I think it would be much clearer if
HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
determine whether the MEM_SOFT_DIRTY option should be offered, and the
actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.

But won't embark on that change myself: instead make swapoff more
robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY. That being a no-op,
whether the bit in question is defined as 0 or the asm-generic fallback
is used, unless soft dirty is fully turned on.

Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().
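A sketch of the renamed helper as described (clear the soft-dirty bit unconditionally, which is a no-op unless soft dirty is fully enabled):

    static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
    {
            return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
    }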
Signed-off-by: Hugh Dickins
Reviewed-by: Aneesh Kumar K.V
Acked-by: Cyrill Gorcunov
Cc: Laurent Dufour
Cc: Michael Ellerman
Cc: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
With new refcounting we will be able to map the same compound page with
PTEs and PMDs. It requires adjustment to the conditions under which we can
reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD we can only reuse the page if nobody else maps the huge page or
its part. We could do it by checking page_mapcount() on each sub-page,
but that's expensive.

The cheaper way is to check that page_count() is equal to 1: every mapcount
takes a page reference, so this way we can guarantee that the PMD is the
only mapping.

This approach can give a false negative if somebody pinned the page, but
that doesn't affect correctness.

Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
As with rmap, with new refcounting we cannot rely on PageTransHuge() to
check if we need to charge the size of a huge page to the cgroup. We need
to get information from the caller to know whether it was mapped with a
PMD or a PTE.

We uncharge when the last reference on the page is gone. At that point, if
we see PageTransHuge() it means we need to uncharge the whole huge page.

The tricky part is partial unmap -- when we try to unmap part of a huge
page. We don't do any special handling of this situation, meaning we don't
uncharge the part of the huge page unless the last user is gone or
split_huge_page() is triggered. In case cgroup memory pressure happens,
the partially unmapped page will be split through the shrinker. This
should be good enough.

Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Vlastimil Babka
Acked-by: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We're going to allow mapping of individual 4k pages of a THP compound
page. It means we cannot rely on the PageTransHuge() check to decide
whether to map/unmap a small page or the THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to operate on the whole compound page or only the small page.

[n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
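Illustrative signatures with the new compound argument (a sketch; the actual set of converted rmap functions is broader than shown here):

    void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
                            unsigned long address, bool compound);
    void page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma,
                                unsigned long address, bool compound);
    void page_remove_rmap(struct page *page, bool compound);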
Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Vlastimil Babka
Acked-by: Jerome Marchand
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Jan, 2016
2 commits
-
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
To make the intention clearer, use list_{next,first}_entry instead of
list_entry().

Signed-off-by: Geliang Tang
Cc: "Kirill A. Shutemov"
Cc: Jerome Marchand
Cc: Vlastimil Babka
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jan, 2016
1 commit
-
Some modules, like i915.ko, use swappable objects and may try to swap
them out under memory pressure (via the shrinker). Before doing so, they
want to check using get_nr_swap_pages() to see if any swap space is
available as otherwise they will waste time purging the object from the
device without recovering any memory for the system. This requires the
nr_swap_pages counter to be exported to the modules.

Signed-off-by: Chris Wilson
Cc: "Goel, Akash"
Cc: Johannes Weiner
Cc: linux-mm@kvack.org
Link: http://patchwork.freedesktop.org/patch/msgid/1449244734-25733-1-git-send-email-chris@chris-wilson.co.uk
Acked-by: Andrew Morton
Acked-by: Johannes Weiner
Signed-off-by: Daniel Vetter
09 Sep, 2015
1 commit
-
We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.

On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how much memory each process consumes (ie, rss
+ swap) if the system has lots of shared anonymous memory (e.g, android).

This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get a
more exact workingset size per process.

Bongkyu tested it. Result is below.
1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184

2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744

[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: Minchan Kim
Reported-by: Bongkyu Kim
Tested-by: Bongkyu Kim
Cc: Hugh Dickins
Cc: Sergey Senozhatsky
Cc: Jonathan Corbet
Cc: Jerome Marchand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Aug, 2015
1 commit
-
While running KernelThreadSanitizer (ktsan) on upstream kernel with
trinity, we got a few reports from SyS_swapon, here is one of them:

Read of size 8 by thread T307 (K7621):
[< inlined >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395
[] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345
[] ia32_do_call+0x1b/0x25

Looks like the swap_lock should be taken when iterating through the
swap_info array on lines 2392 - 2401: q->swap_file may be reset to
NULL by another thread before it is dereferenced for f_mapping.

But why is that iteration needed at all? Doesn't the claim_swapfile()
which follows do all that is needed to check for a duplicate entry -
FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile?

Well, not quite: bd_may_claim() allows the same "holder" to claim the
bdev again, so we do need to use a different holder than "sys_swapon";
and we should not replace appropriate -EBUSY by inappropriate -EINVAL.

Index i was reused in a cpu loop further down: renamed cpu there.
Reported-by: Andrey Konovalov
Signed-off-by: Hugh Dickins
Signed-off-by: Al Viro
24 Jun, 2015
1 commit
-
Turn
seq_path(..., &file->f_path, ...);
into
seq_file_path(..., file, ...);

Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro
16 Apr, 2015
1 commit
-
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE and uses the new
READ_ONCE API for the read accesses. This makes things cleaner, instead
of using separate/multiple sets of APIs.

Signed-off-by: Jason Low
Acked-by: Michal Hocko
Acked-by: Davidlohr Bueso
Acked-by: Rik van Riel
Reviewed-by: Christian Borntraeger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Dec, 2014
1 commit
-
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: David S. Miller
Acked-by: KAMEZAWA Hiroyuki
Cc: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Aug, 2014
2 commits
-
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages. However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed. This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped. Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge. The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache(). However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration. Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration. Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner
Cc: Hugh Dickins
Cc: Tejun Heo
Cc: Vladimir Davydov
Tested-by: Jet Chen
Acked-by: Michal Hocko
Tested-by: Felipe Balbi
Signed-off-by: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:

15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree

As you can see, the memcg footprint has shrunk quite a bit.

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

This patch (of 4):
The memcg charge API charges pages before they are rmapped - i.e. have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on. Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
pages from the memcg if necessary.

mem_cgroup_commit_charge() commits the page to the charge once it
has a valid page->mapping and PageAnon() reliably tells the type.

mem_cgroup_cancel_charge() aborts the transaction.
This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again. Revive lru_cache_add_active_or_unevictable().

[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
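A hedged sketch of how a callsite uses the transactional API described above (error handling abbreviated; parameter order as described, but treat it as an illustration rather than the exact interface):

    struct mem_cgroup *memcg;

    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            return -ENOMEM;                 /* reclaim inside the memcg failed */

    /* ... set up page->mapping / rmap so PageAnon() is reliable ... */

    mem_cgroup_commit_charge(page, memcg, false);
    /* on a failure path instead: mem_cgroup_cancel_charge(page, memcg); */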
Signed-off-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Vladimir Davydov
Signed-off-by: Hugh Dickins
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
3 commits
-
Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"), we
can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already
gone to si->cluster_info scan_swap_map_try_ssd_cluster() route. So that
the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
has already become a dead code snippet, and it should have been deleted.

This patch is to delete the redundant loop as Hugh and Shaohua
suggested.

[hughd@google.com: fix comment, simplify code]
Signed-off-by: Chen Yucong
Cc: Shaohua Li
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Originally get_swap_page() started iterating through the singly-linked
list of swap_info_structs using swap_list.next or highest_priority_index,
which both were intended to point to the highest priority active swap
target that was not full. The first patch in this series changed the
singly-linked list to a doubly-linked list, and removed the logic to start
at the highest priority non-full entry; it starts scanning at the highest
priority entry each time, even if the entry is full.

Replace the manually ordered swap_list_head with a plist, swap_active_head.
Add a new plist, swap_avail_head. The original swap_active_head plist
contains all active swap_info_structs, as before, while the new
swap_avail_head plist contains only swap_info_structs that are active and
available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
the swap_avail_head list.

Mel Gorman suggested using plists since they internally handle ordering
the list entries based on priority, which is exactly what swap was doing
manually. All the ordering code is now removed, and swap_info_struct
entries are simply added to their corresponding plist and automatically
ordered correctly.

Using a new plist for available swap_info_structs simplifies and
optimizes get_swap_page(), which no longer has to iterate over full
swap_info_structs. Using a new spinlock for swap_avail_head plist
allows each swap_info_struct to add or remove themselves from the
plist when they become full or not-full; previously they could not
do so because the swap_info_struct->lock is held when they change
from full<->not-full, and the swap_lock protecting the main
swap_active_head must be ordered before any swap_info_struct->lock.

Signed-off-by: Dan Streetman
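A hedged sketch of the new list and lock (names as described above; the iteration details are assumptions, not the exact get_swap_page() code):

    static struct plist_head swap_avail_head = PLIST_HEAD_INIT(swap_avail_head);
    static DEFINE_SPINLOCK(swap_avail_lock);

    struct swap_info_struct *si;

    spin_lock(&swap_avail_lock);
    plist_for_each_entry(si, &swap_avail_head, avail_list) {
            /* highest-priority device that is not full comes first;
             * full devices have already removed themselves from this list */
            break;
    }
    spin_unlock(&swap_avail_lock);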
Acked-by: Mel Gorman
Cc: Shaohua Li
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Cc: Paul Gortmaker
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e. swapon'ed, swap targets is rather complex, because:

- it stores the entries in priority order
- there is a pointer to the highest priority entry
- there is a pointer to the highest priority not-full entry
- there is a highest_priority_index variable set outside the swap_lock
- swap entries of equal priority should be used equally

This complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
where different priority swap targets are incorrectly used equally.

That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult to
understand get_swap_page() swap_list iteration logic.

The first patch changes from a singly-linked list to a doubly-linked list
using list_heads; the highest_priority_index and related code are removed
and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full. While this does introduce unnecessary
list iteration (i.e. Schlemiel the painter's algorithm) in the case where
one or more of the highest priority entries are full, the iteration and
manipulation code is much simpler and behaves correctly re: the above bug;
and the fourth patch removes the unnecessary iteration.

The second patch adds some minor plist helper functions; nothing new
really, just functions to match existing regular list functions. These
are used by the next two patches.

The third patch adds plist_requeue(), which is used by get_swap_page() in
the next patch - it performs the requeueing of same-priority entries
(which moves the entry to the end of its priority in the plist), so that
all equal-priority swap_info_structs get used equally.

The fourth patch converts the main list into a plist, and adds a new plist
that contains only swap_info entries that are both active and not full.
As Mel suggested using plists allows removing all the ordering code from
swap - plists handle ordering automatically. The list naming is also
clarified now that there are two lists, with the original list changed
from swap_list_head to swap_active_head and the new list named
swap_avail_head. A new spinlock is also added for the new list, so
swap_info entries can be added or removed from the new list immediately as
they become full or not full.

This patch (of 4):
Replace the singly-linked list tracking active, i.e. swapon'ed,
swap_info_struct entries with a doubly-linked list using struct
list_heads. Simplify the logic iterating and manipulating the list of
entries, especially get_swap_page(), by using standard list_head
functions, and removing the highest priority iteration logic.

The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs. The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.

Signed-off-by: Dan Streetman
Acked-by: Mel Gorman
Cc: Shaohua Li
Cc: Hugh Dickins
Cc: Dan Streetman
Cc: Michal Hocko
Cc: Christian Ehrhardt
Cc: Weijie Yang
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Bob Liu
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Paul Gortmaker
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Feb, 2014
1 commit
-
swapoff clears swap_info's SWP_USED flag prematurely and frees its
resources after that. A concurrent swapon will reuse this swap_info
while its previous resources are not cleared completely.

These late freed resources are:
- p->percpu_cluster
- swap_cgroup_ctrl[type]
- block_device setting
- inode->i_flags &= ~S_SWAPFILE

This patch clears the SWP_USED flag after all its resources are freed,
so that swapon can reuse this swap_info by alloc_swap_info() safely.

[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Weijie Yang
Acked-by: Hugh Dickins
Cc: Krzysztof Kozlowski
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jan, 2014
2 commits
-
In the second half of scan_swap_map()'s scan loop, offset is set to
si->lowest_bit and then incremented before entering the loop for the
first time, causing si->swap_map[si->lowest_bit] to be skipped.

Signed-off-by: Jamie Liu
Cc: Shaohua Li
Acked-by: Hugh Dickins
Cc: Minchan Kim
Cc: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.

I've recently noticed, based on requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites, that the page dump is
quite useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.

[akpm@linux-foundation.org: fix up includes]
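The macro is roughly of this shape (a sketch; the exact definition depends on CONFIG_DEBUG_VM):

    #define VM_BUG_ON_PAGE(cond, page)              \
            do {                                    \
                    if (unlikely(cond)) {           \
                            dump_page(page);        \
                            BUG();                  \
                    }                               \
            } while (0)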
Signed-off-by: Sasha Levin
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2013
2 commits
-
During swapoff the frontswap_map was NULL-ified before calling
frontswap_invalidate_area(). However the frontswap_invalidate_area()
exits early if frontswap_map is NULL. Invalidate was never called
during swapoff.

This patch moves frontswap_map_set() in swapoff just after calling
frontswap_invalidate_area() so outside of locks (swap_lock and
swap_info_struct->lock). This shouldn't be a problem as during swapon
the frontswap_map_set() is called also outside of any locks.

Signed-off-by: Krzysztof Kozlowski
Reviewed-by: Seth Jennings
Cc: Konrad Rzeszutek Wilk
Cc: Shaohua Li
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: Seth Jennings
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2013
1 commit
-
Fix race between swapoff and swapon. Swapoff used old_block_size from
swap_info outside of swapon_mutex so it could be overwritten by
concurrent swapon.

The race has a visible effect only if more than one swap block device
exists with different block sizes (e.g. /dev/sda1 with block size 4096
and /dev/sdb1 with 512). In such a case it leads to setting the block size
of the swapped-off device to the wrong value.

The bug can be triggered with multiple concurrent swapoff and swapon:
0. Swap for some device is on.
1. swapoff:
First the swapoff is called on this device and "struct swap_info_struct
*p" is assigned. This is done under swap_lock however this lock is
released for the call try_to_unuse().

2. swapon:
After the assignment above (and before acquiring swapon_mutex &
swap_lock by swapoff) the swapon is called on the same device.
The p->old_block_size is assigned to the value of block_size the device.
This block size should be the same as previous but sometimes it is not.
The swapon ends successfully.

3. swapoff:
Swapoff resumes, grabs the locks and mutex and continues to disable this
swap device. Now it sets the block size to value taken from swap_info
which was overwritten by swapon in 2.

Signed-off-by: Krzysztof Kozlowski
Reported-by: Weijie Yang
Cc: Bob Liu
Cc: Konrad Rzeszutek Wilk
Cc: Shaohua Li
Cc: Minchan Kim
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Sep, 2013
3 commits
-
Swap cluster allocation is meant to get better request merging to improve
performance. But the cluster is shared globally; if multiple tasks are
doing swap, this will cause interleaved disk access. Multiple tasks doing
swap is quite common: for example, each numa node has a kswapd thread
doing swap and multiple threads/processes doing direct page reclaim.

The ioscheduler can't help too much here, because tasks don't send swapout IO
down to block layer in the meantime. Block layer does merge some IOs, but
a lot not, depending on how many tasks are doing swapout concurrently. In
practice, I've seen a lot of small size IO in swapout workloads.

We make the cluster allocation per-cpu here. The interleaved disk access
issue goes away. All tasks swapout to their own cluster, so swapout will
become sequential, which can be easily merged to big size IO. If one CPU
can't get its per-cpu cluster (for example, there is no free cluster
anymore in the swap), it will fallback to scan swap_map. The CPU can
still continue swap. We don't need recycle free swap entries of other
CPUs.

In my test (swap to a 2-disk raid0 partition), this improves swapout
throughput by around 10%, and request size is increased significantly.

How this impacts swap readahead is uncertain though. On one side,
page reclaim always isolates and swaps several adjacent pages, this will
make page reclaim write the pages sequentially and benefit readahead. On
the other side, several CPU write pages interleave means the pages don't
live _sequentially_ but relatively _near_. In the per-cpu allocation
case, if adjacent pages are written by different cpus, they will live
relatively _far_. So how this impacts swap readahead depends on how many
pages page reclaim isolates and swaps one time. If the number is big,
this patch will benefit swap readahead. Of course, this is about
sequential access pattern. The patch has no impact for random access
pattern, because the new cluster allocation algorithm is just for SSD.

An alternative solution is organizing the swap layout to be per-mm instead of
this per-cpu approach. In the per-mm layout, we allocate a disk range for
each mm, so pages of one mm live in swap disk adjacently. per-mm layout
has potential issues of lock contention if multiple reclaimers are swap
pages from one mm. For a sequential workload, per-mm layout is better to
implement swap readahead, because pages from the mm are adjacent in disk.
But per-cpu layout isn't very bad in this workload, as page reclaim always
isolates and swaps several pages one time, such pages will still live in
disk sequentially and readahead can utilize this. For a random workload,
per-mm layout isn't beneficial of request merge, because it's quite
possible pages from different mm are swapout in the meantime and IO can't
be merged in per-mm layout. while with per-cpu layout we can merge
requests from any mm. Considering random workload is more popular in
workloads with swap (and per-cpu approach isn't too bad for sequential
workload too), I'm choosing per-cpu layout.

[akpm@linux-foundation.org: coding-style fixes]
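A sketch of the per-cpu bookkeeping this adds (field names assumed from the description):

    struct percpu_cluster {
            struct swap_cluster_info index; /* cluster this CPU currently allocates from */
            unsigned int next;              /* likely next free offset inside that cluster */
    };

    /* added to struct swap_info_struct:
     *      struct percpu_cluster __percpu *percpu_cluster;
     */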
Signed-off-by: Shaohua Li
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Kyungmin Park
Cc: Hugh Dickins
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The previous patch can expose races, according to Hugh:
swapoff was sometimes failing with "Cannot allocate memory", coming from
try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
on a free entry temporarily SWAP_MAP_BAD while being discarded.

We should use ACCESS_ONCE() there, and whenever accessing swap_map
locklessly; but rather than peppering it throughout try_to_unuse(), just
declare *swap_map with volatile.

try_to_unuse() is accustomed to *swap_map going down racily, but not
necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
prevent that transition once SWP_WRITEOK is switched off, when it's a
waste of time to issue discards anyway (swapon can do a whole discard).

Another issue is:
In swapin_readahead(), read_swap_cache_async() can read a bad swap entry,
because we don't check if readahead swap entry is bad. This doesn't break
anything but such swapin page is wasteful and can only be freed at page
reclaim. We should avoid reading such a swap entry. And in discard, we mark
the swap entry SWAP_MAP_BAD and then switch it to normal when the discard is
finished. If readahead reads such a swap entry, we have the same issue, so
we must check if the swap entry is bad here too.

Thanks to Hugh for pointing out that swapin_readahead could use a bad swap entry.
[include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
Signed-off-by: Shaohua Li
Signed-off-by: Hugh Dickins
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Kyungmin Park
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
swap can do cluster discard for SSD, which is good, but there are some
problems here:

1. swap does the discard just before page reclaim gets a swap entry and
writes the disk sectors. This is useless for high end SSD, because an
overwrite to a sector implies a discard to original sector too. A
discard + overwrite == overwrite.

2. the purpose of doing discard is to improve SSD firmware garbage
collection. Idealy we should send discard as early as possible, so
firmware can do something smart. Sending discard just after swap entry
is freed is considered early compared to sending discard before write.
Of course, if workload is already bound to gc speed, sending discard
earlier or later doesn't make a difference.

3. block discard is a sync API, which will delay scan_swap_map()
significantly.

4. Write and discard commands can be executed in parallel in a PCIe SSD.
Making swap discard async can make execution more efficient.

This patch makes swap discard async and moves discard to where swap entry
is freed. Discard and write have no dependence now, so above issues can
be avoided. Idealy we should do discard for any freed sectors, but some
SSD discard is very slow. This patch still does discard for a whole
cluster.

My test does several rounds of 'mmap, write, unmap', which will trigger a
lot of swap discard. In a fusionio card, with this patch, the test
runtime is reduced to 18% of the time without it, so around 5.5x faster.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Kyungmin Park
Cc: Hugh Dickins
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds