Eric Lee / smarc-fsl-linux-kernel

24 May, 2016

1 commit

dc0ef0df7 mm: make mmap_sem for write waits killable for mm syscalls ... Browse Code »

This is a follow up work for oom_reaper [1]. As the async OOM killing
depends on oom_sem for read we would really appreciate if a holder for
write didn't stood in the way. This patchset is changing many of
down_write calls to be killable to help those cases when the writer is
blocked and waiting for readers to release the lock and so help
__oom_reap_task to process the oom victim.

Most of the patches are really trivial because the lock is help from a
shallow syscall paths where we can return EINTR trivially and allow the
current task to die (note that EINTR will never get to the userspace as
the task has fatal signal pending). Others seem to be easy as well as
the callers are already handling fatal errors and bail and return to
userspace which should be sufficient to handle the failure gracefully.
I am not familiar with all those code paths so a deeper review is really
appreciated.

As this work is touching more areas which are not directly connected I
have tried to keep the CC list as small as possible and people who I
believed would be familiar are CCed only to the specific patches (all
should have received the cover though).

This patchset is based on linux-next and it depends on
down_write_killable for rw_semaphores which got merged into tip
locking/rwsem branch and it is merged into this next tree. I guess it
would be easiest to route these patches via mmotm because of the
dependency on the tip tree but if respective maintainers prefer other
way I have no objections.

I haven't covered all the mmap_write(mm->mmap_sem) instances here

$ git grep "down_write(.*\)" next/master | wc -l
98
$ git grep "down_write(.*\)" | wc -l
62

I have tried to cover those which should be relatively easy to review in
this series because this alone should be a nice improvement. Other
places can be changed on top.

[0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
[1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

This patch (of 18):

This is the first step in making mmap_sem write waiters killable. It
focuses on the trivial ones which are taking the lock early after
entering the syscall and they are not changing state before.

Therefore it is very easy to change them to use down_write_killable and
immediately return with -EINTR. This will allow the waiter to pass away
without blocking the mmap_sem which might be required to make a forward
progress. E.g. the oom reaper will need the lock for reading to
dismantle the OOM victim address space.

The only tricky function in this patch is vm_mmap_pgoff which has many
call sites via vm_mmap. To reduce the risk keep vm_mmap with the
original non-killable semantic for now.

vm_munmap callers do not bother checking the return value so open code
it into the munmap syscall path for now for simplicity.

Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: "Kirill A. Shutemov"
Cc: Konstantin Khlebnikov
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc: Dave Hansen
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-05-24 08:04:14 +0800

05 Apr, 2016

1 commit

09cbfeaf1 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros ... Browse Code »

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-04-05 01:41:08 +0800

16 Mar, 2016

2 commits

d7206a70a mm/madvise: update comment on sys_madvise() ... Browse Code »

Some new MADV_* advices are not documented in sys_madvise() comment. So
let's update it.

[akpm@linux-foundation.org: modifications suggested by Michal]
Signed-off-by: Naoya Horiguchi
Acked-by: Michal Hocko
Cc: Minchan Kim
Cc: "Kirill A. Shutemov"
Cc: Jason Baron
Cc: Chen Gong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2016-03-16 07:55:16 +0800
23a003bfd mm/madvise: pass return code of memory_failure() to userspace ... Browse Code »

Currently the return value of memory_failure() is not passed to
userspace when madvise(MADV_HWPOISON) is used. This is inconvenient for
test programs that want to know the result of error handling. So let's
return it to the caller as we already do in the MADV_SOFT_OFFLINE case.

Signed-off-by: Naoya Horiguchi
Cc: Chen Gong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2016-03-16 07:55:16 +0800

16 Jan, 2016

4 commits

b8d3c4c30 mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called ... Browse Code »

We don't need to split THP page when MADV_FREE syscall is called if
[start, len] is aligned with THP size. The split could be done when VM
decide to free it in reclaim path if memory pressure is heavy. With
that, we could avoid unnecessary THP split.

For the feature, this patch changes pte dirtness marking logic of THP.
Now, it marks every ptes of pages dirty unconditionally in splitting,
which makes MADV_FREE void. So, instead, this patch propagates pmd
dirtiness to all pages via PG_dirty and restores pte dirtiness from
PG_dirty. With this, if pmd is clean(ie, MADV_FREEed) when split
happens(e,g, shrink_page_list), all of pages are clean too so we could
discard them.

Signed-off-by: Minchan Kim
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Matt Turner
Cc: Max Filippov
Cc: Mel Gorman
Cc: Michael Kerrisk
Cc: Michal Hocko
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Rik van Riel
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-01-16 09:56:32 +0800
10853a039 mm: move lazily freed pages to inactive list ... Browse Code »

MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
them so there is no value keeping them in the active anonymous LRU so
this patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.

An arguable issue for the approach would be whether we should put the
page to the head or tail of the inactive list. I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase but at least we know it's not *hot*, so landing of inactive head
would be a comprimise for various usecases.

This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily. This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.

Signed-off-by: Minchan Kim
Acked-by: Hugh Dickins
Acked-by: Michal Hocko
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Shaohua Li
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc:
Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: KOSAKI Motohiro
Cc: Kirill A. Shutemov
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Kerrisk
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Roland Dreier
Cc: Russell King
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-01-16 09:56:32 +0800
64b42bc1c mm/madvise.c: free swp_entry in madvise_free ... Browse Code »

When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is
siginficat slower (ie, 2x times) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
memset(512M);
madvise(MADV_FREE or MADV_DONTNEED);
}

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find hinted pages were already swapped out when syscall is called,
it's pointless to keep the swapped-out pages in pte. Instead, let's
free the cold page because swapin is more expensive than (alloc page +
zeroing).

With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Signed-off-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andrea Arcangeli
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Daniel Micay
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Jason Evans
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: Kirill A. Shutemov
Cc: Matt Turner
Cc: Max Filippov
Cc: Mel Gorman
Cc: Michael Kerrisk
Cc: Mika Penttil
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Rik van Riel
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-01-16 09:56:32 +0800
854e9ed09 mm: support madvise(MADV_FREE) ... Browse Code »

Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation +
zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When madvise syscall is called, VM clears dirty bit of ptes of the
range. If memory pressure happens, VM checks dirty bit of page table
and if it found still "clean", it means it's a "lazyfree pages" so VM
could discard the page instead of swapping out. Once there was store
operation for the page before VM peek a page to reclaim, dirty bit is
set so VM can swap out the page instead of discarding.

One thing we should notice is that basically, MADV_FREE relies on dirty
bit in page table entry to decide whether VM allows to discard the page
or not. IOW, if page table entry includes marked dirty bit, VM
shouldn't discard the page.

However, as a example, if swap-in by read fault happens, page table
entry doesn't have dirty bit so MADV_FREE could discard the page
wrongly.

For avoiding the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache. It worked out because swapped-in page lives on swap
cache and since it is evicted from the swap cache, the page has PG_dirty
flag. So both page flags check effectively prevent wrong discarding by
MADV_FREE.

However, a problem in above logic is that swapped-in page has PG_dirty
still after they are removed from swap cache so VM cannot consider the
page as freeable any more even if madvise_free is called in future.

Look at below example for detail.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of pages are swapped out
..
..
var = *ptr; -> a page swapped-in and could be removed from
swapcache. Then, page table doesn't mark
dirty bit and page descriptor includes PG_dirty
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. In this time, VM cannot discard the page because the page
.. has *PG_dirty*

To solve the problem, this patch clears PG_dirty if only the page is
owned exclusively by current process when madvise is called because
PG_dirty represents ptes's dirtiness in several processes so we could
clear it only if we own it exclusively.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark(./ebizzy -S 10 -n 512)

Higher avg is better.

vanilla-jemalloc MADV_free-jemalloc

1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00

2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00

4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00

8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00

16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00

32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

This patch (of 12):

Add core MADV_FREE implementation.

[akpm@linux-foundation.org: small cleanups]
Signed-off-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Cc: Mika Penttil
Cc: Michael Kerrisk
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Cc: Jason Evans
Cc: Daniel Micay
Cc: "Kirill A. Shutemov"
Cc: Shaohua Li
Cc:
Cc: Andy Lutomirski
Cc: "James E.J. Bottomley"
Cc: "Kirill A. Shutemov"
Cc: "Shaohua Li"
Cc: Andrea Arcangeli
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Chen Gang
Cc: Chris Zankel
Cc: Darrick J. Wong
Cc: David S. Miller
Cc: Helge Deller
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Max Filippov
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Roland Dreier
Cc: Russell King
Cc: Shaohua Li
Cc: Will Deacon
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-01-16 09:56:32 +0800

09 Sep, 2015

1 commit

72079ba0d mm: madvise allow remove operation for hugetlbfs ... Browse Code »

Now that we have hole punching support for hugetlbfs, we can also
support the MADV_REMOVE interface to it.

Signed-off-by: Dave Hansen
Signed-off-by: Mike Kravetz
Reviewed-by: Naoya Horiguchi
Acked-by: Hillf Danton
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Davidlohr Bueso
Cc: Aneesh Kumar
Cc: Christoph Hellwig
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Kravetz
2015-09-09 06:35:28 +0800

05 Sep, 2015

2 commits

1ecef9ed0 mm/madvise.c: make madvise_behaviour_valid() return bool ... Browse Code »

This makes the madvise_bahaviour_valid() function return bool due to
this particular function always returning the value of either one or
zero as its return value.

Signed-off-by: Nicholas Krause
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicholas Krause
2015-09-05 07:54:41 +0800
19a809afe userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx ... Browse Code »

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware about so that we can merge vmas back like they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli
Acked-by: Pavel Emelyanov
Cc: Sanidhya Kashyap
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov"
Cc: Andres Lagar-Cavilla
Cc: Dave Hansen
Cc: Paolo Bonzini
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Andy Lutomirski
Cc: Hugh Dickins
Cc: Peter Feiner
Cc: "Dr. David Alan Gilbert"
Cc: Johannes Weiner
Cc: "Huangpeng (Peter)"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2015-09-05 07:54:41 +0800

02 Jun, 2015

1 commit

66114cad6 writeback: separate out include/linux/backing-dev-defs.h ... Browse Code »

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800

17 Feb, 2015

1 commit

e748dcd09 vfs: remove get_xip_mem ... Browse Code »

All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty. Also remove
documentation of the long-gone get_xip_page().

Signed-off-by: Matthew Wilcox
Cc: Andreas Dilger
Cc: Boaz Harrosh
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jan Kara
Cc: Jens Axboe
Cc: Kirill A. Shutemov
Cc: Mathieu Desnoyers
Cc: Randy Dunlap
Cc: Ross Zwisler
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox
2015-02-17 09:56:03 +0800

13 Feb, 2015

1 commit

6bec00352 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block ... Browse Code »

Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.

Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info

Linus Torvalds
2015-02-13 05:50:21 +0800

11 Feb, 2015

2 commits

0661a3361 mm: remove rest usage of VM_NONLINEAR and pte_file() ... Browse Code »

One bit in ->vm_flags is unused now!

Signed-off-by: Kirill A. Shutemov
Cc: Dan Carpenter
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2015-02-11 06:30:31 +0800
8a5f14a23 mm: drop support of non-linear mapping from unmap/zap codepath ... Browse Code »

We have remap_file_pages(2) emulation in -mm tree for few release cycles
and we plan to have it mainline in v3.20. This patchset removes rest of
VM_NONLINEAR infrastructure.

Patches 1-8 take care about generic code. They are pretty
straight-forward and can be applied without other of patches.

Rest patches removes pte_file()-related stuff from architecture-specific
code. It usually frees up one bit in non-present pte. I've tried to reuse
that bit for swap offset, where I was able to figure out how to do that.

For obvious reason I cannot test all that arch-specific code and would
like to see acks from maintainers.

In total, remap_file_pages(2) required about 1.4K lines of not-so-trivial
kernel code. That's too much for functionality nobody uses.

Tested-by: Felipe Balbi

This patch (of 38):

We don't create non-linear mappings anymore. Let's drop code which
handles them on unmap/zap.

Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2015-02-11 06:30:30 +0800

21 Jan, 2015

1 commit

97b713ba3 fs: kill BDI_CAP_SWAP_BACKED ... Browse Code »

This bdi flag isn't too useful - we can determine that a vma is backed by
either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instaces for swap and shmem
in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig
Reviewed-by: Tejun Heo
Reviewed-by: Jan Kara
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-01-21 05:02:56 +0800

08 Nov, 2014

1 commit

72c72bdf7 VFS: Rename do_fallocate() to vfs_fallocate() ... Browse Code »

This function needs to be exported so it can be used by the NFSD module
when responding to the new ALLOCATE and DEALLOCATE operations in NFS
v4.2. Christoph Hellwig suggested renaming the function to stay
consistent with how other vfs functions are named.

Signed-off-by: Anna Schumaker
Signed-off-by: J. Bruce Fields

Anna Schumaker
2014-11-08 05:17:44 +0800

07 Aug, 2014

1 commit

54980b93c mm: update the description for madvise_remove ... Browse Code »

Currently, we have more filesystems supporting fallocate, e.g
ext4/btrfs. Remove the outdated comment for madvise_remove.

Signed-off-by: Wang Sheng-Hui
Reviewed-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wang Sheng-Hui
2014-08-07 09:01:18 +0800

24 May, 2014

1 commit

55231e5c8 mm: madvise: fix MADV_WILLNEED on shmem swapouts ... Browse Code »

MADV_WILLNEED currently does not read swapped out shmem pages back in.

Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
cache radix trees") made find_get_page() filter exceptional radix tree
entries but failed to convert all find_get_page() callers that WANT
exceptional entries over to find_get_entry(). One of them is shmem swap
readahead in madvise, which now skips over any swap-out records.

Convert it to find_get_entry().

Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner
Reported-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-05-24 00:37:29 +0800

01 Oct, 2013

1 commit

20cb6cab5 mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood ... Browse Code »

madvise_hwpoison won't check if the page is small page or huge page and
traverses in small page granularity against the range unconditionally,
which result in a printk flood "MCE xxx: already hardware poisoned" if
the page is a huge page.

This patch fixes it by using compound_order(compound_head(page)) for
huge page iterator.

Testcase:

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include

#define PAGES_TO_TEST 3
#define PAGE_SIZE 4096 * 512

int main(void)
{
char *mem;
int i;

mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);

if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;

munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

return 0;
}

Signed-off-by: Wanpeng Li
Reviewed-by: Naoya Horiguchi
Acked-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanpeng Li
2013-10-01 05:31:02 +0800

12 Sep, 2013

5 commits

325c4ef5c mm/madvise.c:madvise_hwpoison(): remove local `ret' ... Browse Code »

madvise_hwpoison() has two locals called "ret". Fix it all up.

Cc: Wanpeng Li
Cc: Naoya Horiguchi
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2013-09-12 06:58:13 +0800
8302423b8 mm/madvise.c: fix return value of madvise_hwpoison() ... Browse Code »

The return value outside for loop is always zero which means
madvise_hwpoison return success, however, this is not truth for
soft_offline_page w/ failure return value.

Signed-off-by: Wanpeng Li
Reviewed-by: Naoya Horiguchi
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanpeng Li
2013-09-12 06:58:12 +0800
29b4eedee mm/hwpoison.c: fix held reference count after unpoisoning empty zero page ... Browse Code »

madvise hwpoison inject will poison the read-only empty zero page if there
is no write access before poison. Empty zero page reference count will be
increased for hwpoison, subsequent poison zero page will return directly
since page has already been set PG_hwpoison, however, page reference count
is still increased by get_user_pages_fast. The unpoison process will
unpoison the empty zero page and decrease the reference count successfully
for the fist time, however, subsequent unpoison empty zero page will
return directly since page has already been unpoisoned and without
decrease the page reference count of empty zero page.

This patch fixes it by make madvise_hwpoison() put a page and return
immediately (without calling memory_failure() or soft_offline_page()) when
the page is already hwpoisoned.

Testcase:

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include

#define PAGES_TO_TEST 3
#define PAGE_SIZE 4096

int main(void)
{
char *mem;
int i;

mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
return -1;

munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

return 0;
}

Add printk to dump page reference count:

[ 93.075959] Injecting memory failure for page 0x19d0 at 0xb77d8000
[ 93.076207] MCE 0x19d0: non LRU page recovery: Ignored
[ 93.076209] pfn 0x19d0, page count = 1 after memory failure
[ 93.076220] Injecting memory failure for page 0x19d0 at 0xb77d9000
[ 93.076221] MCE 0x19d0: already hardware poisoned
[ 93.076222] pfn 0x19d0, page count = 2 after memory failure
[ 93.076224] Injecting memory failure for page 0x19d0 at 0xb77da000
[ 93.076224] MCE 0x19d0: already hardware poisoned
[ 93.076225] pfn 0x19d0, page count = 3 after memory failure

Signed-off-by: Wanpeng Li
Suggested-by: Naoya Horiguchi
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanpeng Li
2013-09-12 06:58:11 +0800
b194b8cdb mm/hwpoison: add '#' to madvise_hwpoison ... Browse Code »

Add '#' to madvise_hwpoison.

Before patch:

[ 95.892866] Injecting memory failure for page 19d0 at b7786000
[ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored

After patch:

[ 95.892866] Injecting memory failure for page 0x19d0 at 0xb7786000
[ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored

Signed-off-by: Wanpeng Li
Reviewed-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Tony Luck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanpeng Li
2013-09-12 06:58:11 +0800
ec9bed9d3 mm/madvise.c: fix coding-style errors ... Browse Code »

This fixes following errors:
- ERROR: "(foo*)" should be "(foo *)"
- ERROR: "foo ** bar" should be "foo **bar"

Signed-off-by: Vladimir Cernov
Reviewed-by: Pekka Enberg
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Cernov
2013-09-12 06:57:00 +0800

30 Apr, 2013

1 commit

84d96d897 mm: madvise: complete input validation before taking lock ... Browse Code »

In madvise(), there doesn't seem to be any reason for taking the
¤t->mm->mmap_sem before start and len_in have been validated.
Incidentally, this removes the need for the out: label.

[akpm@linux-foundation.org: s/out_plug/out/, per David]
Signed-off-by: Rasmus Villemoes
Acked-by: KOSAKI Motohiro
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rasmus Villemoes
2013-04-30 06:54:37 +0800

24 Feb, 2013

1 commit

1998cc048 mm: make madvise(MADV_WILLNEED) support swap file prefetch ... Browse Code »

Make madvise(MADV_WILLNEED) support swap file prefetch. If memory is
swapout, this syscall can do swapin prefetch. It has no impact if the
memory isn't swapout.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
[sasha.levin@oracle.com: fix BUG on madvise early failure]
Signed-off-by: Shaohua Li
Cc: Hugh Dickins
Cc: Rik van Riel
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2013-02-24 09:50:10 +0800

09 Oct, 2012

1 commit

0103bd16f mm: prepare VM_DONTDUMP for using in drivers ... Browse Code »

Rename VM_NODUMP into VM_DONTDUMP: this name matches other negative flags:
VM_DONTEXPAND, VM_DONTCOPY. Currently this flag used only for
sys_madvise. The next patch will use it for replacing the outdated flag
VM_RESERVED.

Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
(VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)

Signed-off-by: Konstantin Khlebnikov
Cc: Alexander Viro
Cc: Carsten Otte
Cc: Chris Metcalf
Cc: Cyrill Gorcunov
Cc: Eric Paris
Cc: H. Peter Anvin
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: James Morris
Cc: Jason Baron
Cc: Kentaro Takeda
Cc: Matt Helsley
Cc: Nick Piggin
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Suresh Siddha
Cc: Tetsuo Handa
Cc: Venkatesh Pallipadi
Acked-by: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-10-09 15:22:18 +0800

07 Jul, 2012

1 commit

9ab4233dd mm: Hold a file reference in madvise_remove ... Browse Code »
1

Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).

The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")

Cc: Hugh Dickins
Cc: Miklos Szeredi
Cc: Badari Pulavarty
Cc: Nick Piggin
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski
Signed-off-by: Linus Torvalds

Andy Lutomirski
2012-07-07 01:34:38 +0800

30 May, 2012

1 commit

3f31d0757 mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE ... Browse Code »

Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
to use do_fallocate() instead of vmtruncate_range(): which extends
madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.

There is one more user of vmtruncate_range() in our tree,
staging/android's ashmem_shrink(): convert it to use do_fallocate() too
(but if its unpinned areas are already unmapped - I don't know - then it
would do better to use shmem_truncate_range() directly).

Based-on-patch-by: Cong Wang
Signed-off-by: Hugh Dickins
Cc: Christoph Hellwig
Cc: Al Viro
Cc: Colin Cross
Cc: John Stultz
Cc: Greg Kroah-Hartman
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Dave Chinner
Cc: Ben Myers
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-05-30 07:22:22 +0800

24 Mar, 2012

1 commit

accb61fe7 coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP ... Browse Code »

Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag:
MADV_DONTDUMP, which can be set by applications to specifically request
memory regions which should not dump core.

The specific application I have in mind is qemu: we can add a flag there
that wouldn't dump all of guest memory when qemu dumps core. This flag
might also be useful for security sensitive apps that want to absolutely
make sure that parts of memory are not dumped. To clear the flag use:
MADV_DODUMP.

[akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
[akpm@linux-foundation.org: fix up the architectures which broke]
Signed-off-by: Jason Baron
Acked-by: Roland McGrath
Cc: Chris Metcalf
Cc: Avi Kivity
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-03-24 07:58:42 +0800

04 Jan, 2012

1 commit

cd42f4a3b HWPOISON: Clean up memory_failure() vs. __memory_failure() ... Browse Code »

There is only one caller of memory_failure(), all other users call
__memory_failure() and pass in the flags argument explicitly. The
lone user of memory_failure() will soon need to pass flags too.

Add flags argument to the callsite in mce.c. Delete the old memory_failure()
function, and then rename __memory_failure() without the leading "__".

Provide clearer message when action optional memory errors are ignored.

Acked-by: Borislav Petkov
Signed-off-by: Tony Luck

Tony Luck
2012-01-04 04:06:32 +0800

21 Jul, 2011

1 commit

bd5fe6c5e fs: kill i_alloc_sem ... Browse Code »

i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion. It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall way
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches non-zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2011-07-21 08:47:46 +0800

14 Jan, 2011

3 commits

60ab3244e thp: khugepaged: make khugepaged aware about madvise ... Browse Code »

MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
mmap and before touching the memory. While this is enough for most
usages, it's little effort to make madvise more dynamic at runtime on an
existing mapping by making khugepaged aware about madvise.

MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
fault (that may not ever happen if all pages are already mapped and the
"enabled" knob was set to madvise during the initial page faults).

MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
collapsing pages where not needed.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andrea Arcangeli
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:47 +0800
a664b2d85 thp: madvise(MADV_NOHUGEPAGE) ... Browse Code »

Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel. Never silently return
success.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:47 +0800
0af4e98b6 thp: madvise(MADV_HUGEPAGE) ... Browse Code »

Add madvise MADV_HUGEPAGE to mark regions that are important to be
hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel. Never silently return
success.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:42 +0800

16 Dec, 2009

3 commits

afcf938ee HWPOISON: Add a madvise() injector for soft page offlining ... Browse Code »

Process based injection is much easier to handle for test programs,
who can first bring a page into a specific state and then test.
So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
to the existing hard offline injector.

Signed-off-by: Andi Kleen

Andi Kleen
2009-12-16 19:20:00 +0800
d15f107d9 HWPOISON: Use get_user_page_fast in hwpoison madvise ... Browse Code »

The previous version didn't take the mmap_sem before calling gup(),
which is racy.

Use get_user_pages_fast() instead which doesn't need any locks.
This is also faster of course, but then it doesn't really matter
because this is just a testing path.

Based on report from Nick Piggin.
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen

Andi Kleen
2009-12-16 19:20:00 +0800
82ba011b9 HWPOISON: Turn ref argument into flags argument ... Browse Code »

Now that "ref" is just a boolean turn it into
a flags argument. First step is only a single flag
that makes the code's intention more clear, but more
may follow.

Signed-off-by: Andi Kleen

Andi Kleen
2009-12-16 19:19:57 +0800