23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

2 commits

  • This removes the 'write' argument from access_process_vm() and
    replaces it with 'gup_flags', as use of this function previously
    implied FOLL_FORCE silently; after this patch, callers pass the flag
    explicitly.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
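
    A minimal before/after sketch of a caller (illustrative, not taken
    from the patch itself):

        /* before: 'write' implied FOLL_FORCE silently */
        copied = access_process_vm(tsk, addr, buf, len, 1);

        /* after: the flags are spelled out explicitly */
        copied = access_process_vm(tsk, addr, buf, len,
                                   FOLL_FORCE | FOLL_WRITE);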

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
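
    The conversion pattern in callers is roughly (a sketch, assuming the
    4.9-era signature of get_user_pages_unlocked()):

        unsigned int gup_flags = 0;

        if (write)
                gup_flags |= FOLL_WRITE;
        if (force)
                gup_flags |= FOLL_FORCE;
        ret = get_user_pages_unlocked(start, nr_pages, pages, gup_flags);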

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

29 Jul, 2016

1 commit

  • There are now a number of accounting oddities, such as mapped file
    pages being accounted for on the node while the total number of file
    pages is accounted on the zone. This can be coped with to some extent
    but it's confusing, so this patch moves the relevant file-based
    accounting to the node. Due to throttling logic in the page allocator
    for reliable OOM detection, it is still necessary to track dirty and
    writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Naive approach: on mapping/unmapping the page as compound, we update
    ->_mapcount on each 4k page. That's not efficient, but it's not
    obvious how to optimize it; we can look into optimization later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages differs from that of anon pages: a file page
    can be mapped again at any time.
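
    Roughly, the naive approach amounts to a loop like this in the rmap
    helpers (a simplified sketch, not the exact patch):

        int i;

        /* mapping the page as compound: touch every 4k subpage */
        for (i = 0; i < HPAGE_PMD_NR; i++)
                atomic_inc(&page[i]._mapcount);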

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for LRU pages only until now, and that was
    enough to make high-order pages. But recently, embedded systems
    (e.g., webOS, Android) use lots of non-movable pages (e.g., zram, GPU
    memory), so we have seen several reports about the trouble of small
    high-order allocations. To fix the problem, there were several
    efforts (e.g., enhancing the compaction algorithm, SLUB fallback to
    0-order pages, reserved memory, vmalloc and so on), but if there are
    lots of non-movable pages in the system, those solutions are void in
    the long run.

    So, this patch supports a facility to turn non-movable pages into
    movable ones. For the feature, it introduces migration-related
    functions in address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define
    three functions, which are function pointers of struct
    address_space_operations.

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects of a driver's isolate_page function is to return
    *true* if the driver isolated the page successfully. On returning
    true, the VM marks the page as PG_isolated so that concurrent
    isolation on several CPUs skips the page. If a driver cannot isolate
    the page, it should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru
    fields, so the driver shouldn't expect those fields to be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the content of the
    old page to the new page and to set up the fields of struct page
    newpage. Keep in mind that the driver should indicate to the VM that
    the oldpage is no longer movable via __ClearPageMovable() under
    page_lock if it migrated the oldpage successfully and returns 0. If
    the driver cannot migrate the page at the moment, it can return
    -EAGAIN. On -EAGAIN, the VM will retry page migration after a short
    time because the VM interprets -EAGAIN as "temporary migration
    failure". On returning any error other than -EAGAIN, the VM will
    give up on migrating the page without retrying.

    The driver shouldn't touch the page.lru field the VM uses in these
    functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's
    putback_page with the page whose migration failed. In this function,
    the driver should put the isolated page back into its own data
    structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable,
    under page_lock:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument to register the migration family
    of functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses
    the lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low two bits of
    page->mapping to return the correct struct address_space.

    For testing for a non-lru movable page, the VM supports the
    __PageMovable function. However, it doesn't guarantee identifying a
    non-lru movable page, because the page->mapping field is unified with
    other variables in struct page. Also, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    although it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-lru movable once the page has been isolated, because LRU pages
    can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good
    for just peeking at non-lru movable pages before the more expensive
    check with lock_page during pfn scanning to select a victim.

    For guaranteeing a non-lru movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters
    a PG_isolated non-lru movable page, it can skip it. The driver
    doesn't need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated
    page, the page has been isolated by the VM, so the driver shouldn't
    touch the page.lru field. PG_isolated is aliased with the PG_reclaim
    flag, so the driver shouldn't use the flag for its own purposes. A
    minimal driver skeleton tying these callbacks together is sketched
    below.
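
    A hypothetical driver skeleton (illustrative names, not a real
    driver) wiring these callbacks together:

        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from the driver's private lists; on
                 * success the VM marks it PG_isolated and may reuse
                 * page->lru */
                return true;
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *oldpage,
                                     enum migrate_mode mode)
        {
                /* copy content and driver state to newpage, then tell
                 * the VM the old page is no longer movable */
                __ClearPageMovable(oldpage);
                return 0;       /* or -EAGAIN for a temporary failure */
        }

        static void mydrv_putback_page(struct page *page)
        {
                /* migration failed: take the isolated page back */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page   = mydrv_isolate_page,
                .migratepage    = mydrv_migratepage,
                .putback_page   = mydrv_putback_page,
        };

        /* and when a page becomes movable, under page_lock:
         *      __SetPageMovable(page, mapping);
         */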

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

24 May, 2016

2 commits

  • All the callers of vm_mmap seem to check for the failure already and
    bail out in one way or another on the error which means that we can
    change it to use killable version of vm_mmap_pgoff and return -EINTR if
    the current task gets killed while waiting for mmap_sem. This also
    means that vm_mmap_pgoff can be killable by default and drop the
    additional parameter.

    This will help in OOM conditions, when the oom victim might be stuck
    waiting for the mmap_sem for write, which in turn can block the
    oom_reaper, which relies on the mmap_sem for read to make forward
    progress and reclaim the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error for the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return
    to userspace if we got killed.
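
    The heart of the change is the lock acquisition in vm_mmap_pgoff (a
    simplified sketch of the pattern):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);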

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up work for the oom_reaper [1]. As the async OOM
    killing depends on mmap_sem for read, we would really appreciate it
    if a holder for write didn't stand in the way. This patchset changes
    many of the down_write calls to be killable, to help those cases when
    the writer is blocked waiting for readers to release the lock, and so
    help __oom_reap_task process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never get to userspace
    as the task has a fatal signal pending). Others seem easy as well, as
    the callers are already handling fatal errors and bail and return to
    userspace, which should be sufficient to handle the failure
    gracefully. I am not familiar with all those code paths, so a deeper
    review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores, which got merged into the tip
    locking/rwsem branch and is included in the next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree, but if the respective maintainers prefer
    another way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here:

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering
    the syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable
    and immediately return with -EINTR. This will allow the waiter to
    pass away without blocking the mmap_sem, which might be required to
    make forward progress. E.g. the oom reaper will need the lock for
    reading to dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has
    many call sites via vm_mmap. To reduce the risk, vm_mmap keeps the
    original non-killable semantics for now.

    vm_munmap callers do not bother checking the return value, so open
    code it into the munmap syscall path for now for simplicity, as in
    the sketch below.
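
    The open-coded munmap path then looks roughly like this (a simplified
    sketch):

        SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
        {
                int ret;
                struct mm_struct *mm = current->mm;

                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;
                ret = do_munmap(mm, addr, len);
                up_write(&mm->mmap_sem);
                return ret;
        }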

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

1 commit

  • It's huge. Uninlining it saves 206 bytes per callsite. Shaves 4924
    bytes from the x86_64 allmodconfig vmlinux.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Steve Capper
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per-page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime
    relatively cheaply, without having to change every single page in the
    affected virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC mappings
    today, but binary loaders could start making use of this new feature
    to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
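
    For illustration, an execute-only mapping from user space would look
    like this (a sketch; the behaviour described above requires
    pkeys-capable hardware):

        #include <sys/mman.h>

        int main(void)
        {
                /* PROT_EXEC only, no PROT_READ/WRITE: the kernel assigns
                 * an execute-only protection key, so ordinary loads and
                 * stores to the mapping fault while instruction fetches
                 * still work */
                void *p = mmap(NULL, 4096, PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                return p == MAP_FAILED;
        }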

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

16 Feb, 2016

1 commit

  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.
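
    The dispatch-by-argument-count trick looks roughly like this (an
    illustrative sketch with made-up names, not the kernel's exact
    macros):

        /* PICK() expands to its 5th argument, so which name is chosen
         * depends on how many arguments the caller passed */
        #define PICK(_1, _2, _3, _4, NAME, ...) NAME

        /* gup(a, b) expands to gup2(a, b);
         * gup(a, b, c, d) expands to gup4(a, b, c, d) */
        #define gup(...) PICK(__VA_ARGS__, gup4, gup3, gup2, gup1)(__VA_ARGS__)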

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

04 Feb, 2016

1 commit

  • Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") added the [stack:TID] annotation to
    /proc/<pid>/maps.

    Finding the task of a stack VMA requires walking the entire thread
    list, turning this into quadratic behavior: a thousand threads means
    a thousand stacks, so the rendering of /proc/<pid>/maps needs to look
    at a million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
    /proc/<pid>/numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc/<pid>/task/<tid>/maps is
    retained, as identifying the stack VMA there is an O(1) operation.

    Siddesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Jan, 2016

1 commit

  • Only functions doing more than one read are modified. Consumers
    happened to deal with possibly changing data, but it does not seem
    like a good thing to rely on.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     

16 Jan, 2016

2 commits

  • Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with a PMD, so there's no reason to look at
    PTEs for PageTransHuge() pages. That's not true anymore: THP can be
    mapped with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't define the meaning of page->mapping for tail pages.
    Currently it's always NULL, which can be inconsistent with the head
    page and potentially lead to problems.

    Let's poison the pointer to catch all illegal uses.

    page_rmapping(), page_mapping() and page_anon_vma() are changed to look
    on head page.

    The only illegal use I've caught so far is __GFP_COMP pages from the
    sound subsystem, mapped with PTEs. do_shared_fault() is changed to
    use page_rmapping() instead of direct access to fault_page->mapping.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Jan, 2016

1 commit


06 Nov, 2015

1 commit


16 Apr, 2015

1 commit

  • The most-used page->mapping helper -- page_mapping() -- has already
    been uninlined. Let's also uninline page_rmapping() and
    page_anon_vma(). Depending on configuration, this saves around 400
    bytes of text:

    text data bss dec hex filename
    660318 99254 410000 1169572 11d8a4 mm/built-in.o-before
    659854 99254 410000 1169108 11d6d4 mm/built-in.o

    I also tried to make the code a bit cleaner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

14 Feb, 2015

1 commit

  • kstrdup() is often used to duplicate strings where neither source nor
    destination will ever be modified. In such cases we can just reuse
    the source instead of duplicating it. The problem is that we must be
    sure that the source is non-modifiable and that its lifetime is long
    enough.

    I suspect the good candidates for such strings are strings located in
    the kernel's .rodata section: they cannot be modified because the
    section is read-only, and their lifetime is equal to the kernel's
    lifetime.

    This small patchset proposes an alternative version of kstrdup -
    kstrdup_const - which returns the source string if it is located in
    .rodata and otherwise falls back to kstrdup. To verify whether the
    source is in .rodata, the function checks if the address is between
    the sentinels __start_rodata and __end_rodata. I guess it should work
    on all architectures.

    The main patch is accompanied by four patches constifying kstrdup for
    cases where the situation described above happens frequently.

    I have tested the patchset on a mobile platform (exynos4210-trats)
    and it saves 3272 string allocations. Since the minimal allocation is
    32 or 64 bytes depending on Kconfig options, the patchset saves about
    100KB or 200KB of memory respectively.

    Stats from the tested platform show that the main offender is sysfs:

    By caller:
    2260 __kernfs_new_node
    631 clk_register+0xc8/0x1b8
    318 clk_register+0x34/0x1b8
    51 kmem_cache_create
    12 alloc_vfsmnt

    By string (with count >= 5):
    883 power
    876 subsystem
    135 parameters
    132 device
    61 iommu_group
    ...

    This patch (of 5):

    Add an alternative version of kstrdup which returns a pointer to a
    constant char array. The function checks whether the input string is
    in a persistent, read-only memory section; if yes, it returns the
    input string, otherwise it falls back to kstrdup.

    kstrdup_const is accompanied by kfree_const, which performs
    conditional deallocation of the string; a minimal sketch of the pair
    follows.
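
    Sketched under the assumption of the __start_rodata/__end_rodata
    sentinel check described above:

        static int is_kernel_rodata(unsigned long addr)
        {
                return addr >= (unsigned long)__start_rodata &&
                       addr < (unsigned long)__end_rodata;
        }

        const char *kstrdup_const(const char *s, gfp_t gfp)
        {
                if (is_kernel_rodata((unsigned long)s))
                        return s;       /* reuse the .rodata string */
                return kstrdup(s, gfp);
        }

        void kfree_const(const void *x)
        {
                if (!is_kernel_rodata((unsigned long)x))
                        kfree(x);       /* only free what was duplicated */
        }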

    Signed-off-by: Andrzej Hajda
    Cc: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Mike Turquette
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Greg KH
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrzej Hajda
     

12 Feb, 2015

1 commit


10 Oct, 2014

1 commit

  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c and
    in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c and
    move at least pid_of_stack/task_of_stack there to avoid the code
    duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Aug, 2014

1 commit

  • Aleksei hit a soft lockup while reading /proc/PID/smaps. David
    investigated the problem and suggested the right fix.

    while_each_thread() is racy and should die; this patch updates
    vm_is_stack() to use for_each_thread() instead, as sketched below.
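
    The fixed iteration is roughly (a sketch of the pattern, not the
    verbatim diff):

        struct task_struct *t;

        rcu_read_lock();
        for_each_thread(task, t) {
                if (vm_is_stack_for_task(t, vma)) {
                        ret = t->pid;
                        break;
                }
        }
        rcu_read_unlock();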

    Signed-off-by: Oleg Nesterov
    Reported-by: Aleksei Besogonov
    Tested-by: Aleksei Besogonov
    Suggested-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

07 Aug, 2014

1 commit

  • The functions krealloc(), __krealloc() and kzfree() belong to the
    slab API, so they should be placed in slab_common.c.

    Also move the slab allocator's tracepoint definitions to
    slab_common.c. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 May, 2014

1 commit


13 Apr, 2014

1 commit

  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak
    for __attribute__((weak)). I've replaced all instances of gcc
    attributes with the right macro in the memory management (/mm)
    subsystem.
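
    For example, a representative conversion:

        /* before: raw gcc attribute */
        void __attribute__((weak)) foo(void);

        /* after: the <linux/compiler.h> convenience macro */
        void __weak foo(void);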

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     

08 Mar, 2014

1 commit


22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get
    the maximum usage of memory without swapping. With growing memory,
    the 1%-of-all-RAM grain provided by overcommit_ratio has become too
    coarse for these workloads (on a 2TB machine it represents no less
    than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable that allows
    a much finer grain.
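
    The resulting limit computation roughly becomes (a sketch, assuming
    the factored vm_commit_limit() helper):

        unsigned long allowed;

        if (sysctl_overcommit_kbytes)
                allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
        else
                allowed = (totalram_pages - hugetlb_total_pages())
                          * sysctl_overcommit_ratio / 100;
        allowed += total_swap_pages;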

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

15 Jan, 2014

1 commit

  • Commit 8456a648cf44 ("slab: use struct page for slab management") causes
    a crash in the LVM2 testsuite on PA-RISC (the crashing test is
    fsadm.sh). The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and
    later.

    Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
    CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
    task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

    YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
    PSW: 00001000000001101111100100001110 Not tainted
    r00-03 000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
    r04-07 00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
    r08-11 0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
    r12-15 0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
    r16-19 000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
    r20-23 0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
    r24-27 00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
    r28-31 202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
    sr00-03 000000000532c000 0000000000000000 0000000000000000 000000000532c000
    sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

    IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
    IIR: 539c0030 ISR: 00000000202d6000 IOR: 000006202224647d
    CPU: 3 CR30: 000000413edd8000 CR31: 0000000000000000
    ORIG_R28: 00000000405a95e0
    IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
    IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
    RP(r2): flush_dcache_page+0x128/0x388
    Backtrace:
    flush_dcache_page+0x128/0x388
    lo_splice_actor+0x90/0x148 [loop]
    splice_from_pipe_feed+0xc0/0x1d0
    __splice_from_pipe+0xac/0xc0
    lo_direct_splice_actor+0x1c/0x70 [loop]
    splice_direct_to_actor+0xec/0x228
    lo_receive+0xe4/0x298 [loop]
    loop_thread+0x478/0x640 [loop]
    kthread+0x134/0x168
    end_fault_vector+0x20/0x28
    xfs_setsize_buftarg+0x0/0x90 [xfs]

    Kernel panic - not syncing: Bad Address (null pointer deref?)

    Commit 8456a648cf44 changes the page structure so that the slab
    subsystem reuses the page->mapping field.

    The crash happens in the following way:
    * XFS allocates some memory from slab and issues a bio to read data
    into it.
    * the bio is sent to the loopback device.
    * lo_receive creates an actor and calls splice_direct_to_actor.
    * lo_splice_actor copies data to the target page.
    * lo_splice_actor calls flush_dcache_page because the page may be
    mapped by userspace. In that case we need to flush the kernel cache.
    * flush_dcache_page asks for the list of userspace mappings, however
    the page->mapping field is reused by the slab subsystem for a
    different purpose. This causes the crash.

    Note that other architectures without coherent caches (sparc, arm, mips)
    also call page_mapping from flush_dcache_page, so they may crash in the
    same way.

    This patch fixes this bug by testing if the page is a slab page in
    page_mapping and returning NULL if it is.

    The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
    earlier kernels in the same scenario on architectures without cache
    coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
    to stable kernels.

    In the old kernels, the function page_mapping is placed in
    include/linux/mm.h, so you should modify the patch accordingly when
    backporting it.
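
    The fix itself is a short test at the top of page_mapping() (a
    sketch):

        struct address_space *page_mapping(struct page *page)
        {
                /* slab reuses page->mapping for its own bookkeeping, so
                 * a slab page has no userspace mapping to report */
                if (unlikely(PageSlab(page)))
                        return NULL;

                /* swap-cache and anon handling elided */
                return page->mapping;
        }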

    Signed-off-by: Mikulas Patocka
    Cc: John David Anglin ]
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Reviewed-by: Joonsoo Kim
    Cc: Helge Deller
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

13 Nov, 2013

1 commit

  • The same calculation is currently done in three different places.
    Factor that code so future changes have to be made in only one place.
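
    The factored helper (a sketch matching the formula being
    deduplicated):

        unsigned long vm_commit_limit(void)
        {
                return ((totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100) + total_swap_pages;
        }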

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

12 Sep, 2013

1 commit

  • PageSwapCache() is always false when !CONFIG_SWAP, so the compiler
    properly discards the related code. Therefore, we don't need the
    #ifdef explicitly.
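
    So guards like the following can go (illustrative; 'nr_swapcache' is
    a made-up counter):

        /* before */
        #ifdef CONFIG_SWAP
                if (PageSwapCache(page))
                        nr_swapcache++;
        #endif

        /* after: PageSwapCache() is compile-time 0 without CONFIG_SWAP,
         * so the compiler discards the branch anyway */
        if (PageSwapCache(page))
                nr_swapcache++;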

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

24 Feb, 2013

4 commits

  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.
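
    The mapping from a swap entry to its address_space is a direct array
    index (a sketch of the added interface):

        struct address_space swapper_spaces[MAX_SWAPFILES];

        #define swap_address_space(entry) \
                (&swapper_spaces[swp_type(entry)])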

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • According to akpm, this saves 1/2k text and makes things simple for the
    next patch.

    Numbers from Minchan:

    add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
    function old new delta
    page_mapping - 48 +48
    do_task_stat 2292 2308 +16
    page_remove_rmap 240 248 +8
    load_elf_binary 4500 4508 +8
    update_queue 532 536 +4
    scsi_probe_and_add_lun 2892 2896 +4
    lookup_fast 644 648 +4
    vcs_read 1040 1036 -4
    __ip_route_output_key 1904 1900 -4
    ip_route_input_noref 2508 2500 -8
    shmem_file_aio_read 784 772 -12
    __isolate_lru_page 272 256 -16
    shmem_replace_page 708 688 -20
    mark_buffer_dirty 228 208 -20
    __set_page_dirty_buffers 240 220 -20
    __remove_mapping 276 256 -20
    update_mmu_cache 500 476 -24
    set_page_dirty_balance 92 68 -24
    set_page_dirty 172 148 -24
    page_evictable 88 64 -24
    page_cache_pipe_buf_steal 248 224 -24
    clear_page_dirty_for_io 340 316 -24
    test_set_page_writeback 400 372 -28
    test_clear_page_writeback 516 488 -28
    invalidate_inode_page 156 128 -28
    page_mkclean 432 400 -32
    flush_dcache_page 360 328 -32
    __set_page_dirty_nobuffers 324 280 -44
    shrink_page_list 2412 2356 -56

    Signed-off-by: Shaohua Li
    Suggested-by: Andrew Morton
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple; however, there was no equivalent code in mm_populate(),
    which caused issues.

    This could be fixed by introducing the same rounding in
    mm_populate(); however, I think it's preferable to make
    do_mmap_pgoff() return populate as a size rather than as a boolean,
    so we don't have to duplicate the size-rounding logic in
    mm_populate().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer
    populating such mappings until after the mmap_sem write lock has been
    released. This is implemented as a generalization of the former
    do_mlock_pages(), which accomplished the same task but was used
    during mlock() / mlockall().
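
    The resulting call pattern (a simplified sketch of vm_mmap_pgoff
    after this series):

        down_write(&mm->mmap_sem);
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);

        /* populate outside the write-locked region; it is a size in
         * bytes, so zero means there is nothing to fault in */
        if (populate)
                mm_populate(ret, populate);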

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

29 Oct, 2012

1 commit


16 Oct, 2012

1 commit