15 Aug, 2014

2 commits

  • Merge leftovers from Andrew Morton:
    "A few leftovers.

    I have a bunch of OCFS2 patches which are still out for review and
    which I might sneak along after -rc1. Partly my fault - I should send
    my review pokes out earlier"

    * emailed patches from Andrew Morton:
    mm: fix CROSS_MEMORY_ATTACH help text grammar
    drivers/mfd/rtsx_usb.c: export device table
    mm, hugetlb_cgroup: align hugetlb cgroup limit to hugepage size

    Linus Torvalds
     
  • Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource
    counter since it makes no sense to allow a partial page to be charged.

    Because the hugetlb cgroup also uses the resource counter, its limit is
    likewise aligned to PAGE_SIZE, but that makes no sense unless it is
    aligned to the size of the hugepage being limited.

    Align hugetlb cgroup limit to hugepage size.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Aug, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and a general rework of the mnt_pin mechanism. That
    allows us to go for the delayed-mntput stuff, which will permit
    mntput() on a deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     
  • If DIO results in a short write and the sync write fails, we want to bugger
    off whether the DIO part has written anything or not; the logic on return
    will take care of the right return value.

    Cc: stable@vger.kernel.org [3.16]
    Reported-by: Anton Altaparmakov
    Signed-off-by: Al Viro

    Al Viro
     

09 Aug, 2014

12 commits

  • If we set SEAL_WRITE on a file, we must make sure there cannot be any
    ongoing write-operations on the file. For write() calls, we simply lock
    the inode mutex; for mmap(), we simply verify there are no writable
    mappings. However, there might be pages pinned by AIO, Direct-IO and
    similar operations via GUP. We must make sure those do not write to the
    memfd file after we set SEAL_WRITE.

    As there is no way to notify GUP users to drop pages or to wait for them
    to be done, we implement the wait ourselves: when setting SEAL_WRITE, we
    check all pages for their ref-count. If it's bigger than 1, we know
    there's some user of the page. We then mark the page and wait for up to
    150ms for those ref-counts to be dropped. If the ref-counts are not
    dropped in time, we refuse the seal operation.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
     
  • memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
    that you can pass to mmap(). It can support sealing and avoids any
    connection to user-visible mount-points. Thus, it's not subject to quotas
    on mounted file-systems and can be used like malloc()'ed memory, but with
    a file-descriptor to it.

    memfd_create() returns the raw shmem file, so calls like ftruncate() can
    be used to modify the underlying inode. Also calls like fstat() will
    return proper information and mark the file as a regular file. If you want
    sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
    supported (like on all other regular files).

    Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
    subject to a filesystem size limit. It is still properly accounted to
    memcg limits, though, and to the same overcommit or no-overcommit
    accounting as all user memory.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
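
    A minimal userspace sketch of the memfd_create() flow described in the
    entry above (not part of the commit; it assumes kernel headers that define
    __NR_memfd_create, and the MFD_ALLOW_SEALING fallback mirrors the uapi
    definition in case the C library predates the wrapper):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MFD_ALLOW_SEALING              /* from include/uapi/linux/memfd.h */
        #define MFD_ALLOW_SEALING 0x0002U
        #endif

        int main(void)
        {
                /* Anonymous, unlinked shmem file; sealing explicitly enabled. */
                int fd = syscall(__NR_memfd_create, "example", MFD_ALLOW_SEALING);
                if (fd < 0) {
                        perror("memfd_create");
                        return 1;
                }

                /* Regular-file calls work on it: set a size, then map it. */
                if (ftruncate(fd, 4096) < 0)
                        return 1;
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;
                strcpy(p, "hello");
                /* The fd can be passed to another process like any other fd. */
                return 0;
        }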
     
  • If two processes share a common memory region, they usually want some
    guarantees to allow safe access. This often includes:
    - one side cannot overwrite data while the other reads it
    - one side cannot shrink the buffer while the other accesses it
    - one side cannot grow the buffer beyond previously set boundaries

    If there is a trust-relationship between both parties, there is no need
    for policy enforcement. However, if there's no trust relationship (e.g.,
    for general-purpose IPC) sharing memory-regions is highly fragile and
    often not possible without local copies. Look at the following two
    use-cases:

    1) A graphics client wants to share its rendering-buffer with a
    graphics-server. The memory-region is allocated by the client for
    read/write access and a second FD is passed to the server. While
    scanning out from the memory region, the server has no guarantee that
    the client doesn't shrink the buffer at any time, requiring rather
    cumbersome SIGBUS handling.
    2) A process wants to perform an RPC on another process. To avoid huge
    bandwidth consumption, zero-copy is preferred. After a message is
    assembled in-memory and a FD is passed to the remote side, both sides
    want to be sure that neither side modifies this shared copy anymore. The
    source may have put sensitive data into the message without a separate
    copy and the target may want to parse the message inline, to avoid a
    local copy.

    While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
    ways to achieve most of this, the first one is disproportionately ugly to
    use in libraries and the latter two are broken/racy or even disabled due
    to denial of service attacks.

    This patch introduces the concept of SEALING. If you seal a file, a
    specific set of operations is blocked on that file forever. Unlike locks,
    seals can only be set, never removed. Hence, once you have verified that a
    specific set of seals is set, you're guaranteed that no one can perform the
    blocked operations on this file anymore.

    An initial set of SEALS is introduced by this patch:
    - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
    in size. This affects ftruncate() and open(O_TRUNC).
    - GROW: If SEAL_GROW is set, the file in question cannot be increased
    in size. This affects ftruncate(), fallocate() and write().
    - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
    are possible. This affects fallocate(PUNCH_HOLE), mmap() and
    write().
    - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
    This basically prevents the F_ADD_SEAL operation on a file and
    can be set to prevent others from adding further seals that you
    don't want.

    The described use-cases can easily use these seals to provide safe use
    without any trust-relationship:

    1) The graphics server can verify that a passed file-descriptor has
    SEAL_SHRINK set. This allows safe scanout, while the client is
    allowed to increase buffer size for window-resizing on-the-fly.
    Concurrent writes are explicitly allowed.
    2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
    SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
    process can modify the data while the other side parses it.
    Furthermore, it guarantees that even with writable FDs passed to the
    peer, it cannot increase the size to hit memory-limits of the source
    process (in case the file-storage is accounted to the source).

    The new API is an extension to fcntl(), adding two new commands:
    F_GET_SEALS: Return a bitset describing the seals on the file. This
    can be called on any FD if the underlying file supports
    sealing.
    F_ADD_SEALS: Change the seals of a given file. This requires WRITE
    access to the file and F_SEAL_SEAL may not already be set.
    Furthermore, the underlying file must support sealing and
    there may not be any existing shared mapping of that file.
    Otherwise, EBADF/EPERM is returned.
    The given seals are _added_ to the existing set of seals
    on the file. You cannot remove seals again.

    The fcntl() handler is currently specific to shmem and disabled on all
    files. A file needs to explicitly support sealing for this interface to
    work. A separate syscall is added in a follow-up, which creates files that
    support sealing. There is no intention to support this on other
    file-systems. Semantics are unclear for non-volatile files and we lack any
    use-case right now. Therefore, the implementation is specific to shmem.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
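
    A userspace sketch of the two fcntl() commands described in the entry
    above, seen from both ends of an IPC exchange (not from the commit; the
    numeric fallbacks mirror the uapi <linux/fcntl.h> values in case the libc
    headers lack them):

        #define _GNU_SOURCE
        #include <fcntl.h>

        #ifndef F_ADD_SEALS                    /* fallbacks mirroring the uapi values */
        #define F_ADD_SEALS    1033
        #define F_GET_SEALS    1034
        #define F_SEAL_SEAL    0x0001
        #define F_SEAL_SHRINK  0x0002
        #define F_SEAL_GROW    0x0004
        #define F_SEAL_WRITE   0x0008
        #endif

        /* Sender: freeze size and contents, then forbid adding further seals.
         * Fails with EBUSY while a writable mapping of the file still exists. */
        static int seal_message(int memfd)
        {
                return fcntl(memfd, F_ADD_SEALS,
                             F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL);
        }

        /* Receiver: only parse the message in place if the sender can no
         * longer shrink, grow or rewrite it. */
        static int message_is_safe(int memfd)
        {
                int want = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
                int seals = fcntl(memfd, F_GET_SEALS);

                return seals >= 0 && (seals & want) == want;
        }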
     
  • This patch (of 6):

    The i_mmap_writable field counts existing writable mappings of an
    address_space. To allow drivers to prevent new writable mappings, make
    this counter signed and prevent new writable mappings if it is negative.
    This is modelled after i_writecount and DENYWRITE.

    This will be required by the shmem-sealing infrastructure to prevent any
    new writable mappings after the WRITE seal has been set. In case there
    exists a writable mapping, this operation will fail with EBUSY.

    Note that we rely on the fact that iff you already own a writable mapping,
    you can increase the counter without using the helpers. This is the same
    that we do for i_writecount.

    Signed-off-by: David Herrmann
    Acked-by: Hugh Dickins
    Cc: Michael Kerrisk
    Cc: Ryan Lortie
    Cc: Lennart Poettering
    Cc: Daniel Mack
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Herrmann
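
    The counter pattern described above, sketched with C11 atomics rather than
    the kernel's atomic_t (the function names here are illustrative, not the
    real helpers): a positive count means writable mappings exist, a negative
    count means new writable mappings are denied.

        #include <stdatomic.h>
        #include <stdbool.h>

        static atomic_int i_mmap_writable;      /* stand-in for the address_space field */

        /* mmap(PROT_WRITE, MAP_SHARED) path: fail if writes have been denied. */
        static bool map_writable(void)
        {
                int v = atomic_load(&i_mmap_writable);
                while (v >= 0)
                        if (atomic_compare_exchange_weak(&i_mmap_writable, &v, v + 1))
                                return true;    /* counted one more writable mapping */
                return false;                   /* counter is negative: denied */
        }

        /* SEAL_WRITE path: fail (EBUSY) if a writable mapping still exists. */
        static bool deny_writable(void)
        {
                int v = atomic_load(&i_mmap_writable);
                while (v <= 0)
                        if (atomic_compare_exchange_weak(&i_mmap_writable, &v, v - 1))
                                return true;    /* writes are now denied */
                return false;                   /* counter is positive: EBUSY */
        }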
     
  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • zswap_entry_cache_destroy() is only called by __init init_zswap().

    This patch also fixes the function name spelling: s/destory/destroy/ in
    zswap_entry_cache_destroy().

    Signed-off-by: Fabian Frederick
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Charge migration currently disables IRQs twice to update the charge
    statistics for the old page and then again for the new page.

    But migration is a seamless transition of a charge from one physical
    page to another one of the same size, so this should be a non-event from
    an accounting point of view. Leave the statistics alone.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pages are now uncharged at release time, and all sources of batched
    uncharges operate on lists of pages. Directly use those lists, and
    get rid of the per-task batching state.

    This also batches statistics accounting, in addition to the res
    counter charges, to reduce IRQ-disabling and re-enabling.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Naoya Horiguchi
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
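
    A kernel-side sketch of the call-site pattern for the three operations
    named above, roughly what an anonymous-fault path looks like after this
    series (the exact argument lists, e.g. the lrucare flag, are simplified
    assumptions):

        static int charge_new_anon_page(struct page *page, struct mm_struct *mm,
                                        struct vm_area_struct *vma, unsigned long addr)
        {
                struct mem_cgroup *memcg;

                /* 1) Reserve a charge; may reclaim from the memcg. */
                if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
                        return -ENOMEM;

                /* 2) Give the page its type (page->mapping via rmap), then
                 *    commit the reserved charge and put the page on the LRU. */
                page_add_new_anon_rmap(page, vma, addr);
                mem_cgroup_commit_charge(page, memcg, false);
                lru_cache_add_active_or_unevictable(page, vma);
                return 0;

                /* Any failure between try and commit would instead call
                 * mem_cgroup_cancel_charge(page, memcg) to abort the transaction. */
        }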
     
  • Aleksei hit a soft lockup while reading /proc/PID/smaps. David
    investigated the problem and suggested the right fix.

    while_each_thread() is racy and should die; this patch updates
    vm_is_stack().

    Signed-off-by: Oleg Nesterov
    Reported-by: Aleksei Besogonov
    Tested-by: Aleksei Besogonov
    Suggested-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC").

    commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the
    system with !CONFIG_NUMA has only one memory node. But this turns out to
    be false, according to a report from Geert. His system, m68k, has many
    memory nodes and is configured with !CONFIG_NUMA, so it couldn't boot with
    the above change.

    Here is his failure report.

    With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM:

    enable_cpucache failed for radix_tree_node, error 12.
    kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522!
    *** TRAP #7 *** FORMAT=0
    Current process id is 0
    BAD KERNEL TRAP: 00000000
    Modules linked in:
    PC: [] kmem_cache_init_late+0x70/0x8c
    SR: 2200 SP: 00345f90 a2: 0034c2e8
    d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942
    d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682
    Process swapper (pid: 0, task=0034c2e8)
    Frame format=0
    Stack from 00345fc4:
    002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000
    00000000 00000000 00000000 00000000 00000000 003ac942 00000000
    003912d6
    Call Trace: [] parse_args+0x0/0x2ca
    [] strlen+0x0/0x1a
    [] start_kernel+0x23c/0x428
    [] _sinittext+0x2d6/0x95e

    Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 0035 4b1c 61ff
    fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c
    Disabling lock debugging due to kernel taint
    Kernel panic - not syncing: Attempted to kill the idle task!

    Although there is an alternative way to fix this issue, such as disabling
    the use of the alien cache for !CONFIG_NUMA, reverting the offending commit
    seems better for now.

    Signed-off-by: Joonsoo Kim
    Reported-by: Geert Uytterhoeven
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2014

3 commits


07 Aug, 2014

21 commits

  • Change zswap to use the zpool api instead of directly using zbud. Add a
    boot-time param to allow selecting which zpool implementation to use,
    with zbud as the default.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Update zbud and zsmalloc to implement the zpool api.

    [fengguang.wu@intel.com: make functions static]
    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add zpool api.

    zpool provides an interface for memory storage, typically of compressed
    memory. Users can select what backend to use; currently the only
    implementations are zbud, a low density implementation with up to two
    compressed pages per storage page, and zsmalloc, a higher density
    implementation with multiple compressed pages per storage page.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
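
    A kernel-side sketch of how a zpool user might store one compressed object
    (illustrative only; the real prototypes live in include/linux/zpool.h and
    details such as the ops argument here are assumptions):

        static int zpool_demo(void)
        {
                unsigned long handle;
                struct zpool *pool;
                void *buf;

                /* Pick a backend by name: "zbud" or "zsmalloc". */
                pool = zpool_create_pool("zbud", GFP_KERNEL, NULL);
                if (!pool)
                        return -ENOMEM;

                if (zpool_malloc(pool, 128, GFP_KERNEL, &handle) == 0) {
                        buf = zpool_map_handle(pool, handle, ZPOOL_MM_RW);
                        memset(buf, 0, 128);    /* compressed data would go here */
                        zpool_unmap_handle(pool, handle);
                        zpool_free(pool, handle);
                }

                zpool_destroy_pool(pool);
                return 0;
        }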
     
  • Change the type of the zbud_alloc() size param from unsigned int to
    size_t.

    Technically, this should not make any difference, as the zbud
    implementation already restricts the size to well within either type's
    limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
    size_t, this brings the size parameter type in line with zsmalloc/zpool.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • User-visible effect:
    Architectures that choose this method of maintaining cache coherency
    (MIPS and xtensa currently) are able to use high memory on cores with
    aliasing data cache. Without this fix such architectures cannot use
    high memory (in case of xtensa it means that at most 128 MBytes of
    physical memory is available).

    The problem:
    VIPT cache with way size larger than MMU page size may suffer from
    aliasing problem: a single physical address accessed via different
    virtual addresses may end up in multiple locations in the cache.
    Virtual mappings of a physical address that always get cached in
    different cache locations are said to have different colors. L1 caching
    hardware usually doesn't handle this situation, leaving it up to
    software. Software must avoid this situation as it leads to data
    corruption.

    What can be done:
    One way to handle this is to flush and invalidate the data cache every time
    a page mapping changes color. The other way is to always map a physical
    page at a virtual address with the same color. Low-memory pages already
    have this property. Giving the architecture a way to control the color of
    high-memory page mappings allows reuse of the existing low-memory cache
    alias handling code.

    How this is done with this patch:
    Provide hooks that allow architectures with aliasing caches to align the
    mapping address of high pages according to their color. Such
    architectures may enforce similar coloring of low- and high-memory page
    mappings and reuse existing cache management functions to support
    highmem.

    This code is based on the implementation of similar feature for MIPS by
    Leonid Yegoshin.

    Signed-off-by: Max Filippov
    Cc: Leonid Yegoshin
    Cc: Chris Zankel
    Cc: Marc Gauthier
    Cc: David Rientjes
    Cc: Steven Hill
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected by
    several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't expect
    the release callback function to remove itself from the list. Therefore,
    the callback function in the kernel module can't release the
    mmu_notifier object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback by adding a new
    mmu_notifier_call_srcu function that receives a function pointer and
    calls that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to call, before that, a new function called
    mmu_notifier_unregister_no_release which, as its name implies,
    unregisters a notifier without calling its notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
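
    A driver-side sketch of the teardown pattern described above, using the
    two helpers named in the commit (the surrounding structure and field
    names are illustrative):

        struct my_ctx {
                struct mmu_notifier mn;
                struct rcu_head rcu;
                /* ... driver state bound to the mm ... */
        };

        static void my_ctx_free(struct rcu_head *rcu)
        {
                kfree(container_of(rcu, struct my_ctx, rcu));
        }

        static void my_ctx_teardown(struct my_ctx *ctx, struct mm_struct *mm)
        {
                /* Unhook from the mm without invoking our own ->release()
                 * callback, then defer the actual free until the SRCU grace
                 * period has elapsed, since the notifier core may still be
                 * walking the notifier list. */
                mmu_notifier_unregister_no_release(&ctx->mn, mm);
                mmu_notifier_call_srcu(&ctx->rcu, my_ctx_free);
        }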
     
  • Charge reclaim and OOM currently use the charge batch variable, but
    batching is already disabled at that point. To simplify the charge
    logic, the batch variable is reset to the original request size when
    reclaim is entered, so it's functionally equivalent, but misleading.

    Switch reclaim/OOM to nr_pages, which is the original request size.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Things can go wrong if fault_around_bytes is changed under
    do_fault_around(): between fault_around_mask() and fault_around_pages().

    Let's read fault_around_bytes only once during do_fault_around() and
    calculate mask based on the reading.

    Note: fault_around_bytes can only be updated via the debugfs interface.
    Also, I've tried but was not able to trigger bad behaviour without the
    patch, so I would not consider this patch urgent.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
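
    The fix boils down to the snapshot pattern below, shown as plain C rather
    than the actual mm/memory.c hunk: take a single read of the tunable and
    derive every value from that one snapshot, so a concurrent debugfs update
    cannot hand out a mismatched mask/size pair.

        #include <stdatomic.h>
        #include <stddef.h>

        static _Atomic size_t fault_around_bytes = 65536;   /* debugfs-tunable stand-in */

        static void do_fault_around_sketch(size_t page_size)
        {
                size_t bytes = atomic_load(&fault_around_bytes);  /* read once */
                size_t mask  = ~(bytes - 1) & ~(page_size - 1);   /* align down */
                size_t pages = bytes / page_size;

                /* mask and pages now describe the same fault-around window,
                 * whatever a concurrent write to the tunable does. */
                (void)mask;
                (void)pages;
        }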
     
  • When memory cgroups are enabled, the code that decides whether to force a
    scan of anonymous pages in get_scan_count() compares global values (free,
    high_watermark) to a value that is restricted to a memory cgroup (file).
    This makes the code over-eager to force an anon scan.

    For instance, it will force an anon scan when scanning a memcg that is
    mainly populated by anonymous pages, even when there are plenty of file
    pages to get rid of in other memcgs, and even when swappiness == 0. It
    breaks users' expectations about swappiness and hurts performance.

    This patch makes sure that a forced anon scan only happens when there are
    not enough file pages in the whole zone, not just in one random memcg.

    [hannes@cmpxchg.org: cleanups]
    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Quite a while ago, get_scan_ratio() was renamed get_scan_count(); however,
    a comment in shrink_active_list() still mentions it. This patch fixes the
    outdated comment.

    Signed-off-by: Jerome Marchand
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • The oom killer scans each process and determines whether it is eligible
    for oom kill or whether the oom killer should abort because of
    concurrent memory freeing. It will abort when an eligible process is
    found to have TIF_MEMDIE set, meaning it has already been oom killed and
    we're waiting for it to exit.

    Processes with task->mm == NULL should not be considered because they
    are either kthreads or have already detached their memory and killing
    them would not lead to memory freeing. That memory is only freed after
    exit_mm() has returned, however, and not when task->mm is first set to
    NULL.

    Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
    is no longer considered for oom kill, but only until exit_mm() has
    returned. This was fragile in the past because it relied on
    exit_notify() to be reached before no longer considering TIF_MEMDIE
    processes.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is possible for some platforms, such as powerpc, to set HPAGE_SHIFT to
    0 to indicate that huge pages are not supported.

    When this is the case, hugetlbfs can be disabled at boot time:
    hugetlbfs: disabling because there are no supported hugepage sizes

    Then in dissolve_free_huge_pages(), order is kept at its maximum (64 on
    64-bit), and the for loop below won't end: for (pfn = start_pfn; pfn <
    end_pfn; pfn += 1 << order)

    As suggested by Naoya, the fix below checks hugepages_supported() before
    calling dissolve_free_huge_pages().

    [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
    Signed-off-by: Li Zhong
    Acked-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Signed-off-by: David Rientjes
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
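
    The shape of the fix as described above, at the call site (a sketch, not
    the exact hunk):

        /* Skip the dissolve loop entirely when huge pages are unsupported,
         * so the "pfn += 1 << order" stride is never used with a bogus order. */
        if (hugepages_supported())
                dissolve_free_huge_pages(start_pfn, end_pfn);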
     
  • __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp
    or not, has gained more users. Their use is not necessarily wrong; they
    are trying to do a memory allocation that can easily fail without
    disturbing kswapd, so the bit has gained additional use cases.

    This restructures the check to determine whether MIGRATE_SYNC_LIGHT
    should be used for memory compaction in the page allocator. Rather than
    testing solely for __GFP_NO_KSWAPD, test for all bits that must be set
    for thp allocations.

    This also moves the check to be done only after the page allocator is
    aborted for deferred or contended memory compaction since setting
    migration_mode for this case is pointless.

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
    to imply that they require locking semantics to avoid out_of_memory()
    being reordered.

    zone_scan_lock is required for both functions to ensure that there is
    proper locking synchronization.

    Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
    clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
    locking semantics.

    At the same time, convert oom_zonelist_trylock() to return bool instead
    of int since only success and failure are tested.

    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With memoryless node support being worked on, it's possible that, as an
    optimization, a node may not have a non-NULL zonelist. When CONFIG_NUMA
    is enabled and node 0 is memoryless, this means the zonelist for
    first_online_node may become NULL.

    The oom killer requires a zonelist that includes all memory zones for
    the sysrq trigger and pagefault out of memory handler.

    Ensure that a non-NULL zonelist is always passed to the oom killer.

    [akpm@linux-foundation.org: fix non-numa build]
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This series of patches fixes a problem that occurs when adding memory in a
    bad manner. For example: for an x86_64 machine booted with "mem=400M" and
    with 2GiB of memory installed, the following commands cause a problem:

    # echo 0x40000000 > /sys/devices/system/memory/probe
    [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
    # echo 0x48000000 > /sys/devices/system/memory/probe
    [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
    # echo online_movable > /sys/devices/system/memory/memory9/state
    # echo 0x50000000 > /sys/devices/system/memory/probe
    [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
    # echo 0x58000000 > /sys/devices/system/memory/probe
    [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
    # echo online_movable > /sys/devices/system/memory/memory11/state
    # echo online> /sys/devices/system/memory/memory8/state
    # echo online> /sys/devices/system/memory/memory10/state
    # echo offline> /sys/devices/system/memory/memory9/state
    [ 30.558819] Offlined Pages 32768
    # free
    total used free shared buffers cached
    Mem: 780588 18014398509432020 830552 0 0 51180
    -/+ buffers/cache: 18014398509380840 881732
    Swap: 0 0 0

    This is because the above commands probe higher memory after onlining a
    section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
    for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.

    After the second online_movable, the problem can be observed from
    zoneinfo:

    # cat /proc/zoneinfo
    ...
    Node 0, zone Movable
    pages free 65491
    min 250
    low 312
    high 375
    scanned 0
    spanned 18446744073709518848
    present 65536
    managed 65536
    ...

    This series of patches solves the problem by checking ZONE_MOVABLE when
    choosing a zone for new memory. If the new memory is inside or higher than
    ZONE_MOVABLE, make it go there instead.

    After applying this series of patches, following are free and zoneinfo
    result (after offlining memory9):

    bash-4.2# free
    total used free shared buffers cached
    Mem: 780956 80112 700844 0 0 51180
    -/+ buffers/cache: 28932 752024
    Swap: 0 0 0

    bash-4.2# cat /proc/zoneinfo

    Node 0, zone DMA
    pages free 3389
    min 14
    low 17
    high 21
    scanned 0
    spanned 4095
    present 3998
    managed 3977
    nr_free_pages 3389
    ...
    start_pfn: 1
    inactive_ratio: 1
    Node 0, zone DMA32
    pages free 73724
    min 341
    low 426
    high 511
    scanned 0
    spanned 98304
    present 98304
    managed 92958
    nr_free_pages 73724
    ...
    start_pfn: 4096
    inactive_ratio: 1
    Node 0, zone Normal
    pages free 32630
    min 120
    low 150
    high 180
    scanned 0
    spanned 32768
    present 32768
    managed 32768
    nr_free_pages 32630
    ...
    start_pfn: 262144
    inactive_ratio: 1
    Node 0, zone Movable
    pages free 65476
    min 241
    low 301
    high 361
    scanned 0
    spanned 98304
    present 65536
    managed 65536
    nr_free_pages 65476
    ...
    start_pfn: 294912
    inactive_ratio: 1

    This patch (of 7):

    Introduce zone_for_memory() in arch independent code for
    arch_add_memory() use.

    Many arch_add_memory() implementations simply select ZONE_HIGHMEM or
    ZONE_NORMAL and add the new memory to it. However, with the existence of
    ZONE_MOVABLE, the selection should be considered carefully: if new, higher
    memory is added after ZONE_MOVABLE is set up, the default zone and
    ZONE_MOVABLE may overlap each other.

    should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
    already contains memory, compare the address of the new memory with that
    of the movable memory. If the new memory is higher than the movable
    memory, it should be added to ZONE_MOVABLE instead of the default zone.

    Signed-off-by: Wang Nan
    Cc: Zhang Yanfei
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: "Mel Gorman"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
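
    A sketch of how an arch_add_memory() implementation would use the new
    helper (the zone_for_memory() parameter list shown here is an assumption
    based on the description above):

        int arch_add_memory(int nid, u64 start, u64 size)
        {
                struct pglist_data *pgdat = NODE_DATA(nid);
                unsigned long start_pfn = start >> PAGE_SHIFT;
                unsigned long nr_pages = size >> PAGE_SHIFT;

                /* Default to ZONE_NORMAL, but let the core steer the range into
                 * ZONE_MOVABLE when it lies inside or above existing movable memory. */
                struct zone *zone = pgdat->node_zones +
                                    zone_for_memory(nid, start, size, ZONE_NORMAL);

                return __add_pages(nid, zone, start_pfn, nr_pages);
        }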
     
  • This function is never called for memcg caches, because they are
    unmergeable, so remove the dead code.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Setting vm_dirty_bytes and dirty_background_bytes is not protected by
    any serialization.

    Therefore, it's possible for either variable to change value after the
    test in global_dirty_limits() to determine whether available_memory
    needs to be initialized or not.

    Always ensure that available_memory is properly initialized.

    Signed-off-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") improved the previous khugepaged logic, which allocated a
    transparent hugepage from the node of the first page being collapsed.

    However, it is still possible to collapse pages to remote memory which
    may suffer from additional access latency. With the current policy, it
    is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
    remotely if the majority are allocated from that node.

    When zone_reclaim_mode is enabled, it means the VM should make every
    attempt to allocate locally to prevent NUMA performance degradation. In
    this case, we do not want to collapse hugepages to remote nodes that
    would suffer from increased access latency. Thus, when
    zone_reclaim_mode is enabled, only allow collapsing to nodes with
    RECLAIM_DISTANCE or less.

    There is no functional change for systems that disable
    zone_reclaim_mode.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes