13 Jun, 2013

6 commits

  • The lockless reclaim hierarchy iterator currently has a misplaced
    barrier that can lead to use-after-free crashes.

The reclaim hierarchy iterator consists of a sequence count and a
    position pointer that are read and written locklessly, with memory
    barriers enforcing ordering.

    The write side sets the position pointer first, then updates the
    sequence count to "publish" the new position. Likewise, the read side
    must read the sequence count first, then the position. If the sequence
    count is up to date, it's guaranteed that the position is up to date as
    well:

    writer:                          reader:
    iter->position = position        if iter->sequence == expected:
    smp_wmb()                            smp_rmb()
    iter->sequence = sequence            position = iter->position

    However, the read side barrier is currently misplaced, which can lead to
    dereferencing stale position pointers that no longer point to valid
    memory. Fix this.
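
    As a minimal sketch of the corrected ordering (illustrative types and
    names, not the actual memcontrol.c code), the reader must issue its
    barrier between loading the sequence count and loading the position:

        struct iter { unsigned int sequence; void *position; };

        static void publish(struct iter *it, void *pos, unsigned int seq)
        {
                it->position = pos;
                smp_wmb();              /* order position before sequence */
                it->sequence = seq;
        }

        static void *read_position(struct iter *it, unsigned int expected)
        {
                if (it->sequence != expected)
                        return NULL;
                smp_rmb();              /* order sequence before position */
                return it->position;    /* at least as new as the sequence */
        }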

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Glauber Costa
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The bitmap accessed by bitops must have enough size to hold the required
    numbers of bits rounded up to a multiple of BITS_PER_LONG. And the
    bitmap must not be zeroed by memset() if the number of bits cleared is
    not a multiple of BITS_PER_LONG.

This fixes incorrect zeroing and allocation size for frontswap_map.
    The incorrect zeroing doesn't cause any problem in practice because
    frontswap_map is freed just after being zeroed, but the wrongly
    calculated allocation size can cause real problems.

    For 32-bit systems, the allocation size of frontswap_map is about
    twice as large as required. For 64-bit systems, the allocation size
    is smaller than required if the number of bits is not a multiple of
    BITS_PER_LONG.
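
    A hedged sketch of the rule (nbits and map are illustrative names):

        /* Size the bitmap in whole longs; BITS_TO_LONGS rounds up. */
        unsigned long *map = vzalloc(BITS_TO_LONGS(nbits) * sizeof(long));

        /* Clearing with memset() must likewise cover whole longs: */
        memset(map, 0, BITS_TO_LONGS(nbits) * sizeof(long));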

    Signed-off-by: Akinobu Mita
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • When we have a page fault for an address that is backed by a hugepage
    under migration, the kernel can't wait correctly and busy-loops on the
    hugepage fault until the migration finishes. As a result, users who
    try to kick hugepage migration (via soft offlining, for example)
    occasionally experience long delays or soft lockups.

    This is because pte_offset_map_lock() can't get a correct migration
    entry or a correct page table lock for a hugepage. This patch
    introduces migration_entry_wait_huge() to solve this.
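
    The shape of the new helper, as a hedged sketch: the internal wait
    routine is parameterized on the page-table lock, and the hugetlb path
    passes the lock that actually covers its pte (mm->page_table_lock in
    this era of the kernel):

        void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
        {
                spinlock_t *ptl = &mm->page_table_lock;

                __migration_entry_wait(mm, pte, ptl);
        }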

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: KOSAKI Motohiro
    Cc: [2.6.35+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The watermark check consists of two sub-checks. The first one is:

        if (free_pages <= min + lowmem_reserve)
            return false;

    It ensures that the zone keeps a minimal amount of free RAM. If CMA
    is used, free_pages is reduced by the number of free CMA pages prior
    to this check:

        if (!(alloc_flags & ALLOC_CMA))
            free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

    This prevents the zone from being drained of pages available for
    non-movable allocations.

    The second check prevents the zone from getting too fragmented:

        for (o = 0; o < order; o++) {
            free_pages -= z->free_area[o].nr_free << o;
            min >>= 1;
            if (free_pages <= min)
                return false;
        }

    The field z->free_area[o].nr_free is equal to the number of free pages
    including free CMA pages, so the CMA pages are subtracted twice. This
    may cause a false positive failure of __zone_watermark_ok() if the CMA
    area gets strongly fragmented. In such a case there are many 0-order
    free pages located in CMA; those pages are subtracted twice and
    therefore quickly drain free_pages during the fragmentation check, so
    the test fails even though there are many free non-CMA pages in the
    zone.

    This patch fixes the issue by subtracting CMA pages only for the
    purpose of the (free_pages <= min + lowmem_reserve) check.
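
    The shape of the fix, as a simplified sketch of the patched
    __zone_watermark_ok(): keep free_pages intact and subtract CMA only in
    the first comparison:

        long free_cma = 0;

        if (!(alloc_flags & ALLOC_CMA))
                free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);

        if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
                return false;
        /* the fragmentation loop below sees the undiminished free_pages */
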
    Signed-off-by: Kyungmin Park
    Tested-by: Laura Abbott
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Tested-by: Marek Szyprowski
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomasz Stanislawski
     
  • read_swap_cache_async() can race against get_swap_page(), and stumble
    across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
    into the swapcache yet.

This swap_map state is expected to be transitory, but the placement of
    discard at scan_swap_map() inserts a wait for I/O completion, making
    the thread at read_swap_cache_async() loop around its -EEXIST case
    while the other end, at get_swap_page(), is scheduled away at
    scan_swap_map(). This can leave the system deadlocked if the I/O
    completion happens to be waiting on the CPU waitqueue where
    read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

    This patch introduces a cond_resched() call so that the aforementioned
    read_swap_cache_async() busy loop bails out when necessary, avoiding
    the subtle race window.
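
    Schematically (a simplified view of the retry loop in
    read_swap_cache_async(), not the full function):

        for (;;) {
                int err = swapcache_prepare(entry);

                if (!err)
                        break;          /* we own SWAP_HAS_CACHE: add the page */
                if (err != -EEXIST)
                        return NULL;    /* the slot went away entirely */
                /*
                 * Raced with get_swap_page(): SWAP_HAS_CACHE is set but
                 * the page is not in the swapcache yet. Yield so the other
                 * thread can finish instead of spinning on !CONFIG_PREEMPT.
                 */
                cond_resched();
        }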

    Signed-off-by: Rafael Aquini
    Acked-by: Johannes Weiner
    Acked-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Shaohua Li
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • struct memcg_cache_params has a union. Different parts of this union
    are used for root and non-root caches; the part with the destroying
    work is used only for non-root caches.
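
    For reference, the union looks roughly like this in 3.9/3.10
    (abridged); initializing the destroy work on a root cache scribbles
    over the memcg_caches[] part and produces crashes like the trace below:

        struct memcg_cache_params {
                bool is_root_cache;
                union {
                        struct kmem_cache *memcg_caches[0]; /* root caches */
                        struct {                        /* non-root caches */
                                struct mem_cgroup *memcg;
                                struct list_head list;
                                struct kmem_cache *root_cache;
                                bool dead;
                                atomic_t nr_pages;
                                struct work_struct destroy;
                        };
                };
        };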

    BUG: unable to handle kernel paging request at 0000000fffffffe0
    IP: kmem_cache_alloc+0x41/0x1f0
    Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
    CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G D 3.10.0-rc1+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: kmem_cache_alloc+0x41/0x1f0
    Call Trace:
    getname_flags.part.34+0x30/0x140
    getname+0x38/0x60
    do_sys_open+0xc5/0x1e0
    SyS_open+0x22/0x30
    system_call_fastpath+0x16/0x1b
    Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
    RIP [] kmem_cache_alloc+0x41/0x1f0

    Signed-off-by: Andrey Vagin
    Cc: Konstantin Khlebnikov
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Li Zefan
    Cc: [3.9.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

06 Jun, 2013

1 commit

Since the introduction of preemptible mmu_gather, TLB fast mode has been
    broken. TLB fast mode relies on there being absolutely no concurrency;
    it frees pages first and invalidates TLBs later.

    However now we can get concurrency and stuff goes *bang*.

This patch removes all tlb_fast_mode() code; it was found to be the
    better option versus trying to patch the hole by entangling TLB
    invalidation with the scheduler.
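
    Illustratively (not actual code), the fast-mode ordering was:

        /* tlb_fast_mode() assumed no concurrent users of this mm: */
        free_pages_and_swap_cache(tlb->pages, tlb->nr); /* reusable here... */
        tlb_flush(tlb);  /* ...but stale TLB entries still map them until here */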

    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Tony Luck
    Reported-by: Max Filippov
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

25 May, 2013

6 commits

  • A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
    application has a VM_PFNMAP range. It happened in-house when a
    benchmarker was trying to decipher the memory layout of his program.

    /proc/<pid>/smaps and similar walks through a user page table should
    not be looking at VM_PFNMAP areas.

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
    assume that all the mapped PFNs are backed with page structures, and
    this is not usually true for VM_PFNMAP areas. This can result in
    panics on kernel page faults when attempting to address those page
    structures.

    There are a half dozen callers of walk_page_range() that walk through a
    task's entire page table (as N. Horiguchi pointed out). So rather than
    change all of them, this patch changes just walk_page_range() to ignore
    VM_PFNMAP areas.

    The logic of hugetlb_vma() is moved back into walk_page_range(), as we
    want to test any vma in the range.
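
    The shape of the check, as a simplified sketch of the patched
    walk_page_range():

        vma = find_vma(walk->mm, addr);
        if (vma && (vma->vm_flags & VM_PFNMAP)) {
                /* No struct pages behind these PFNs: skip the whole vma. */
                next = min(end, vma->vm_end);
                continue;
        }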

    VM_PFNMAP areas are used by:
    - graphics memory manager gpu/drm/drm_gem.c
    - global reference unit sgi-gru/grufile.c
    - sgi special memory char/mspec.c
    - and probably several out-of-tree modules

    [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
    Signed-off-by: Cliff Wickman
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: David Sterba
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Fix printk format warnings in mm/memory_hotplug.c by using "%pa":

    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'resource_size_t' [-Wformat]
    mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'resource_size_t' [-Wformat]
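
    For example ("%pa" prints a phys_addr_t/resource_size_t and takes a
    pointer to the value; res here stands in for the hotplug code's
    struct resource):

        resource_size_t start = res->start;

        printk(KERN_INFO "onlining memory starting at %pa\n", &start);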

    Signed-off-by: Randy Dunlap
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • We should not use set_pmd_at() to update a pmd_t with a pgtable_t
    pointer. set_pmd_at() is used to set a pmd with huge pte entries, and
    architectures like ppc64 clear a few flags from the pte when saving a
    new entry. Without this change we observe bad pte errors like the one
    below on ppc64 with THP enabled.

    BUG: Bad page map in process ld mm=0xc000001ee39f4780 pte:7fc3f37848000001 pmd:c000001ec0000000
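
    The distinction, sketched (illustrative, not the exact hunk):

        set_pmd_at(mm, haddr, pmd, entry); /* entry: a huge-pte (pmd) value */
        pmd_populate(mm, pmd, pgtable);    /* pgtable: a page-table page, so
                                              the arch can do its encoding */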

    Signed-off-by: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Page 'new' during MIGRATION can't be flushed with flush_cache_page().
    Using flush_cache_page(vma, addr, pfn) is justified only if the page
    is already placed in the process page table, and that placement
    happens right after the flush_cache_page() call. Without it, the arch
    function has no knowledge of the process PTE and does nothing.

    Besides that, flush_cache_page() flushes an application cache page,
    but the kernel dirtied the page through a different virtual address.

    Replace it with flush_dcache_page(new) which is the proper usage.

    The old page is flushed in try_to_unmap_one() before migration.
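
    Sketched against migrate_page_copy() (simplified):

        copy_highpage(newpage, page); /* kernel dirties 'new' via its own mapping */
        flush_dcache_page(newpage);   /* flush that kernel view; no user PTE yet */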

This bug was observed on a Sead3 board with an M14Kc MIPS CPU without
    cache aliasing (but a Harvard architecture with separate I and D
    caches) in a tight memory environment (128MB), hitting every 1-3 days
    of SOAK testing: cc1 fails during kernel builds (SIGILL, SIGBUS,
    SIGSEGV) when CONFIG_COMPACTION is switched on.

    Signed-off-by: Leonid Yegoshin
    Cc: Leonid Yegoshin
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: David Miller
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leonid Yegoshin
     
  • Commit 0c59b89c81ea ("mm: memcg: push down PageSwapCache check into
    uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
    uncharge path after checking that page flag once, assuming that the
    state is stable in all paths, but this is not the case and the condition
    triggers in user environments. An uncharge after the last page table
    reference to the page goes away can race with reclaim adding the page to
    swap cache.

    Swap cache pages are usually uncharged when they are freed after
    swapout, from a path that also handles swap usage accounting and memcg
    lifetime management. However, since the last page table reference is
    gone and thus no references to the swap slot left, the swap slot will be
    freed shortly when reclaim attempts to write the page to disk. The
    whole swap accounting is not even necessary.

    So while the race condition for which this VM_BUG_ON was added is real
    and actually existed all along, there are no negative effects. Remove
    the VM_BUG_ON again.

    Reported-by: Heiko Carstens
    Reported-by: Lingzhu Xiang
    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 751efd8610d3 ("mmu_notifier_unregister NULL Pointer deref and
    multiple ->release()") breaks the fix 3ad3d901bbcf ("mm: mmu_notifier:
    fix freed page still mapped in secondary MMU").

Since hlist_for_each_entry_rcu() has changed in the meantime, we cannot
    revert that patch directly, so this patch reverts the commit and
    simply fixes the bug spotted by it.

    The bug spotted by commit 751efd8610d3 is:

    There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result
    of a filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

                A                               B
    t1                                    srcu_read_lock()
    t2          if (!hlist_unhashed())
    t3                                    srcu_read_unlock()
    t4          srcu_read_lock()
    t5                                    hlist_del_init_rcu()
    t6                                    synchronize_srcu()
    t7          srcu_read_unlock()
    t8          hlist_del_rcu()           <--- NULL pointer deref.

    This can be fixed by using hlist_del_init_rcu() instead of
    hlist_del_rcu().

    The other issue spotted by that commit, "multiple ->release()
    callouts", needs no urgent care here: it is really rare (e.g. it
    cannot happen on kvm, since mmu-notify is unregistered after
    exit_mmap()), and a later duplicate ->release should be fast since all
    the pages have already been released by the first call. Anyway, this
    issue should be fixed in a separate patch.

-stable suggestions: any version that has commit 751efd8610d3 needs
    this backported. The oldest stable series containing that commit is
    3.0-stable.
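
    Schematically, the crux of the fix (hedged, not the full diff):

        /*
         * hlist_del_init_rcu() leaves the node unhashed, so whichever of
         * unregister/release runs second sees hlist_unhashed() and backs
         * off; hlist_del_rcu() poisons ->pprev, so a second deletion
         * would dereference garbage.
         */
        hlist_del_init_rcu(&mn->hlist);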

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Xiao Guangrong
    Tested-by: Robin Holt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     

22 May, 2013

1 commit

  • virt_to_page() is typically implemented as a macro containing a cast so
    that it will accept both pointers and unsigned long without causing a
    warning.

    But MIPS virt_to_page() uses virt_to_phys which is a function so passing
    an unsigned long will cause a warning:

    CC mm/page_alloc.o
    mm/page_alloc.c: In function ‘free_reserved_area’:
    mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
    arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’

    All others users of virt_to_page() in mm/ are passing a void *.
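
    The fix on the caller side is just a cast (free_reserved_area()
    iterates addresses as unsigned long):

        struct page *page = virt_to_page((void *)addr);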

    Signed-off-by: Ralf Baechle
    Reported-by: Eunbong Song
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

10 May, 2013

1 commit

  • Dave reported an oops triggered by trinity:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: newseg+0x10d/0x390
    PGD cf8c1067 PUD cf8c2067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
    ...
    Call Trace:
    ipcget+0x182/0x380
    SyS_shmget+0x5a/0x60
    tracesys+0xdd/0xe2

    This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
    failure in unaligned size request").

    Reported-by: Dave Jones
    Cc:
    Signed-off-by: Li Zefan
    Reviewed-by: Naoya Horiguchi
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Li Zefan
     

09 May, 2013

2 commits

  • Commit 8a965b3baa89 ("mm, slab_common: Fix bootstrap creation of kmalloc
    caches") introduced a regression that caused us to crash early during
boot. The commit introduced ordering of slab creation, making sure two
    odd-sized slabs were created after specific power-of-two sizes.

    But, if any of the power of two slabs were created earlier during boot,
    slabs at index 1 or 2 might not get created at all. This patch makes
    sure none of the slabs get skipped.

    Tony Lindgren bisected this down to the offending commit, which really
    helped because bisect kept bringing me to almost but not quite this one.
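
    The resulting bootstrap loop looks roughly like this (simplified from
    create_kmalloc_caches() in mm/slab_common.c):

        for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
                if (!kmalloc_caches[i])
                        kmalloc_caches[i] =
                                create_kmalloc_cache(NULL, flags, 1 << i);

                /*
                 * The odd-sized caches must exist before any larger power
                 * of two is used, but must not be skipped if the power-of-
                 * two cache was already created during early bootstrap.
                 */
                if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
                        kmalloc_caches[1] =
                                create_kmalloc_cache(NULL, flags, 96);
                if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
                        kmalloc_caches[2] =
                                create_kmalloc_cache(NULL, flags, 192);
        }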

    Signed-off-by: Chris Mason
    Acked-by: Christoph Lameter
    Acked-by: Tony Lindgren
    Acked-by: Soren Brinkmann
    Tested-by: Tetsuo Handa
    Tested-by: Konrad Rzeszutek Wilk
    Signed-off-by: Linus Torvalds

    Chris Mason
     
  • Pull block core updates from Jens Axboe:

- Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

- Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

- Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

5 commits

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     
  • Bunch of performance improvements and cleanups Zach Brown and I have
    been working on. The code should be pretty solid at this point, though
    it could of course use more review and testing.

    The results in my testing are pretty impressive, particularly when an
    ioctx is being shared between multiple threads. In my crappy synthetic
    benchmark, with 4 threads submitting and one thread reaping completions,
    I saw overhead in the aio code go from ~50% (mostly ioctx lock
    contention) to low single digits. Performance with ioctx per thread
    improved too, but I'd have to rerun those benchmarks.

    The reason I've been focused on performance when the ioctx is shared is
    that for a fair number of real world completions, userspace needs the
    completions aggregated somehow - in practice people just end up
    implementing this aggregation in userspace today, but if it's done right
    we can do it much more efficiently in the kernel.

    Performance wise, the end result of this patch series is that submitting
    a kiocb writes to _no_ shared cachelines - the penalty for sharing an
    ioctx is gone there. There's still going to be some cacheline
    contention when we deliver the completions to the aio ringbuffer (at
    least if you have interrupts being delivered on multiple cores, which
    for high end stuff you do) but I have a couple more patches not in this
    series that implement coalescing for that (by taking advantage of
    interrupt coalescing). With that, there's basically no bottlenecks or
    performance issues to speak of in the aio code.

    This patch:

    use_mm() is used in more places than just aio. There's no need to mention
    callers when describing the function.

    Signed-off-by: Zach Brown
    Signed-off-by: Kent Overstreet
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as-is, without being aligned
    to a hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where the alignment code was pushed
    into hugetlb_file_setup() while the variable len on the caller side was
    left unchanged.

    To fix this, this patch partially reverts that commit and adds the
    alignment code on the caller side. It also introduces hstate_sizelog()
    in order to get the proper hstate for the specified hugepage size.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881
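
    The new helper, roughly (zero means "use the default hugepage size"):

        static inline struct hstate *hstate_sizelog(int page_size_log)
        {
                if (!page_size_log)
                        return &default_hstate;
                return size_to_hstate(1UL << page_size_log);
        }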

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This exports the amount of anonymous transparent hugepages for each
    memcg via the new "rss_huge" stat in memory.stat. The units are in
    bytes.

    This is helpful to determine the hugepage utilization for individual
    jobs on the system in comparison to rss and opportunities where
    MADV_HUGEPAGE may be helpful.

    The amount of anonymous transparent hugepages is also included in "rss"
    for backwards compatibility.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 May, 2013

3 commits

  • Pull slab changes from Pekka Enberg:
    "The bulk of the changes are more slab unification from Christoph.

There's also a few fixes from Aaron, Glauber, and Joonsoo thrown into
    the mix."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
    mm, slab_common: Fix bootstrap creation of kmalloc caches
    slab: Return NULL for oversized allocations
    mm: slab: Verify the nodeid passed to ____cache_alloc_node
    slub: tid must be retrieved from the percpu area of the current processor
    slub: Do not dereference NULL pointer in node_match
    slub: add 'likely' macro to inc_slabs_node()
    slub: correct to calculate num of acquired objects in get_partial_node()
    slub: correctly bootstrap boot caches
    mm/sl[au]b: correct allocation type check in kmalloc_slab()
    slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
    slab: Handle ARCH_DMA_MINALIGN correctly
    slab: Common definition for kmem_cache_node
    slab: Rename list3/l3 to node
    slab: Common Kmalloc cache determination
    stat: Use size_t for sizes instead of unsigned
    slab: Common function to create the kmalloc array
    slab: Common definition for the array of kmalloc caches
    slab: Common constants for kmalloc boundaries
    slab: Rename nodelists to node
    slab: Common name for the per node structures
    ...

    Linus Torvalds
     
  • Pekka Enberg
     
  • For SLAB the kmalloc caches must be created in ascending sizes in order
    for the OFF_SLAB sub-slab cache to work properly.

    Create the non power of two caches immediately after the prior power of
    two kmalloc cache. Do not create the non power of two caches before all
    other caches.

    Reported-and-tested-by: Tetsuo Handa
Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

06 May, 2013

1 commit

  • The inline path seems to have changed the SLAB behavior for very large
    kmalloc allocations with commit e3366016 ("slab: Use common
    kmalloc_index/kmalloc_size functions"). This patch restores the old
    behavior but also adds diagnostics so that we can figure where in the
    code these large allocations occur.
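
    The restored behavior, roughly (in kmalloc_slab(); WARN_ON_ONCE per
    the [penberg] note):

        if (unlikely(size > KMALLOC_MAX_SIZE)) {
                WARN_ON_ONCE(!(flags & __GFP_NOWARN));
                return NULL;    /* old SLAB behavior: oversized kmalloc fails */
        }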

    Reported-and-tested-by: Tetsuo Handa
    Signed-off-by: Christoph Lameter
    Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp
    [ penberg@kernel.org: use WARN_ON_ONCE ]
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Al Viro
     

01 May, 2013

10 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • If the nodeid is > num_online_nodes() this can cause an Oops and a
panic(). The purpose of this patch is to assert when this condition is
    true, to aid debugging, rather than crashing later on some random NULL
    pointer dereference or page fault.

    This patch is in response to BZ#42967 [1]. Using VM_BUG_ON so it's used
    only when CONFIG_DEBUG_VM is set, given that ____cache_alloc_node() is a
    hot code path.

    [1]: https://bugzilla.kernel.org/show_bug.cgi?id=42967
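
    The assertion itself, roughly:

        /* In ____cache_alloc_node(); active only with CONFIG_DEBUG_VM. */
        VM_BUG_ON(nodeid > num_online_nodes());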

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Aaron Tomlin
     
cleancache_ops is used to decide whether a backend is registered, so
    cleancache_enabled is now always true if CONFIG_CLEANCACHE is defined.

    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
Use the cleancache_ops pointer itself, instead of a separate
    backend_registered flag, to determine whether a backend is enabled.
    This allows us to remove the backend_registered check and just do
    'if (cleancache_ops)'.
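
    The resulting pattern, schematically (argument details elided):

        if (!cleancache_ops)
                return -1;      /* no backend registered: treat as a miss */
        ret = cleancache_ops->get_page(pool_id, key, index, page);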

    [v1: Rebase on top of b97c4b430b0a (ramster->zcache move]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
     
  • With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
    be built/loaded as modules rather than built-in and enabled by a boot
    parameter, this patch provides "lazy initialization", allowing backends
    to register to cleancache even after filesystems were mounted. Calls to
    init_fs and init_shared_fs are remembered as fake poolids but no real
    tmem_pools created. On backend registration the fake poolids are mapped
    to real poolids and respective tmem_pools.
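
    A hypothetical sketch of the bookkeeping (identifiers invented for
    illustration; the real code guards [shared_|]fs_poolid_map with a
    lock):

        /* init_fs before any backend exists: hand out a fake pool id. */
        fs_poolid_map[i] = FAKE_FS_POOLID;

        /* On backend registration: turn every fake id into a real pool. */
        for (i = 0; i < MAX_INITIALIZABLE_FS; i++)
                if (fs_poolid_map[i] == FAKE_FS_POOLID)
                        fs_poolid_map[i] = ops->init_fs(PAGE_SIZE);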

    Signed-off-by: Stefan Hengelein
    Signed-off-by: Florian Schmaus
    Signed-off-by: Andor Daam
    Signed-off-by: Dan Magenheimer
    [v1: Minor fixes: used #define for some values and bools]
    [v2: Removed CLEANCACHE_HAS_LAZY_INIT]
    [v3: Added more comments, added a lock for [shared_|]fs_poolid_map]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Magenheimer
     
  • The frontswap initialization routine depends on swap_lock, which wants
    to be atomic about frontswap's first appearance. IOW, either frontswap
    is not present and fails all calls, OR frontswap is fully functional;
    and since a new swap_info_struct isn't visible until registered by
    enable_swap_info, the swap subsystem doesn't start I/O on it before
    then, so there is no race between the init procedure and page I/O
    working on frontswap.

    So let's remove the unnecessary swap_lock dependency.

    Cc: Dan Magenheimer
    Signed-off-by: Minchan Kim
    [v1: Rebased on my branch, reworked to work with backends loading late]
    [v2: Added a check for !map]
    [v3: Made the invalidate path follow the init path]
    [v4: Address comments by Wanpeng Li ]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Florian Schmaus
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • After allowing tmem backends to be built/loaded as modules,
    frontswap_enabled is always true if CONFIG_FRONTSWAP is defined. But
    frontswap_test() depends on whether a backend is registered, so move
    it into frontswap.c and use frontswap_ops to make the decision.

    frontswap_set/clear are not used outside frontswap, so don't export
    them.
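
    The moved helper, roughly (now private to mm/frontswap.c):

        static bool __frontswap_test(struct swap_info_struct *sis,
                                     pgoff_t offset)
        {
                if (frontswap_ops && sis->frontswap_map)
                        return test_bit(offset, sis->frontswap_map);
                return false;
        }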

    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • This simplifies the code in the frontswap - we can get rid of the
    'backend_registered' test and instead check against frontswap_ops.

    [v1: Rebase on top of 703ba7fe5e0 (ramster->zcache move]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Florian Schmaus
    Cc: Minchan Kim
    Cc: Stefan Hengelein
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
     
  • With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
    be built/loaded as modules rather than built-in and enabled by a boot
    parameter, this patch provides "lazy initialization", allowing backends
    to register to frontswap even after swapon was run. Before a backend
    registers all calls to init are recorded and the creation of tmem_pools
    delayed until a backend registers or until a frontswap store is
    attempted.

    Signed-off-by: Stefan Hengelein
    Signed-off-by: Florian Schmaus
    Signed-off-by: Andor Daam
    Signed-off-by: Dan Magenheimer
    [v1: Fixes per Seth Jennings suggestions]
    [v2: Removed FRONTSWAP_HAS_.. ]
    [v3: Fix up per Bob Liu recommendations]
    [v4: Fix up per Andrew's comments]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Bob Liu
    Cc: Wanpeng Li
    Cc: Dan Magenheimer
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Magenheimer
     
  • Pull trivial tree updates from Jiri Kosina:
    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
    code cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
    mm: Convert print_symbol to %pSR
    gfs2: Convert print_symbol to %pSR
    m32r: Convert print_symbol to %pSR
    iostats.txt: add easy-to-find description for field 6
    x86 cmpxchg.h: fix wrong comment
    treewide: Fix typo in printk and comments
    doc: devicetree: Fix various typos
    docbook: fix 8250 naming in device-drivers
    pata_pdc2027x: Fix compiler warning
    treewide: Fix typo in printks
    mei: Fix comments in drivers/misc/mei
    treewide: Fix typos in kernel messages
    pm44xx: Fix comment for "CONFIG_CPU_IDLE"
    doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
    mmzone: correct "pags" to "pages" in comment.
    kernel-parameters: remove outdated 'noresidual' parameter
    Remove spurious _H suffixes from ifdef comments
    sound: Remove stray pluses from Kconfig file
    radio-shark: Fix printk "CONFIG_LED_CLASS"
    doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
    ...

    Linus Torvalds
     

30 Apr, 2013

2 commits

  • Merge second batch of fixes from Andrew Morton:

    - various misc bits

    - some printk updates

    - a new "SRAM" driver.

    - MAINTAINERS updates

    - the backlight driver queue

    - checkpatch updates

    - a few init/ changes

    - a huge number of drivers/rtc changes

    - fatfs updates

    - some lib/idr.c work

    - some renaming of the random driver interfaces

    * emailed patches from Andrew Morton : (285 commits)
    net: rename random32 to prandom
    net/core: remove duplicate statements by do-while loop
    net/core: rename random32() to prandom_u32()
    net/netfilter: rename random32() to prandom_u32()
    net/sched: rename random32() to prandom_u32()
    net/sunrpc: rename random32() to prandom_u32()
    scsi: rename random32() to prandom_u32()
    lguest: rename random32() to prandom_u32()
    uwb: rename random32() to prandom_u32()
    video/uvesafb: rename random32() to prandom_u32()
    mmc: rename random32() to prandom_u32()
    drbd: rename random32() to prandom_u32()
    kernel/: rename random32() to prandom_u32()
    mm/: rename random32() to prandom_u32()
    lib/: rename random32() to prandom_u32()
    x86: rename random32() to prandom_u32()
    x86: pageattr-test: remove srandom32 call
    uuid: use prandom_bytes()
    raid6test: use prandom_bytes()
    sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controllers which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit including long standing ones like racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds