23 Nov, 2013

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
    slab internals similar to mm/slub.c. This reduces memory usage and
    improves performance:

    https://lkml.org/lkml/2013/10/16/155

    Rest of the changes are bug fixes from various people"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
    mm, slub: fix the typo in mm/slub.c
    mm, slub: fix the typo in include/linux/slub_def.h
    slub: Handle NULL parameter in kmem_cache_flags
    slab: replace non-existing 'struct freelist *' with 'void *'
    slab: fix to calm down kmemleak warning
    slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
    slab: rename slab_bufctl to slab_freelist
    slab: remove useless statement for checking pfmemalloc
    slab: use struct page for slab management
    slab: replace free and inuse in struct slab with newly introduced active
    slab: remove SLAB_LIMIT
    slab: remove kmem_bufctl_t
    slab: change the management method of free objects of the slab
    slab: use __GFP_COMP flag for allocating slab pages
    slab: use well-defined macro, virt_to_slab()
    slab: overloading the RCU head over the LRU for RCU free
    slab: remove cachep in struct slab_rcu
    slab: remove nodeid in struct slab
    slab: remove colouroff in struct slab
    slab: change return type of kmem_getpages() to struct page
    ...

    Linus Torvalds
     

22 Nov, 2013

3 commits

  • Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:

    mm/mempolicy.c: In function 'mpol_to_str':
    mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments

    Kees says this is because he is using -Wformat-security.

    Silence the warning.
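    The warning is the classic -Wformat-security pattern: a computed buffer
    is passed where a format string is expected. A hypothetical illustration
    (not the actual mm/mempolicy.c code) of the problem and the usual fix:

    /* Hypothetical sketch of the -Wformat-security pattern and its fix. */
    char buffer[64];
    int mode = 0;

    snprintf(buffer, sizeof(buffer), "policy=%d", mode);

    printk(buffer);         /* warns: format not a string literal */
    printk("%s", buffer);   /* fix: pass the data via a literal format */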

    Signed-off-by: David Rientjes
    Reported-by: Fengguang Wu
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause a dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);
            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
            ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.
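    A rough sketch of the resulting copy_huge_page() shape, as assumed from
    the description above (not necessarily the exact merged code):

    /* Sketch: hugetlbfs pages get their size from the hstate, THP pages are
     * detected explicitly and sized from their page order. */
    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i, nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
                    if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                            /* gigantic-page copy path omitted in this sketch */
                            return;
                    }
            } else {
                    /* THP page: no hstate involved */
                    BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }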

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

21 Nov, 2013

1 commit

  • This reverts commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50.

    Al points out that while the commit *does* actually create a separate
    slab for the page->ptl allocation, that slab is never actually used, and
    the code continues to use kmalloc/kfree.

    Damien Wyart points out that the original patch did have the conversion
    to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

    Revert the half-arsed attempt that didn't do anything. If we really do
    want the special slab (remember: this is all relevant just for debug
    builds, so it's not necessarily all that critical) we might as well redo
    the patch fully.

    Reported-by: Al Viro
    Acked-by: Andrew Morton
    Cc: Kirill A Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual earth-shaking, news-breaking, rocket science pile from
    trivial.git"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
    doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
    doc: add missing files to timers/00-INDEX
    timekeeping: Fix some trivial typos in comments
    mm: Fix some trivial typos in comments
    irq: Fix some trivial typos in comments
    NUMA: fix typos in Kconfig help text
    mm: update 00-INDEX
    doc: Documentation/DMA-attributes.txt fix typo
    DRM: comment: `halve' -> `half'
    Docs: Kconfig: `devlopers' -> `developers'
    doc: typo on word accounting in kprobes.c in mutliple architectures
    treewide: fix "usefull" typo
    treewide: fix "distingush" typo
    mm/Kconfig: Grammar s/an/a/
    kexec: Typo s/the/then/
    Documentation/kvm: Update cpuid documentation for steal time and pv eoi
    treewide: Fix common typo in "identify"
    __page_to_pfn: Fix typo in comment
    Correct some typos for word frequency
    clk: fixed-factor: Fix a trivial typo
    ...

    Linus Torvalds
     

15 Nov, 2013

13 commits

  • This patch enhances the type safety for the kfifo API. It is now safe
    to put const data into a non const FIFO and the API will now generate a
    compiler warning when reading from the fifo where the destination
    address is pointing to a const variable.

    As a side effect, kfifo_put() now expects the value of an element
    instead of a pointer to the element. This was suggested by Russell King.
    It makes the handling of kfifo_put() easier since there is no need to
    create a helper variable for getting the address of a pointer or to pass
    integers of different sizes.

    IMHO the API break is okay, since there are currently only six users of
    kfifo_put().

    The code is also cleaner by kicking out the "if (0)" expressions.
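    A small usage sketch of the new calling convention (illustrative only;
    process_byte() is a placeholder):

    DECLARE_KFIFO(fifo, unsigned char, 128);
    unsigned char in = 0x42, out;

    INIT_KFIFO(fifo);
    kfifo_put(&fifo, in);               /* new API: pass the value itself */
    /* old API: kfifo_put(&fifo, &in);  -- pointer to the element */
    if (kfifo_get(&fifo, &out))
            process_byte(out);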

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stefani Seibold
    Cc: Russell King
    Cc: Hauke Mehrtens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     
  • If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
    is 72 bytes. For page->ptl it will be allocated from the kmalloc-96 slab,
    so we lose 24 bytes on each. An average system can easily allocate a few
    tens of thousands of page->ptl and the overhead is significant.

    Let's create a separate slab for page->ptl allocation to solve this.
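    A sketch of what the dedicated cache looks like (illustrative of the
    approach, not a quote of the patch):

    static struct kmem_cache *page_ptl_cachep;

    void __init ptlock_cache_init(void)
    {
            page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
                                                0, SLAB_PANIC, NULL);
    }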

    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use kernel/bounds.c to convert the build-time spinlock_t size check into
    a preprocessor symbol and apply that to properly separate the page::ptl
    situation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Kirill A. Shutemov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • If the split page table lock is in use, we embed the lock into the
    struct page of the table's page. We have to disable the split lock if
    spinlock_t is too big to be embedded, as when DEBUG_SPINLOCK or
    DEBUG_LOCK_ALLOC is enabled.

    This patch adds support for dynamic allocation of the split page table
    lock if we can't embed it into struct page.

    page->ptl is unsigned long now and we use it as a spinlock_t if
    sizeof(spinlock_t) <= sizeof(unsigned long); otherwise the spinlock_t is
    allocated dynamically and page->ptl stores a pointer to it.
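    Roughly, the lock accessor ends up looking like this (a sketch; the
    config symbol name is illustrative):

    #if ALLOC_SPLIT_PTLOCKS            /* sizeof(spinlock_t) > sizeof(long) */
    static inline spinlock_t *ptlock_ptr(struct page *page)
    {
            return (spinlock_t *)page->ptl;    /* ptl holds a pointer */
    }
    #else
    static inline spinlock_t *ptlock_ptr(struct page *page)
    {
            return (spinlock_t *)&page->ptl;   /* lock embedded in the word */
    }
    #endif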

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The basic idea is the same as with PTE level: the lock is embedded into
    struct page of table's page.

    We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
    take mm->page_table_lock anymore. Let's reuse page->lru of table's page
    for that.

    pgtable_pmd_page_ctor() returns true if initialization is successful and
    false otherwise. The current implementation never fails, but the
    assumption that the constructor can fail will help to port it to -rt,
    where spinlock_t is rather huge and cannot be embedded into struct page
    -- dynamic allocation is required.
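    The constructor contract, as a sketch (the lock-init helper name is
    assumed):

    /* Sketch: fails only if the (possibly dynamic) ptl setup fails; the
     * current embedded-lock implementation always succeeds. */
    static inline bool pgtable_pmd_page_ctor(struct page *page)
    {
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            page->pmd_huge_pte = NULL;         /* list of deposited pgtables */
    #endif
            return ptlock_init(page);          /* assumed lock-init helper */
    }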

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently mm->pmd_huge_pte is protected by the page table lock. That
    will not work with a split lock. We have to have a per-pmd pmd_huge_pte
    for proper access serialization.

    For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
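    For the non-split case the wrapper can stay trivial, along the lines of
    (sketch):

    /* Sketch: with a single per-mm lock, every pmd shares the same deposit
     * list, so the pmd argument is ignored for now. */
    #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)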

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock we can't know which lock we need to take
    before we find the relevant pmd.

    Let's move lock taking inside the function.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new API is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are
    different.
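    A caller under the new convention looks roughly like this (sketch;
    do_work_on_huge_pmd() is a placeholder):

    spinlock_t *ptl;

    if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
            /* pmd is a stable huge pmd; ptl is whichever lock was taken */
            do_work_on_huge_pmd(pmd);          /* placeholder */
            spin_unlock(ptl);
    }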

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
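    The conversion itself is mechanical; a before/after sketch:

    /* Before (sketch): only safe under mm->page_table_lock */
    mm->nr_ptes++;

    /* After (sketch): atomic updates, no page_table_lock required */
    atomic_long_inc(&mm->nr_ptes);
    nr = atomic_long_read(&mm->nr_ptes);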

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Alex Thorlton noticed that some massively threaded workloads work poorly
    if THP is enabled. This patchset fixes this by introducing a split page
    table lock for PMD tables. hugetlbfs is not covered yet.

    This patchset is based on work by Naoya Horiguchi.

    : akpm result summary:
    :
    : THP off, v3.12-rc2: 18.059261877 seconds time elapsed
    : THP off, patched: 16.768027318 seconds time elapsed
    :
    : THP on, v3.12-rc2: 42.162306788 seconds time elapsed
    : THP on, patched: 8.397885779 seconds time elapsed
    :
    : HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
    : HUGETLB, patched: 19.447481153 seconds time elapsed

    THP off, v3.12-rc2:
    -------------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    1037072.835207 task-clock # 57.426 CPUs utilized ( +- 3.59% )
    95,093 context-switches # 0.092 K/sec ( +- 3.93% )
    140 cpu-migrations # 0.000 K/sec ( +- 5.28% )
    10,000,550 page-faults # 0.010 M/sec ( +- 0.00% )
    2,455,210,400,261 cycles # 2.367 GHz ( +- 3.62% ) [83.33%]
    2,429,281,882,056 stalled-cycles-frontend # 98.94% frontend cycles idle ( +- 3.67% ) [83.33%]
    1,975,960,019,659 stalled-cycles-backend # 80.48% backend cycles idle ( +- 3.88% ) [66.68%]
    46,503,296,013 instructions # 0.02 insns per cycle
    # 52.24 stalled cycles per insn ( +- 3.21% ) [83.34%]
    9,278,997,542 branches # 8.947 M/sec ( +- 4.00% ) [83.34%]
    89,881,640 branch-misses # 0.97% of all branches ( +- 1.17% ) [83.33%]

    18.059261877 seconds time elapsed ( +- 2.65% )

    THP on, v3.12-rc2:
    ------------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    3114745.395974 task-clock # 73.875 CPUs utilized ( +- 1.84% )
    267,356 context-switches # 0.086 K/sec ( +- 1.84% )
    99 cpu-migrations # 0.000 K/sec ( +- 1.40% )
    58,313 page-faults # 0.019 K/sec ( +- 0.28% )
    7,416,635,817,510 cycles # 2.381 GHz ( +- 1.83% ) [83.33%]
    7,342,619,196,993 stalled-cycles-frontend # 99.00% frontend cycles idle ( +- 1.88% ) [83.33%]
    6,267,671,641,967 stalled-cycles-backend # 84.51% backend cycles idle ( +- 2.03% ) [66.67%]
    117,819,935,165 instructions # 0.02 insns per cycle
    # 62.32 stalled cycles per insn ( +- 4.39% ) [83.34%]
    28,899,314,777 branches # 9.278 M/sec ( +- 4.48% ) [83.34%]
    71,787,032 branch-misses # 0.25% of all branches ( +- 1.03% ) [83.33%]

    42.162306788 seconds time elapsed ( +- 1.73% )

    HUGETLB, v3.12-rc2:
    -------------------

    Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

    2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
    246,831 context-switches # 0.095 K/sec ( +- 4.15% )
    138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
    21,027 page-faults # 0.008 K/sec ( +- 0.01% )
    6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
    6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
    5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
    133,782,831,249 instructions # 0.02 insns per cycle
    # 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
    34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
    68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]

    47.574936948 seconds time elapsed ( +- 2.09% )

    THP off, patched:
    -----------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    943301.957892 task-clock # 56.256 CPUs utilized ( +- 3.01% )
    86,218 context-switches # 0.091 K/sec ( +- 3.17% )
    121 cpu-migrations # 0.000 K/sec ( +- 6.64% )
    10,000,551 page-faults # 0.011 M/sec ( +- 0.00% )
    2,230,462,457,654 cycles # 2.365 GHz ( +- 3.04% ) [83.32%]
    2,204,616,385,805 stalled-cycles-frontend # 98.84% frontend cycles idle ( +- 3.09% ) [83.32%]
    1,778,640,046,926 stalled-cycles-backend # 79.74% backend cycles idle ( +- 3.47% ) [66.69%]
    45,995,472,617 instructions # 0.02 insns per cycle
    # 47.93 stalled cycles per insn ( +- 2.51% ) [83.34%]
    9,179,700,174 branches # 9.731 M/sec ( +- 3.04% ) [83.35%]
    89,166,529 branch-misses # 0.97% of all branches ( +- 1.45% ) [83.33%]

    16.768027318 seconds time elapsed ( +- 2.47% )

    THP on, patched:
    ----------------

    Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    458793.837905 task-clock # 54.632 CPUs utilized ( +- 0.79% )
    41,831 context-switches # 0.091 K/sec ( +- 0.97% )
    98 cpu-migrations # 0.000 K/sec ( +- 1.66% )
    57,829 page-faults # 0.126 K/sec ( +- 0.62% )
    1,077,543,336,716 cycles # 2.349 GHz ( +- 0.81% ) [83.33%]
    1,067,403,802,964 stalled-cycles-frontend # 99.06% frontend cycles idle ( +- 0.87% ) [83.33%]
    864,764,616,143 stalled-cycles-backend # 80.25% backend cycles idle ( +- 0.73% ) [66.68%]
    16,129,177,440 instructions # 0.01 insns per cycle
    # 66.18 stalled cycles per insn ( +- 7.94% ) [83.35%]
    3,618,938,569 branches # 7.888 M/sec ( +- 8.46% ) [83.36%]
    33,242,032 branch-misses # 0.92% of all branches ( +- 2.02% ) [83.32%]

    8.397885779 seconds time elapsed ( +- 0.18% )

    HUGETLB, patched:
    -----------------

    Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

    395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
    55,730 context-switches # 0.141 K/sec ( +- 5.31% )
    138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
    21,027 page-faults # 0.053 K/sec ( +- 0.00% )
    930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
    914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
    704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
    30,541,538,385 instructions # 0.03 insns per cycle
    # 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
    8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
    32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]

    19.447481153 seconds time elapsed ( +- 2.00% )

    This patch (of 11):

    CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes. It
    leads to an increase of sizeof(struct page) by 4 bytes on 32-bit systems
    if the split page table lock is in use, since page->ptl shares space in a
    union with longs and pointers.

    Let's disable the split page table lock on 32-bit systems with
    GENERIC_LOCKBREAK enabled.

    Signed-off-by: Kirill A. Shutemov
    Cc: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There's only one caller of do_generic_file_read() and the only actor is
    file_read_actor(). No reason to have a callback parameter.
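    The change amounts to dropping the actor argument; a sketch of the
    prototypes (illustrative, 3.12-era types):

    /* Before (sketch): actor supplied as a callback parameter */
    static void do_generic_file_read(struct file *filp, loff_t *ppos,
                                     read_descriptor_t *desc, read_actor_t actor);

    /* After (sketch): file_read_actor() is called directly inside */
    static void do_generic_file_read(struct file *filp, loff_t *ppos,
                                     read_descriptor_t *desc);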

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Reviewed-by: Wanpeng Li
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

14 Nov, 2013

2 commits

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlocks structures, this
    unearthed both bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
     
  • Pull block IO core updates from Jens Axboe:
    "This is the pull request for the core changes in the block layer for
    3.13. It contains:

    - The new blk-mq request interface.

    This is a new and more scalable queueing model that marries the
    best part of the request based interface we currently have (which
    is fully featured, but scales poorly) and the bio based "interface"
    which the new drivers for high IOPS devices end up using because
    it's much faster than the request based one.

    The bio interface has no block layer support, since it taps into
    the stack much earlier. This means that drivers end up having to
    implement a lot of functionality on their own, like tagging,
    timeout handling, requeue, etc. The blk-mq interface provides all
    these. Some drivers even provide a switch to select bio or rq and
    have code to handle both, since things like merging only work in
    the rq model and it is hence faster for some workloads. This is a
    huge mess. Conversion of these drivers nets us a substantial code
    reduction. Initial results on converting SCSI to this model even
    shows an 8x improvement on single queue devices. So while the
    model was intended to work on the newer multiqueue devices, it has
    substantial improvements for "classic" hardware as well. This code
    has gone through extensive testing and development, it's now ready
    to go. A pull request to convert virtio-blk to this model will be
    coming as well, with more drivers scheduled for 3.14 conversion.

    - Two blktrace fixes from Jan and Chen Gang.

    - A plug merge fix from Alireza Haghdoost.

    - Conversion of __get_cpu_var() from Christoph Lameter.

    - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

    - A fix for a race between request completion and the timeout
    handling from Jeff Moyer. This is what caused the merge conflict
    with blk-mq/core, in case you are looking at that.

    - A dm stacking fix from Mike Snitzer.

    - A code consolidation fix and duplicated code removal from Kent
    Overstreet.

    - A handful of block bug fixes from Mikulas Patocka, fixing a loop
    crash and memory corruption on blk cg.

    - Elevator switch bug fix from Tomoki Sekiyama.

    A heads-up that I had to rebase this branch. Initially the immutable
    bio_vecs had been queued up for inclusion, but a week later, it became
    clear that it wasn't fully cooked yet. So the decision was made to
    pull this out and postpone it until 3.14. It was a straightforward
    rebase, just pruning out the immutable series and the later fixes of
    problems with it. The rest of the patches applied directly and no
    further changes were made"

    * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    block: Do not call sector_div() with a 64-bit divisor
    kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
    block: Consolidate duplicated bio_trim() implementations
    block: Use rw_copy_check_uvector()
    block: Enable sysfs nomerge control for I/O requests in the plug list
    block: properly stack underlying max_segment_size to DM device
    elevator: acquire q->sysfs_lock in elevator_change()
    elevator: Fix a race in elevator switching and md device initialization
    block: Replace __get_cpu_var uses
    bdi: test bdi_init failure
    block: fix a probe argument to blk_register_region
    loop: fix crash if blk_alloc_queue fails
    blk-core: Fix memory corruption if blkcg_init_queue fails
    block: fix race between request completion and timeout handling
    blktrace: Send BLK_TN_PROCESS events to all running traces
    blk-mq: don't disallow request merges for req->special being set
    blk-mq: mq plug list breakage
    blk-mq: fix for flush deadlock
    ...

    Linus Torvalds
     

13 Nov, 2013

19 commits

  • Pull networking updates from David Miller:

    1) The addition of nftables. No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various data structures as
    fundamental operations. For example, sets are supported, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset. In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

    2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfilter, amongst other things.

    In particular the taus88 generator is updated to taus113, and test
    cases are added.

    3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

    4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

    5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

    6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

    7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option. From Eric Dumazet.

    8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

    9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too. Implementation from Shawn
    Bohrer.

    10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets. With the main goals being able
    to use RCU lookups on even request sockets, and eliminating the
    listening lock contention. From Eric Dumazet.

    11) The bonding layer has many demuxes in its fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations. From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

    12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels. From Eric Dumazet.

    13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies. From Hannes Frederic Sowa. The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

    14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more. Many drivers have been cleaned
    up in this way, from Jingoo Han.

    15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

    16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled. Also from Daniel
    Borkmann.

    17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value. This helps avoid PMTU attacks,
    particularly on DNS servers. From Hannes Frederic Sowa.

    18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net. From Jason Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    random32: add test cases for taus113 implementation
    random32: upgrade taus88 generator to taus113 from errata paper
    random32: move rnd_state to linux/random.h
    random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
    random32: add periodic reseeding
    random32: fix off-by-one in seeding requirement
    PHY: Add RTL8201CP phy_driver to realtek
    xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
    macmace: add missing platform_set_drvdata() in mace_probe()
    ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
    ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
    vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
    ixgbe: add warning when max_vfs is out of range.
    igb: Update link modes display in ethtool
    netfilter: push reasm skb through instead of original frag skbs
    ip6_output: fragment outgoing reassembled skb properly
    MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
    net_sched: tbf: support of 64bit rates
    ixgbe: deleting dfwd stations out of order can cause null ptr deref
    ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
    ...

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Not too much activity this time around. css_id is finally killed and
    a minor update to device_cgroup"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    device_cgroup: remove can_attach
    cgroup: kill css_id
    memcg: stop using css id
    memcg: fail to create cgroup if the cgroup id is too big
    memcg: convert to use cgroup id
    memcg: convert to use cgroup_is_descendant()

    Linus Torvalds
     
  • Pull percpu changes from Tejun Heo:
    "Two smallish changes for percpu. Two patches to remove unused
    this_cpu_xor() and one to fix a bug in percpu init failure path so
    that it can reach the proper BUG() instead of oopsing earlier"

    * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    x86: remove this_cpu_xor() implementation
    percpu: remove this_cpu_xor() implementation
    percpu: fix bootmem error handling in pcpu_page_first_chunk()

    Linus Torvalds
     
  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

                  3.12.0      3.12.0
                 vanilla acctupdates
    User         5169.64     5184.14
    System        100.45       80.02
    Elapsed       252.75      251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla acctupdates
    NUMA page range updates     1408326    11043064
    NUMA huge PMD updates             0       21040
    NUMA PTE updates            1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The same calculation is currently done in three different places.
    Factor that code so future changes have to be made in only one place.
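    The shared helper is essentially the standard overcommit formula; a
    sketch (assuming the usual globals):

    /* Sketch: commit limit = (RAM - hugetlb pages) * overcommit_ratio% + swap */
    unsigned long vm_commit_limit(void)
    {
            return ((totalram_pages - hugetlb_total_pages())
                    * sysctl_overcommit_ratio / 100) + total_swap_pages;
    }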

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • The refcount routines did not fit the kernel get/put semantics exactly;
    there were too many judgement statements on the refcount and it could go
    negative.

    This patch does the following:

    - move refcount judgement to zswap_entry_put() to hide resource free function.

    - add a new function zswap_entry_find_get(), so that callers can use
    easily in the following pattern:

    zswap_entry_find_get
    .../* do something */
    zswap_entry_put

    - to eliminate compile error, move some functions declaration

    This patch is based on Minchan Kim 's idea and suggestion.
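    A sketch of the put side as described above (names follow the text, not
    necessarily the exact patch):

    /* Sketch: drop a reference and free the entry (and its compressed data)
     * once nobody holds it any more. */
    static void zswap_entry_put(struct zswap_tree *tree,
                                struct zswap_entry *entry)
    {
            int refcount = --entry->refcount;

            BUG_ON(refcount < 0);
            if (refcount == 0) {
                    rb_erase(&entry->rbnode, &tree->rbroot);
                    zswap_free_entry(tree, entry);
            }
    }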

    Signed-off-by: Weijie Yang
    Cc: Seth Jennings
    Acked-by: Minchan Kim
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Consider the following scenario:

    thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
    thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
    finished, entry x and its zbud is not freed as its refcount != 0
    now, the swap_map[x] = 0
    thread 0: now call zswap_get_swap_cache_page
    swapcache_prepare return -ENOENT because entry x is not used any more
    zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
    zswap_writeback_entry do nothing except put refcount

    Now, the memory of zswap_entry x and its zpage leak.

    Modify:
    - check the refcount in fail path, free memory if it is not referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be caused not only by nomem but also by invalidate.

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • We can't see the relationship with memcg from the parameters,
    so the name with memcg_idx would be more reasonable.

    Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • This patch fixes the problem that get_unmapped_area() can return illegal
    address and result in failing mmap(2) etc.

    In case an address higher than PAGE_SIZE is set in
    /proc/sys/vm/mmap_min_addr, an address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
    although any illegal parameter is not passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            void *ret = NULL, *last_map;
            size_t pagesize = sysconf(_SC_PAGESIZE);

            do {
                    last_map = ret;
                    ret = mmap(0, pagesize, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    // printf("ret=%p\n", ret);
            } while (ret != MAP_FAILED);

            if (errno != ENOMEM) {
                    printf("ERR: unexpected errno: %d (last map=%p)\n",
                           errno, last_map);
            }

            return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting pages back,
    __rmqueue_fallback() always uses start_migratetype if the type is not
    CMA. However, __rmqueue_fallback() is only called when all of the
    start_migratetype queue is empty. That said, __rmqueue_fallback() always
    puts memory back on the wrong queue except when try_to_steal_freepages()
    changed the pageblock type (i.e. the requested size is smaller than half
    of the page block). The end result is that the antifragmentation
    framework increases fragmentation instead of decreasing it.

    Mel's original anti fragmentation does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should have zero overhead if it is
    disabled. However, trace_mm_page_alloc_extfrag() is one exception. It
    evaluates "new_type == start_migratetype" even if the tracepoint is
    disabled.

    However, the code can be moved into the tracepoint's TP_fast_assign(),
    and TP_fast_assign() exists for exactly this purpose. This patch does
    that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
    the argument to MIGRATE_UNMOVABLE and we lose these attributes.

    The problem was introduced by commit 49255c619fbd ("page allocator: move
    check for disabled anti-fragmentation out of fastpath"). So a 4-year-old
    issue may mean that nobody uses page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.
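    The shape of the fix, as a sketch consistent with the description (the
    threshold constant is assumed):

    void set_pageblock_migratetype(struct page *page, int migratetype)
    {
            /* Only the ordinary allocator types collapse to UNMOVABLE; CMA
             * and ISOLATE keep their meaning even with grouping disabled. */
            if (unlikely(page_group_by_mobility_disabled &&
                         migratetype < MIGRATE_PCPTYPES))
                    migratetype = MIGRATE_UNMOVABLE;

            set_pageblock_flags_group(page, (unsigned long)migratetype,
                                      PB_migrate, PB_migrate_end);
    }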

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from the
    storage device (impacting random read average latency).

    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):

    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

    The current offset is stored in the "offset" variable of type "pgoff_t"
    (unsigned long), while the previous offset is stored in "ra->prev_pos" of
    type "loff_t" (long long). Therefore, the operands of the if statement
    are implicitly converted to long long. Consequently, when previous offset
    > current offset (which happens on a random pattern), the if condition is
    true and the access is wrongly interpreted as sequential. An unnecessary
    data prefetch is triggered, impacting the average random read latency.

    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.
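    With the previous offset held in a pgoff_t, the comparison stays in
    unsigned long arithmetic; a sketch of the detection:

    /* Sketch of the corrected check (unsigned arithmetic throughout) */
    pgoff_t prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;

    if (offset - prev_offset <= 1UL) {
            /* sequential (possibly page-straddling) miss: extend readahead */
    } else {
            /* random access: no readahead extension */
    }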

    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Damien Ramonda
     
  • Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daeseok Youn