02 Apr, 2008

1 commit

  • Fix a small typo in the recently merged patch that avoids the unused-symbol
    warning for count_partial(). The discussion thread confirming the fix is at
    http://marc.info/?t=120696854400001&r=1&w=2

    The typo is in the check, introduced by 53625b4204753b904addd40ca96d9ba802e6977d,
    that decides whether the count_partial() function is needed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

31 Mar, 2008

1 commit


28 Mar, 2008

1 commit


27 Mar, 2008

4 commits

  • Running the counters testcase from libhugetlbfs results in the following
    soft lockup on 2.6.25-rc5 and 2.6.25-rc5-mm1:

    BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
    NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
    REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
    MSR: 8000000000009032 CR: 48008448 XER: 20000000
    TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
    GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
    GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
    GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
    GPR12: 0000000028000442 c000000000770080
    NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
    LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
    Call Trace:
    [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
    [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
    [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
    [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
    [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
    [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
    [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
    [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
    [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
    [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
    [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
    [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
    [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
    [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
    Instruction dump:
    ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
    60000000 38800010 38a00000 7c601b78 2f800010 409d0008 38000010

    This was tracked down to a potential livelock in
    return_unused_surplus_pages(). In the case where we have surplus
    pages on some node, but no free pages on the same node, we may never
    break out of the loop. To avoid this livelock, terminate the search if
    we iterate a number of times equal to the number of online nodes without
    freeing a page.
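
    A minimal sketch of the bounded search described above (hypothetical code;
    the names follow mm/hugetlb.c of that era, but this is illustrative and
    not the exact diff):

        /* stop once every online node has been visited without freeing
         * a page, so a node with surplus pages but no free pages cannot
         * keep us spinning forever; nid and nr_pages come from the
         * surrounding function */
        unsigned long remaining_iterations = num_online_nodes();

        while (remaining_iterations-- && nr_pages) {
                nid = next_node(nid, node_online_map);
                if (nid == MAX_NUMNODES)
                        nid = first_node(node_online_map);

                if (!surplus_huge_pages_node[nid])
                        continue;

                if (free_one_surplus_page(nid)) {   /* illustrative helper */
                        nr_pages--;
                        /* made progress: allow another full pass */
                        remaining_iterations = num_online_nodes();
                }
        }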

    Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
    the patch.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Currently we show the surplus hugetlb pool state in /proc/meminfo, but
    not in the per-node meminfo files, even though we track the information
    on a per-node basis. Printing it there can help track down dynamic pool
    bugs including the one in the follow-on patch.
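
    A hedged sketch of the per-node report; hugetlb_report_node_meminfo() and
    the per-node counters exist in mm/hugetlb.c of that era, but the exact
    format of the added line is illustrative:

        int hugetlb_report_node_meminfo(int nid, char *buf)
        {
                return sprintf(buf,
                        "Node %d HugePages_Total: %5u\n"
                        "Node %d HugePages_Free:  %5u\n"
                        "Node %d HugePages_Surp:  %5u\n",  /* the new line */
                        nid, nr_huge_pages_node[nid],
                        nid, free_huge_pages_node[nid],
                        nid, surplus_huge_pages_node[nid]);
        }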

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Commit 556a169dab38b5100df6f4a45b655dddd3db94c1 ("slab: fix bootstrap on
    memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
    but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
    patch fixes list_add() corruption in mm/slab.c seen on the ES7000.
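
    A hedged illustration of the "[somevalue + node]" indexing; the array, the
    CACHE_CACHE base constant and init_list() come from mm/slab.c, but the
    surrounding loop is simplified and should not be read as the exact diff:

        /* bootstrap list3s live in one flat array, grouped per cache:
         *   [CACHE_CACHE + node], [SIZE_AC + node], [SIZE_L3 + node] */
        for_each_online_node(node) {
                /* wrong: every node ends up on slot 0 of the group      */
                /* init_list(&cache_cache, &initkmem_list3[CACHE_CACHE], node); */

                /* right: offset the per-cache base by the node number   */
                init_list(&cache_cache,
                          &initkmem_list3[CACHE_CACHE + node], node);
        }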

    Cc: Mel Gorman
    Cc: Olaf Hering
    Cc: Christoph Lameter
    Signed-off-by: Dan Yeisley
    Signed-off-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Daniel Yeisley
     
  • Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO
    is defined. This patch will be reverted when slab defrag is merged, since slab
    defrag requires count_partial() to determine the fragmentation status of
    slab caches.
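
    A hedged sketch of the resulting guard in mm/slub.c (the function body is
    reproduced from memory of that era's code, and the condition later needed
    the typo fix noted in the 02 Apr entry above):

        #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SLABINFO)
        /* only built when the sysfs/slabinfo code actually uses it */
        static unsigned long count_partial(struct kmem_cache_node *n)
        {
                unsigned long flags;
                unsigned long x = 0;
                struct page *page;

                spin_lock_irqsave(&n->list_lock, flags);
                list_for_each_entry(page, &n->partial, lru)
                        x += page->inuse;
                spin_unlock_irqrestore(&n->list_lock, flags);
                return x;
        }
        #endif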

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

25 Mar, 2008

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    [PATCH] get stack footprint of pathname resolution back to relative sanity
    [PATCH] double iput() on failure exit in hugetlb
    [PATCH] double dput() on failure exit in tiny-shmem
    [PATCH] fix up new filp allocators
    [PATCH] check for null vfsmount in dentry_open()
    [PATCH] reiserfs: eliminate private use of struct file in xattr
    [PATCH] sanitize hppfs
    hppfs pass vfsmount to dentry_open()
    [PATCH] restore export of do_kern_mount()

    Linus Torvalds
     
  • Revert commit f1a9ee758de7de1e040de849fdef46e6802ea117:

    Author: Rik van Riel
    Date: Thu Feb 7 00:14:08 2008 -0800

    kswapd should only wait on IO if there is IO

    The current kswapd (and try_to_free_pages) code has an oddity where the
    code will wait on IO, even if there is no IO in flight. This problem is
    notable especially when the system scans through many unfreeable pages,
    causing unnecessary stalls in the VM.

    Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
    will sleep if a significant number of pages are encountered that should be
    written out. This gives kswapd a chance to write out those pages, while
    the direct reclaim task sleeps.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reverted because of the large latencies and interactivity problems reported
    by Carlos here: http://lkml.org/lkml/2008/3/22/211

    Cc: Rik van Riel
    Cc: "Carlos R. Mafra"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • With NUMA enabled, a caller may hold a range of memory that lives on one
    node but try to free it on another node, which can cause some pages to be
    freed incorrectly.

    For example, we allocate 128G of boot RAM early for gart/swiotlb and free
    that range later so gart/swiotlb can get some range afterwards.

    With this patch, we don't need to care which node holds the range; we can
    just loop over all online nodes and call free_bootmem_node for each of
    them, as sketched below.

    The patch makes free_bootmem_core() more robust by trimming sidx and eidx
    to the RAM range that the node actually has, so free_bootmem_core handles
    the out-of-range case. By walking bdata_list we could make sure the range
    gets freed in any case, so in the future we wouldn't need to loop over
    online nodes at all and could use free_bootmem directly.
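
    A minimal sketch of the caller-side loop described above (illustrative;
    addr and size stand in for the gart/swiotlb range):

        /* free a range without knowing which node's bootmem holds it */
        int nid;

        for_each_online_node(nid)
                free_bootmem_node(NODE_DATA(nid), addr, size);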

    Signed-off-by: Yinghai Lu
    Cc: Andi Kleen
    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     

20 Mar, 2008

7 commits

  • Fix kernel-doc notation in mm/readahead.c.

    Change ":" to ";" so that it doesn't get treated as a doc section heading.
    Move the comment block ending "*/" to a line by itself so that the text on
    that last line is not lost (dropped).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • The check t->tgid == t->pid is not the blessed way to check whether a task
    is a group leader.

    This is not only about code beauty, but about the pid namespace fixes -
    both the tgid and the pid fields on the task_struct are (slowly :( )
    becoming deprecated.

    Besides, the thread_group_leader() macro makes only one dereference :)
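
    For illustration, the preferred form; thread_group_leader() is a real
    helper in <linux/sched.h>, while the surrounding snippet is hypothetical:

        /* open-coded check the patch removes */
        if (p->tgid == p->pid)
                handle_group_leader(p);         /* illustrative */

        /* blessed form: one dereference, no reliance on ->tgid/->pid */
        if (thread_group_leader(p))
                handle_group_leader(p);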

    Signed-off-by: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Correct kernel-doc function names and parameters in rmap.c.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Add kernel-doc comments to highmem.c.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix kernel-doc notation in oom_kill.c.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Convert tiny-shmem.c function comments to kernel-doc. Add parameters and
    convert/fix other kernel-doc in shmem.c.

    Signed-off-by: Randy Dunlap
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

19 Mar, 2008

1 commit


18 Mar, 2008

1 commit

  • The fallback path needs to enable interrupts, as is done for the other
    page allocator calls. This was not necessary with the alternate fast path,
    since we handled irq enable/disable in the slow path. The regular fastpath
    handles irq enable/disable around calls to the slow path, so we need to
    restore the proper irq state before calling the page allocator from the
    slowpath.
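
    A hedged sketch of the pattern, as it appears around SLUB's call into the
    slow path (illustrative, not the exact diff for the fallback path):

        if (gfpflags & __GFP_WAIT)
                local_irq_enable();           /* page allocator may sleep  */

        page = new_slab(s, gfpflags, node);   /* ends up in alloc_pages()  */

        if (gfpflags & __GFP_WAIT)
                local_irq_disable();          /* fastpath expects irqs off */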

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

11 Mar, 2008

3 commits

  • iov_iter_advance() skips over zero-length iovecs; however, it does not
    properly terminate at the end of the iovec array. Fix this by checking
    against i->count before we skip a zero-length iov.
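
    A hedged sketch of the guard; the real fix lives in the iovec-advancing
    loop in mm/filemap.c, and the names here are illustrative:

        /* advance over the iovec array; skip zero-length segments, but
         * stop once nothing is left to copy so we never run past the
         * end of the array */
        while (bytes || (i->count && !iov->iov_len)) {
                size_t copy = min(bytes, iov->iov_len - base);

                i->count -= copy;
                bytes -= copy;
                base += copy;
                if (iov->iov_len == base) {     /* segment consumed */
                        iov++;
                        base = 0;
                }
        }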

    The bug was reproduced with a test program that continually creates random
    iovecs and passes them to writev. The fix was verified with the same
    program, which also checked that the file contained the correct data after
    each writev.

    Signed-off-by: Nick Piggin
    Tested-by: "Kevin Coffman"
    Cc: "Alexey Dobriyan"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Free pages in the hugetlb pool are free and as such have a reference count of
    zero. Regular allocations into the pool from the buddy are "freed" into the
    pool, which results in their page_count dropping to zero. However, surplus
    pages can be directly utilized by the caller without first being freed to the
    pool. Therefore, a call to put_page_testzero() is in order so that such a
    page will be handed to the caller with a correct count.
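
    A hedged sketch of the idea in the surplus allocation path (the gfp mask
    and placement are illustrative):

        struct page *page = alloc_pages_node(nid, gfp_mask,
                                             HUGETLB_PAGE_ORDER);
        if (page) {
                /*
                 * The page comes from the buddy allocator with a count
                 * of 1, but hugetlb pool pages are expected to start at
                 * count 0.  Drop the buddy reference; nobody else can
                 * hold one yet.
                 */
                put_page_testzero(page);
                VM_BUG_ON(page_count(page));
        }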

    This has not affected end users because the bad page count is reset before the
    page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when
    the page count is validated.

    Thanks go to Mel for first spotting this issue and providing an initial fix.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Address 3 known bugs in the current memory policy reference counting method.
    I have a series of patches to rework the reference counting to reduce overhead
    in the allocation path. However, that series will require testing in -mm once
    I repost it.

    1) alloc_page_vma() does not release the extra reference taken for
    vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in
    leaking mempolicy structures. This is probably occurring, but not being
    noticed.

    Fix: add the conditional release of the reference.

    2) huge_zonelist() unconditionally releases a reference on the mempolicy
    when mode == MPOL_INTERLEAVE. This can result in decrementing the reference
    count for system default policy [should have no ill effect] or premature
    freeing of task policy. If this occurred, the next allocation using task
    mempolicy would use the freed structure and probably BUG out.

    Fix: add the necessary check to the release.

    3) The current reference counting method assumes that vma 'get_policy()'
    methods automatically add an extra reference to a non-NULL returned mempolicy.
    This is true for shmem_get_policy() used by tmpfs mappings, including
    regular page shm segments. However, SHM_HUGETLB shm's, backed by
    hugetlbfs, just use the vma policy without the extra reference. This
    results in freeing of the vma policy on the first allocation, with reuse of
    the freed mempolicy structure on subsequent allocations.

    Fix: Rather than add another condition to the conditional reference
    release, which occurs in the allocation path, just add a reference when
    returning the vma policy in shm_get_policy() to match the assumptions.
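
    A hedged sketch of fix 3 in shm_get_policy() (ipc/shm.c); the surrounding
    code is reconstructed from memory and should be read as illustrative:

        static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
                                                unsigned long addr)
        {
                struct shm_file_data *sfd = shm_file_data(vma->vm_file);
                struct mempolicy *pol = NULL;

                if (sfd->vm_ops->get_policy)
                        pol = sfd->vm_ops->get_policy(vma, addr);
                else if (vma->vm_policy) {
                        pol = vma->vm_policy;
                        mpol_get(pol);  /* callers assume a reference */
                }
                return pol;
        }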

    Signed-off-by: Lee Schermerhorn
    Cc: Greg KH
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

10 Mar, 2008

1 commit


07 Mar, 2008

5 commits

  • NUMA slab allocator cpu migration bugfix

    The NUMA slab allocator (specifically, cache_alloc_refill)
    is not refreshing its local copies of what cpu and what
    numa node it is on, when it drops and reacquires the irq
    block that it inherited from its caller. As a result
    those values become invalid if an attempt to migrate the
    process to another numa node occurred while the irq block
    had been dropped.

    The solution is to make cache_alloc_refill reload these
    variables whenever it drops and reacquires the irq block.
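
    A hedged sketch of the reload (check_irq_off(), numa_node_id() and
    cpu_cache_get() are real mm/slab.c helpers; the placement is illustrative):

        /* inside cache_alloc_refill(), at the top of the retry path */
        retry:
                check_irq_off();
                node = numa_node_id();       /* re-read: we may have migrated  */
                ac = cpu_cache_get(cachep);  /* while irqs were briefly enabled */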

    The error is very difficult to hit. When it does occur,
    one gets the following oops + stack traceback bits in
    check_spinlock_acquired:

    kernel BUG at mm/slab.c:2417
    cache_alloc_refill+0xe6
    kmem_cache_alloc+0xd0
    ...

    This patch was developed against 2.6.23, ported to and
    compiled-tested only against 2.6.25-rc4.

    Signed-off-by: Joe Korty
    Signed-off-by: Christoph Lameter

    Joe Korty
     
  • SLUB should pack even small objects nicely into cachelines if that is what
    has been asked for. Use the same algorithm as SLAB for this.

    The effect of this patch for a system with a cacheline size of 64
    bytes is that the 24 byte sized slab caches will now put exactly
    2 objects into a cacheline instead of 3 with some overlap into
    the next cacheline. This reduces the object density in a 4k slab
    from 170 to 128 objects (same as SLAB).
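
    A sketch of the SLAB-style heuristic being adopted; the halving loop
    mirrors SLAB's calculate_alignment(), while the wrapper itself is
    illustrative:

        /* e.g. size = 24, cache_line_size() = 64:
         *   24 <= 32 -> align = 32;  24 <= 16 fails -> stop.
         * Objects get padded to 32 bytes: exactly 2 per 64-byte line,
         * and 4096 / 32 = 128 objects per 4k slab.                     */
        static unsigned long hwcache_align(unsigned long size)
        {
                unsigned long align = cache_line_size();

                while (size <= align / 2)
                        align /= 2;
                return align;
        }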

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter

    Nick Piggin
     
  • Make them all use angle brackets and the directory name.
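
    For example (the specific header is illustrative):

        #include <linux/slab.h>    /* preferred: angle brackets + directory */
                                   /* rather than:  #include "slab.h"       */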

    Acked-by: Pekka Enberg
    Signed-off-by: Joe Perches
    Signed-off-by: Christoph Lameter

    Joe Perches
     
  • The NUMA fallback logic should be passing local_flags to kmem_get_pages() and not simply the
    flags passed in.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     
  • The remote frees are in the freelist of the page and not in the
    percpu freelist.

    Reviewed-by: Pekka Enberg
    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

05 Mar, 2008

12 commits

  • Adam Litke noticed that currently we grow the hugepage pool independently of
    any cpuset the running process may be in, but when shrinking the pool, the cpuset
    is checked. This leads to inconsistency when shrinking the pool in a
    restricted cpuset -- an administrator may have been able to grow the pool on a
    node restricted by a containing cpuset, but they cannot shrink it there.

    There are two options: either prevent growing of the pool outside of the
    cpuset or allow shrinking outside of the cpuset. From previous discussions
    on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
    should not be restricted by cpusets. So allow shrinking the pool by removing
    pages from nodes outside of current's cpuset.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Irwin
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • A hugetlb reservation may be inadequately backed in the event of racing
    allocations and frees when utilizing surplus huge pages. Consider the
    following series of events in processes A and B:

    A) Allocates some surplus pages to satisfy a reservation
    B) Frees some huge pages
    A) A notices the extra free pages and drops hugetlb_lock to free some of
    its surplus pages back to the buddy allocator.
    B) Allocates some huge pages
    A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()

    Avoid this by committing the reservation after pages have been allocated but
    before dropping the lock to free excess pages. For parity, release the
    reservation in return_unused_surplus_pages().
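
    A hedged sketch of the ordering (hugetlb_lock and resv_huge_pages are from
    mm/hugetlb.c of that era; delta names the reservation size and the body is
    condensed):

        spin_lock(&hugetlb_lock);
        /* ... decide how many of the freshly allocated surplus pages
         *     are actually needed ... */

        resv_huge_pages += delta;     /* commit the reservation while  */
                                      /* still holding the lock        */

        /* only now drop the lock to hand excess pages back to the
         * buddy allocator; a racing free/alloc can no longer steal
         * the pages backing this reservation */
        spin_unlock(&hugetlb_lock);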

    This patch also corrects the cpuset_mems_nr() error path in
    hugetlb_acct_memory(). If the cpuset check fails, uncommit the
    reservation, but also be sure to return any surplus huge pages that may
    have been allocated to back the failed reservation.

    Thanks to Andy Whitcroft for discovering this.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
    called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
    I couldn't see what racing tasks on other cpus were doing, but surmise that
    another must have been in mem_cgroup_charge_common on the same page, between
    its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
    which I'd almost changed to a kmalloc).

    Normally such a race cannot happen, the ref_cnt prevents it, the final
    uncharge cannot race with the initial charge. But force_empty buggers the
    ref_cnt, that's what it's all about; and thereafter forced pages are
    vulnerable to races such as this (just think of a shared page also mapped into
    an mm of another mem_cgroup than that just emptied). And remain vulnerable
    until they're freed indefinitely later.

    This patch just fixes the oops by moving the unlock_page_cgroups down below
    adding to and removing from the list (only possible given the previous patch);
    and while we're at it, we might as well make it an invariant that
    page->page_cgroup is always set while pc is on lru.

    But this behaviour of force_empty seems highly unsatisfactory to me: why have
    a ref_cnt if we always have to cope with it being violated (as in the earlier
    page migration patch). We may prefer force_empty to move pages to an orphan
    mem_cgroup (could be the root, but better not), from which other cgroups could
    recover them; we might need to reverse the locking again; but no time now for
    such concerns.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As for force_empty, though this may not be the main topic here,
    mem_cgroup_force_empty_list() can be implemented more simply. It is possible to
    make the function just call mem_cgroup_uncharge_page() instead of releasing
    page_cgroups by itself. The tip is to call get_page() before invoking
    mem_cgroup_uncharge_page(), so the page won't be released during this
    function.
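
    A hedged sketch of the simplified loop body (illustrative; mz->lru_lock is
    the per-zone lru lock referred to below):

        /* pin the page so it cannot be freed while we uncharge it */
        get_page(page);
        spin_unlock_irqrestore(&mz->lru_lock, flags);

        mem_cgroup_uncharge_page(page);   /* drops the page_cgroup */

        put_page(page);                   /* release our pin       */
        spin_lock_irqsave(&mz->lru_lock, flags);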

    Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
    the page might have been reassigned to an lru of a different mem_cgroup, and
    now be emptied from that; but Hugh claims that's okay, the end state is the
    same as when it hasn't gone to another list.

    And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
    mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
    holding page_cgroup lock (but still has to use try_lock_page_cgroup).

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
    into page freeing, I've hit it from time to time in testing on some machines,
    sometimes only after many days. Recently found a machine which could usually
    produce it within a few hours, which got me there at last.

    The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
    arrangement of structures was such that you got page_cgroups from the lru list
    neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
    independently.

    The main problem was that it was missing the lock_page_cgroup it needs to
    safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
    couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
    comments on the constraints.

    This patch immediately gets replaced by a simpler one from Hirokazu-san; but
    is it just foolish pride that tells me to put this one on record, in case we
    need to come back to it later?

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
    it, and before removing page_cgroup from one of its lru lists: isn't there a
    danger that struct mem_cgroup memory could be freed and reused before
    completing that, so corrupting something? Never seen it, and for all I know
    there may be other constraints which make it impossible; but let's be
    defensive and reverse the ordering there.
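
    A hedged sketch of the reversed ordering (function names as used elsewhere
    in this log; the snippet is illustrative):

        /* old order dropped the css reference first, so the mem_cgroup
         * could in principle be freed and reused while still in use:
         *
         *      css_put(&mem->css);
         *      res_counter_uncharge(&mem->res, PAGE_SIZE);
         *      __mem_cgroup_remove_list(pc);
         *
         * defensive order: finish using the mem_cgroup, then drop it */
        res_counter_uncharge(&mem->res, PAGE_SIZE);
        __mem_cgroup_remove_list(pc);
        css_put(&mem->css);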

    mem_cgroup_force_empty_list is safe because there's an extra css_get around
    all its works; but even so, change its ordering the same way round, to help
    get in the habit of doing it like this.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove clear_page_cgroup: it's an unhelpful helper, see for example how
    mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
    (serious races from that? I'm not sure).

    Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
    atomic: it's always manipulated under lock_page_cgroup, except where
    force_empty unilaterally reset it to 0 (and how does uncharge's
    atomic_dec_and_test protect against that?).

    Simplify this page_cgroup locking: if you've got the lock and the pc is
    attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
    check that pc->page matches page (we're on the way to finding why sometimes it
    doesn't, but this patch doesn't fix that).

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • More cleanup to memcontrol.c, this time changing some of the code generated.
    Let the compiler decide what to inline (except for page_cgroup_locked which is
    only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc.
    was quite a waste since bit_spin_lock etc. are inlines in a header file; made
    mem_cgroup_force_empty and mem_cgroup_write_strategy static.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sorry, before getting down to more important changes, I'd like to do some
    cleanup in memcontrol.c. This patch doesn't change the code generated, but
    cleans up whitespace, moves up a double declaration, removes an unused enum,
    removes void returns, removes misleading comments, that kind of thing.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page (a
    trivial wrapper around it) and mem_cgroup_end_migration (which does the same
    as mem_cgroup_uncharge_page). And it often ends up having to lock just to let
    its caller unlock. Remove it (but leave the silly locking until a later
    patch).

    Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
    mem_cgroup_charge_common. It seemed convenient at the time, but hard to
    justify now: there's a perfectly appropriate swappage to charge and uncharge
    instead, this is not on any hot path through shmem_getpage, and no performance
    hit was observed from the slight extra overhead.

    So revert that NULL page handling from mem_cgroup_charge_common; and make it
    clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
    was a helper I found more of a hindrance to understanding.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
    page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
    were set here, it'd likely cause corruption when the page is reused.

    Don't use page_assign_page_cgroup to clear it: that should be private to
    memcontrol.c, and always called with the lock taken; and memmap_init_zone
    doesn't need it either - like page->mapping and other pointers throughout the
    kernel, Linux assumes pointers in zeroed structures are NULL pointers.

    Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins