12 Jan, 2013

8 commits

  • Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
    waiting for POLLIN on a local TCP socket. It was easier to trigger if
    there was disk IO and dirty pages at the same time and he bisected it to
    commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
    immediately when it is made available").

    The intention of that patch was to improve high-order allocations under
    memory pressure after changes made to reclaim in 3.6 drastically hurt
    THP allocations, but the approach was flawed. For Eric, the problem was
    that page->pfmemalloc was not being cleared for captured pages, leading
    to a poor interaction with swap-over-NFS support that caused packets to
    be dropped. However, I identified a few more problems with the patch,
    including the fact that it can increase contention on zone->lock in some
    cases, which could result in async direct compaction being aborted early.

    In retrospect the capture patch took the wrong approach. What it should
    have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
    was allocating for THP and avoid races that way. While the patch was
    shown to improve allocation success rates at the time, the benefit is
    marginal given the relative complexity, and it should be revisited from
    scratch in the context of the other reclaim-related changes that have
    taken place since the patch was first written and tested. This patch
    partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
    suitable high-order page immediately when it is made available").

    Reported-and-tested-by: Eric Wong
    Tested-by: Eric Dumazet
    Cc:
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Zhouping Liu reported the following against 3.8-rc1 when running a mmap
    testcase from LTP.

    mapcount 0 page_mapcount 3
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:1798!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas
    CPU 1
    Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
    RIP: __split_huge_page+0x677/0x6d0
    RSP: 0000:ffff88017a03fc08 EFLAGS: 00010293
    RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2
    RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246
    RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a
    R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000
    R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000
    FS: 00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0)
    Stack:
    ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88
    ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c
    00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8
    Call Trace:
    split_huge_page+0x68/0xb0
    __split_huge_page_pmd+0x134/0x330
    split_huge_page_pmd_mm+0x51/0x60
    split_huge_page_address+0x3b/0x50
    __vma_adjust_trans_huge+0x9c/0xf0
    vma_adjust+0x684/0x750
    __split_vma.isra.28+0x1fa/0x220
    do_munmap+0xf9/0x420
    vm_munmap+0x4e/0x70
    sys_munmap+0x2b/0x40
    system_call_fastpath+0x16/0x1b

    Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton
    identified that commit 5a505085f043 ("mm/rmap: Convert the struct
    anon_vma::mutex to an rwsem") and commit 4fc3f1d66b1e ("mm/rmap,
    migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable")
    were likely the problem. Reverting these commits was reported to solve
    the problem for Alexander.

    Despite the reason for these commits, NUMA balancing is not the direct
    source of the problem. split_huge_page() expects the anon_vma lock to
    be exclusive to serialise the whole split operation. Ordinarily it is
    expected that the anon_vma lock would only be required when updating the
    avcs but THP also uses the anon_vma rwsem for collapse and split
    operations where the page lock or compound lock cannot be used (as the
    page is changing from base to THP or vice versa) and the page table
    locks are insufficient.

    This patch takes the anon_vma lock for write to serialise against parallel
    split_huge_page operations, as THP expected before the conversion to an
    rwsem.
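
    A minimal sketch of the change described above, inside split_huge_page()
    (the lock helper name follows the rwsem conversion; treat this as
    illustrative, not the verbatim diff):

    anon_vma_lock_write(anon_vma);      /* exclusive: serialises parallel splits */
    __split_huge_page(page, anon_vma);  /* the split itself runs under the write lock */
    /* ... release the anon_vma lock once the split is complete ... */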

    Reported-and-tested-by: Zhouping Liu
    Reported-by: Alexander Beregalov
    Reported-by: Alex Xu
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an
    rwsem") turned anon_vma mutex to rwsem.

    However, the properly annotated nested locking in mm_take_all_locks()
    has been converted from

    mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);

    to

    down_write(&anon_vma->root->rwsem);

    which is incomplete, and causes the false positive report from lockdep
    below.

    Annotate the fact that mmap_sem is used as an outer lock to serialize
    taking of all the anon_vma rwsems at once no matter the order, using the
    down_write_nest_lock() primitive.
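
    Concretely, the annotated form in mm_take_all_locks() becomes something
    like:

    down_write_nest_lock(&anon_vma->root->rwsem, &mm->mmap_sem);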

    This patch fixes this lockdep report:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.8.0-rc2-00036-g5f73896 #171 Not tainted
    ---------------------------------------------
    qemu-kvm/2315 is trying to acquire lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    but task is already holding lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&anon_vma->rwsem);
    lock(&anon_vma->rwsem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by qemu-kvm/2315:
    #0: (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
    #1: (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
    #2: (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
    #3: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    stack backtrace:
    Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171
    Call Trace:
    print_deadlock_bug+0xf2/0x100
    validate_chain+0x4f6/0x720
    __lock_acquire+0x359/0x580
    lock_acquire+0x121/0x190
    down_write+0x3f/0x70
    mm_take_all_locks+0x149/0x1b0
    do_mmu_notifier_register+0x68/0x170
    mmu_notifier_register+0xe/0x10
    kvm_create_vm+0x22b/0x330 [kvm]
    kvm_dev_ioctl+0xf8/0x1a0 [kvm]
    do_vfs_ioctl+0x9d/0x350
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Jiri Kosina
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • Currently free_all_bootmem_core ignores that node_min_pfn may not be a
    multiple of BITS_PER_LONG. E.g. commit 6dccdcbe2c3e ("mm: bootmem: fix
    checking the bitmap when finally freeing bootmem") shifts vec by the
    lower bits of start instead of the lower bits of idx. Also

    if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL)

    assumes that vec bit 0 corresponds to the start pfn, which is only true
    when node_min_pfn is a multiple of BITS_PER_LONG. Also, the loop in the
    else clause can double-free pages (e.g. with node_min_pfn == start == 1
    and map[0] == ~0 on a 32-bit machine, page 32 will be double-freed).

    This bug causes the following message during xtensa kernel boot:

    bootmem::free_all_bootmem_core nid=0 start=1 end=8000
    BUG: Bad page state in process swapper pfn:00001
    page:d04bd020 count:0 mapcount:-127 mapping: (null) index:0x2
    page flags: 0x0()
    Call Trace:
    bad_page+0x8c/0x9c
    free_pages_prepare+0x5e/0x88
    free_hot_cold_page+0xc/0xa0
    __free_pages+0x24/0x38
    __free_pages_bootmem+0x54/0x56
    free_all_bootmem_core$part$11+0xeb/0x138
    free_all_bootmem+0x46/0x58
    mem_init+0x25/0xa4
    start_kernel+0x11e/0x25c
    should_never_return+0x0/0x3be7

    The fix is the following:
    - always align vec so that its bit 0 corresponds to start
    - provide BITS_PER_LONG bits in vec, if those bits are available in the
    map
    - don't free pages past next start position in the else clause.
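
    A small userspace sketch of the first two points (the helper and map
    layout are illustrative, not the bootmem code; a set bit means the page
    is reserved, as in the kernel's bitmap):

    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Build a word whose bit 0 corresponds to pfn 'start', from a bitmap
     * whose bit 0 corresponds to 'node_min_pfn' (set = reserved). */
    static unsigned long vec_for(const unsigned long *map,
                                 unsigned long node_min_pfn, unsigned long start)
    {
            unsigned long idx = start - node_min_pfn;
            unsigned long off = idx % BITS_PER_LONG;
            unsigned long vec = ~map[idx / BITS_PER_LONG] >> off;

            /* complete the word from the next map word (the kernel also
             * checks that those extra bits actually exist in the map) */
            if (off)
                    vec |= ~map[idx / BITS_PER_LONG + 1] << (BITS_PER_LONG - off);
            return vec;
    }

    int main(void)
    {
            /* pfns 1..4 reserved, everything after that free */
            unsigned long map[2] = { 0xF, 0 };
            unsigned long vec = vec_for(map, 1, 5);

            printf("pfn 5 free?      %s\n", (vec & 1) ? "yes" : "no");
            printf("whole word free? %s\n", vec == ~0UL ? "yes" : "no");
            return 0;
    }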

    Signed-off-by: Max Filippov
    Cc: Gavin Shan
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yinghai Lu
    Cc: Joonsoo Kim
    Cc: Prasad Koya
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • The current calculation in pfn_to_bitidx assumes that (pfn -
    zone->zone_start_pfn) >> pageblock_order will return the same bit for
    all pfns in a pageblock. If zone_start_pfn is not aligned to
    pageblock_nr_pages, this may not always be correct.

    Consider the following with pageblock order = 10, zone start 2MB:

    pfn | pfn - zone start | (pfn - zone start) >> page block order
    ----------------------------------------------------------------
    0x26000 | 0x25e00 | 0x97
    0x26100 | 0x25f00 | 0x97
    0x26200 | 0x26000 | 0x98
    0x26300 | 0x26100 | 0x98

    This means that calling {get,set}_pageblock_migratetype on a single page
    will not set the migratetype for the full block. Fix this by rounding
    down zone_start_pfn when doing the bitidx calculation.
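
    A small userspace illustration of the rounding, using the values from the
    table above (this computes only the pageblock index; the real
    pfn_to_bitidx() additionally scales by the number of bits per pageblock):

    #include <stdio.h>

    #define PAGEBLOCK_ORDER    10UL
    #define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

    static unsigned long old_bitidx(unsigned long pfn, unsigned long zone_start)
    {
            return (pfn - zone_start) >> PAGEBLOCK_ORDER;
    }

    static unsigned long fixed_bitidx(unsigned long pfn, unsigned long zone_start)
    {
            zone_start &= ~(PAGEBLOCK_NR_PAGES - 1);   /* round down to a pageblock */
            return (pfn - zone_start) >> PAGEBLOCK_ORDER;
    }

    int main(void)
    {
            unsigned long zone_start = 0x200;   /* 2MB with 4KB pages, not aligned */
            unsigned long pfns[] = { 0x26000, 0x26100, 0x26200, 0x26300 };

            for (int i = 0; i < 4; i++)
                    printf("pfn 0x%lx: old block index 0x%lx, fixed block index 0x%lx\n",
                           pfns[i], old_bitidx(pfns[i], zone_start),
                           fixed_bitidx(pfns[i], zone_start));
            return 0;   /* old: 0x97 0x97 0x98 0x98 -- fixed: 0x98 for all four */
    }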

    For our use case, the effects of this bug were mostly tied to the fact
    that CMA allocations would either take a long time or fail to happen.
    Depending on the driver using CMA, this could result in anything from
    visual glitches to application failures.

    Signed-off-by: Laura Abbott
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • When running the following command under a shell, it returns an error:

    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:

    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows that the kernel returned 3 (COMPACT_COMPLETE) from the write
    to compact_memory.

    The fix is to make the system return 0 instead of 3 (COMPACT_COMPLETE)
    from sysctl_compaction_handler after compaction_nodes has finished.
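
    The bogus second write in the strace output is a direct consequence: once
    write() claims to have written more bytes than were requested, the
    caller's remaining byte count underflows. A tiny illustration (a 32-bit
    count wraps to 4294967295, as seen above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t count = 2;   /* echo asked to write "1\n" */
            uint32_t ret   = 3;   /* buggy handler leaked COMPACT_COMPLETE (3) */

            count -= ret;         /* 2 - 3 wraps around */
            printf("bytes the caller still thinks it must write: %u\n", count);
            return 0;             /* prints 4294967295 */
    }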

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Liu
     
  • The memmove span covers from (next+1) to the end of the array, and the
    index of next is (i+1), so the index of (next+1) is (i+2). So the number
    of remaining array elements is (type->cnt - (i + 2)).
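
    A userspace analogue of that index arithmetic, with plain ints standing in
    for memblock regions (illustrative only):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            int regions[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
            size_t cnt = 8, i = 2;
            size_t next = i + 1;            /* element removed by the merge */

            /* Elements (next + 1)..(cnt - 1) slide down by one:
             * that is cnt - (i + 2) elements, not cnt - (i + 1). */
            memmove(&regions[next], &regions[next + 1],
                    (cnt - (i + 2)) * sizeof(regions[0]));
            cnt--;

            for (size_t k = 0; k < cnt; k++)
                    printf("%d ", regions[k]);
            printf("\n");                   /* 0 1 2 4 5 6 7 */
            return 0;
    }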

    Since the remaining elements of the memblock array are moved forward by
    one element and the bug only adds one extra element to the count, there
    won't be any write overflow here, only a read overflow. It may read one
    element past the end of the array if the array happens to be full.
    Commonly that doesn't matter at all, but if the array happens to be
    located at the end of a memory block, it may cause an invalid read of a
    physical address that doesn't exist.

    There are two *happens to be* conditions here, so I think the probability
    is quite low, and I don't know whether anyone has ever been bitten by
    this bug before.

    Mostly I think it's user-invisible.

    Signed-off-by: Lin Feng
    Acked-by: Tejun Heo
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lin Feng
     
  • Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
    not check page_count before migrating like base page migration and
    khugepage. He could not see why this was safe and he is right.

    The potential impact of the bug is avoided due to the limitations of
    NUMA balancing. The page_mapcount() check ensures that only a single
    address space is using this page and as THPs are typically private it
    should not be possible for another address space to fault it in
    parallel. If the address space has one associated task then it's
    difficult to have both a GUP pin and be referencing the page at the same
    time. If there are multiple tasks then a buggy scenario requires that
    another thread be accessing the page while the direct IO is in flight.
    This is dodgy behaviour, as there is a possibility of corruption with or
    without THP migration.

    While we happen to be safe for the most part, it is shoddy to depend on
    such "safety", so this patch checks the page count similar to what is
    done for anonymous pages. Note that this does not mean that the
    page_mapcount() check can go away. If we were to remove the
    page_mapcount() check, the THP would have to be unmapped from all
    referencing PTEs, replaced with migration PTEs and restored properly
    afterwards.

    Signed-off-by: Mel Gorman
    Reported-by: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Jan, 2013

1 commit

  • The check for a pmd being in the process of being split was dropped by
    mistake by commit d10e63f29488 ("mm: numa: Create basic numa page
    hinting infrastructure"). Put it back.

    Reported-by: Dave Jones
    Debugged-by: Hillf Danton
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jan, 2013

2 commits

  • Since commit e303297e6c3a ("mm: extended batches for generic
    mmu_gather") we are batching pages to be freed until either
    tlb_next_batch cannot allocate a new batch or we are done.

    This works just fine most of the time, but we can get into trouble with
    a non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
    on large machines, where overly aggressive batching might lead to soft
    lockups in the process exit path (exit_mmap) because there are no
    scheduling points down the free_pages_and_swap_cache path, so the
    freeing can take long enough to trigger the soft lockup.

    The lockup is harmless except when the system is set up to panic on
    soft lockup, which is not that unusual.

    The simplest way to work around this issue is to limit the maximum
    number of batches in a single mmu_gather. 10k collected pages should be
    safe against soft lockups (that leaves roughly 2ms per page before the
    soft lockup threshold) even if they are all freed without an explicit
    scheduling point.

    This patch doesn't add any new explicit scheduling points because it
    relies on zap_pmd_range, which calls cond_resched once per PMD during
    page table zapping.
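
    A minimal sketch of such a cap in tlb_next_batch() (the field and constant
    names are illustrative of the ~10k-page bound, not the verbatim patch):

    /* refuse to grow the gather list once enough batches for ~10k pages
     * exist; tlb_next_batch() then fails and the caller flushes what it has */
    if (tlb->batch_count >= MAX_GATHER_BATCH_COUNT)
            return 0;
    tlb->batch_count++;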

    The following lockup has been reported for a 3.0 kernel with a huge
    process (on the order of hundreds of GB, but I don't know any more
    details).

    BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
    Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
    Supported: Yes
    CPU 56
    Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
    RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10
    RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206
    RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
    RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
    RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
    R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
    R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
    FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
    Call Trace:
    release_pages+0xc5/0x260
    free_pages_and_swap_cache+0x9d/0xc0
    tlb_flush_mmu+0x5c/0x80
    tlb_finish_mmu+0xe/0x50
    exit_mmap+0xbd/0x120
    mmput+0x49/0x120
    exit_mm+0x122/0x160
    do_exit+0x17a/0x430
    do_group_exit+0x3d/0xb0
    get_signal_to_deliver+0x247/0x480
    do_signal+0x71/0x1b0
    do_notify_resume+0x98/0xb0
    int_signal+0x12/0x17
    DWARF2 unwinder stuck at int_signal+0x12/0x17

    Signed-off-by: Michal Hocko
    Cc: [3.0+]
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever
    problem") added an isolated pageblocks counter (nr_pageblock_isolate in
    struct zone) and used it to adjust the free pages counter in
    zone_watermark_ok_safe() to prevent the kswapd-looping-forever problem.

    Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages")
    fixed the accounting of isolated pages in the global free pages counter.
    That made the previous zone_watermark_ok_safe() fix unnecessary and
    potentially harmful (because now isolated pages may be accounted twice,
    making the free pages counter incorrect).

    This patch removes the special isolated pageblocks counter altogether,
    which fixes the zone_watermark_ok_safe() free pages check.

    Reported-by: Tomasz Stanislawski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the use of __devinit from the file.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

03 Jan, 2013

3 commits

  • Sasha was fuzzing with trinity and reported the following problem:

    BUG: sleeping function called from invalid context at kernel/mutex.c:269
    in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
    2 locks held by trinity-main/6361:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0
    #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0
    Pid: 6361, comm: trinity-main Tainted: G W
    3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
    Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

    This was triggered by a different version of automatic NUMA balancing
    but in theory the current version is vulnerable to the same problem.

    do_numa_page
    -> numa_migrate_prep
    -> mpol_misplaced
    -> get_vma_policy
    -> shmem_get_policy

    It's very unlikely this will happen as shared pages are not marked
    pte_numa -- see the page_mapcount() check in change_pte_range() -- but
    it is possible.

    To address this, this patch restores sp->lock as originally implemented
    by Kosaki Motohiro. In the path where get_vma_policy() is called, it
    should not be calling sp_alloc() so it is not necessary to treat the PTL
    specially.
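
    In effect, the shared policy lock goes back to being a spinlock, roughly
    (a sketch, not the full diff):

    struct shared_policy {
            struct rb_root root;
            spinlock_t lock;        /* restored: a mutex cannot be taken under the PTL */
    };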

    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused argument (formerly no_context) from mpol_parse_str()
    and from mpol_to_str().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
    mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts
    or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
    in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
    pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
    worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic.

    Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
    when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when
    no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
    which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
    With slab poisoning, you can then rely on mpol_to_str() to set the bit
    for node 0x6b6b, probably in the next page above the caller's stack.

    mpol_parse_str() is only called from shmem_parse_options(): no_context
    is always true, so call it unused for now, and remove !no_context code.
    Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
    expect. Then mpol_to_str() can ignore its no_context argument also,
    the mpol being appropriately initialized whether contextualized or not.
    Rename its no_context unused too, and let subsequent patch remove them
    (that's not needed for stable backporting, which would involve rejects).
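
    A sketch of the initialization rule described above, as it might look at
    the end of mpol_parse_str() (illustrative of the rule, not necessarily the
    exact hunk):

    /* Save nodes for mpol_to_str() and later contextualization. */
    if (mode != MPOL_PREFERRED)
            new->v.nodes = nodes;
    else if (nodelist)
            new->v.preferred_node = first_node(nodes);
    else
            new->flags |= MPOL_F_LOCAL;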

    I don't understand why MPOL_LOCAL is described as a pseudo-policy:
    it's a reasonable policy which suffers from a confusing implementation
    in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be
    much more robust if MPOL_LOCAL were recognized in switch statements
    throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
    empty) nodes mask like everyone else, instead of its preferred_node
    variant (I presume an optimization from the days before MPOL_LOCAL).
    But that would take me too long to get right and fully tested.

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Dec, 2012

1 commit

  • An unintended consequence of commit 4ae0a48b5efc ("mm: modify
    pgdat_balanced() so that it also handles order-0") is that
    wait_iff_congested() can now be called with a NULL 'struct zone *',
    producing a kernel oops like this:

    BUG: unable to handle kernel NULL pointer dereference
    IP: [] wait_iff_congested+0x59/0x140

    This trivial patch fixes it.

    Reported-by: Zhouping Liu
    Reported-and-tested-by: Sedat Dilek
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Zlatko Calusic
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

24 Dec, 2012

1 commit


21 Dec, 2012

9 commits

  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton : (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Commit 648bb56d076b ("cgroup: lock cgroup_mutex in cgroup_init_subsys()")
    made cgroup_init_subsys() grab cgroup_mutex before invoking
    ->css_alloc() for the root css. Because memcg registers hotcpu notifier
    from ->css_alloc() for the root css, this introduced a circular locking
    dependency between cgroup_mutex and CPU hotplug.

    Fix it by moving hotcpu notifier registration to a subsys initcall.
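
    A sketch of the shape of that move (the initcall and callback names here
    are illustrative, not necessarily those in the patch):

    /* register the CPU hotplug callback once at boot, not from ->css_alloc() */
    static int __init memcg_hotcpu_init(void)
    {
            hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
            return 0;
    }
    subsys_initcall(memcg_hotcpu_init);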

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.7.0-rc4-work+ #42 Not tainted
    -------------------------------------------------------
    bash/645 is trying to acquire lock:
    (cgroup_mutex){+.+.+.}, at: [] cgroup_lock+0x17/0x20

    but task is already holding lock:
    (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    get_online_cpus+0x3c/0x60
    rebuild_sched_domains_locked+0x1b/0x70
    cpuset_write_resmask+0x298/0x2c0
    cgroup_file_write+0x1ef/0x300
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b

    -> #0 (cgroup_mutex){+.+.+.}:
    __lock_acquire+0x14ce/0x1d20
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    cgroup_lock+0x17/0x20
    cpuset_handle_hotplug+0x1b/0x560
    cpuset_update_active_cpus+0xe/0x10
    cpuset_cpu_inactive+0x47/0x50
    notifier_call_chain+0x66/0x150
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x20/0x40
    _cpu_down+0x7e/0x2f0
    cpu_down+0x36/0x50
    store_online+0x5d/0xe0
    dev_attr_store+0x18/0x30
    sysfs_write_file+0xe0/0x150
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b
    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(cpu_hotplug.lock);
    lock(cgroup_mutex);
    lock(cpu_hotplug.lock);
    lock(cgroup_mutex);

    *** DEADLOCK ***

    5 locks held by bash/645:
    #0: (&buffer->mutex){+.+.+.}, at: [] sysfs_write_file+0x48/0x150
    #1: (s_active#42){.+.+.+}, at: [] sysfs_write_file+0xc8/0x150
    #2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [] cpu_hotplug_driver_lock+0x17/0x20
    #3: (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x20
    #4: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2f/0x60

    stack backtrace:
    Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42
    Call Trace:
    print_circular_bug+0x28e/0x29f
    __lock_acquire+0x14ce/0x1d20
    lock_acquire+0x97/0x1e0
    mutex_lock_nested+0x61/0x3b0
    cgroup_lock+0x17/0x20
    cpuset_handle_hotplug+0x1b/0x560
    cpuset_update_active_cpus+0xe/0x10
    cpuset_cpu_inactive+0x47/0x50
    notifier_call_chain+0x66/0x150
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x20/0x40
    _cpu_down+0x7e/0x2f0
    cpu_down+0x36/0x50
    store_online+0x5d/0xe0
    dev_attr_store+0x18/0x30
    sysfs_write_file+0xe0/0x150
    vfs_write+0xa8/0x160
    sys_write+0x52/0xa0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Clarify error messages and correct a few typos in the transparent hugepage
    sysfs init code.

    Signed-off-by: Jeremy Eder
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Eder
     
  • Memory returned to free_contig_range() must have no other references.
    Let the kernel complain loudly if the page reference count is not equal to 1.

    [rientjes@google.com: support sparsemem]
    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Acked-by: Michal Nazarewicz
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • The system uses global_dirtyable_memory() to calculate the number of
    dirtyable pages, i.e. pages that can be allocated to the page cache. A
    bug causes an underflow, making the page count look like a huge unsigned
    number. This in turn causes the dirty writeback throttling to
    aggressively write back pages as they become dirty (usually one page at
    a time). This generally only affects systems with highmem because the
    underflowed count gets subtracted from the global count of dirtyable
    memory.

    The problem was introduced with v3.2-4896-gab8fabd

    The fix is to ensure we don't get an underflowed total of either highmem
    or global dirtyable memory.
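
    A tiny illustration of the underflow and the clamp (illustrative numbers;
    in the kernel the subtraction involves highmem pages and the dirtyable
    reserve):

    #include <stdio.h>

    static unsigned long dirtyable(unsigned long pages, unsigned long reserve)
    {
            if (pages <= reserve)
                    return 0;               /* the fix: never let this wrap */
            return pages - reserve;
    }

    int main(void)
    {
            unsigned long pages = 1000, reserve = 1500;

            printf("unclamped: %lu\n", pages - reserve);    /* huge bogus value */
            printf("clamped:   %lu\n", dirtyable(pages, reserve));
            return 0;
    }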

    Signed-off-by: Sonny Rao
    Signed-off-by: Puneet Kumar
    Acked-by: Johannes Weiner
    Tested-by: Damien Wyart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonny Rao
     
  • isolate_freepages_block() and isolate_migratepages_range() are used by
    CMA as well as by compaction, so the build breaks with CONFIG_CMA &&
    !CONFIG_COMPACTION.

    This patch fixes it.

    [akpm@linux-foundation.org: add "do { } while (0)", per Mel]
    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • …/linux-fs into for-linus

    Al Viro
     
  • Removed vmtruncate

    Signed-off-by: Marco Stornelli
    Signed-off-by: Al Viro

    Marco Stornelli
     
  • Pull virtio update from Rusty Russell:
    "Some nice cleanups, and even a patch my wife did as a "live" demo for
    Latinoware 2012.

    There's a slightly non-trivial merge in virtio-net, as we cleaned up
    the virtio add_buf interface while DaveM accepted the mq virtio-net
    patches."

    * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (27 commits)
    virtio_console: Add support for remoteproc serial
    virtio_console: Merge struct buffer_token into struct port_buffer
    virtio: add drv_to_virtio to make code clearly
    virtio: use dev_to_virtio wrapper in virtio
    virtio-mmio: Fix irq parsing in command line parameter
    virtio_console: Free buffers from out-queue upon close
    virtio: Convert dev_printk(KERN_ to dev_(
    virtio_console: Use kmalloc instead of kzalloc
    virtio_console: Free buffer if splice fails
    virtio: tools: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: scsi: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: rpmsg: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: net: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: console: make it clear that virtqueue_add_buf() no longer returns > 0
    virtio: make virtqueue_add_buf() returning 0 on success, not capacity.
    virtio: console: don't rely on virtqueue_add_buf() returning capacity.
    virtio_net: don't rely on virtqueue_add_buf() returning capacity.
    virtio-net: remove unused skb_vnet_hdr->num_sg field
    virtio-net: correct capacity math on ring full
    virtio: move queue_index and num_free fields into core struct virtqueue.
    ...

    Linus Torvalds
     

20 Dec, 2012

2 commits

  • The rmap walks in ksm.c are like those in rmap.c: they can safely be
    done with anon_vma_lock_read().

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • On a 4GB RAM machine, where the Normal zone is much smaller than the
    DMA32 zone, the Normal zone gets fragmented over time. This requires
    relatively more pressure in balance_pgdat to get the zone above the
    required watermark. Unfortunately, the congestion_wait() call in there
    slows it down for a completely wrong reason, expecting that there's a
    lot of writeback/swapout, even when there's none (the much more common
    case). After a few days, when fragmentation progresses, this flawed
    logic translates to very high CPU iowait times, even though there's no
    I/O congestion at all. If THP is enabled, the problem occurs sooner,
    but I was able to see it even on !THP kernels, just by giving it a bit
    more time to occur.

    The proper way to deal with this is to not wait, unless there's
    congestion. Thanks to Mel Gorman, we already have the function that
    perfectly fits the job. The patch was tested on a machine which nicely
    revealed the problem after only 1 day of uptime, and it's been working
    great.
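
    Presumably the change in balance_pgdat() amounts to something like the
    following sketch, assuming the replacement helper is wait_iff_congested(),
    which only sleeps when the zone really is congested:

    /* before: sleep unconditionally, assuming writeback/swapout is in progress */
    congestion_wait(BLK_RW_ASYNC, HZ/10);

    /* after (sketch): only sleep if the zone's backing devices are congested */
    wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);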

    Signed-off-by: Zlatko Calusic
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

19 Dec, 2012

12 commits

  • Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim. If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
    => too_many_isolated() false
    => vmscan and run into dirty pages
    => pageout()
    => take some FS lock
    => fs/block code does GFP_NOIO allocation
    => enter direct reclaim again
    => too_many_isolated() true
    => waiting for others to progress, however the other
    tasks may be circular waiting for the FS lock..

    The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
    priority than normal ones, by lowering the throttle threshold for the
    latter.

    Allowing ~1/8 isolated pages for normal reclaim is large enough. For
    example, for a 1GB LRU list, that's ~128MB of isolated pages, or 1k
    blocked tasks (each isolates 32 4KB pages), or 64 blocked tasks per
    logical CPU (assuming 16 logical CPUs per NUMA node). So it's not likely
    some CPU goes idle waiting (when it could make progress) because of this
    limit: there are many more sleeping reclaim tasks than CPUs, so the task
    may well be blocked on some low-level queue/lock anyway.

    Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
    They will be blocked only when there are too many concurrent !GFP_IOFS
    reclaims; however, that's very unlikely because IO-less direct reclaims
    are able to progress much faster, and they won't deadlock each other.
    The threshold is raised high enough for them, so that there can be
    sufficient parallel progress of !GFP_IOFS reclaims.
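
    A sketch of the lowered threshold in too_many_isolated() (the exact shift
    is an assumption matching the ~1/8 figure above):

    /*
     * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so that
     * they won't get blocked by normal direct reclaimers and form a
     * circular deadlock. Normal reclaim throttles once ~1/8 is isolated.
     */
    if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
            inactive >>= 3;

    return isolated > inactive;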

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Wu Fengguang
    Cc: Torsten Kaiser
    Tested-by: NeilBrown
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Comment "Why it's doing so" rather than "What it does" as proposed by
    Andrew Morton.

    Signed-off-by: Wu Fengguang
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Replace the obsolete simple_strtoul() with kstrtoul().

    Signed-off-by: Abhijit Pawar
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Abhijit Pawar
     
  • Signed-off-by: Tang Chen
    Cc: Jiang Liu
    Cc: Lai Jiangshan
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Build a kernel with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
    CONFIG_CGROUP_HUGETLB=y, then specify the hugepagesz=xx boot option; the
    system will fail to boot.

    This failure is caused by following code path:

    setup_hugepagesz
    hugetlb_add_hstate
    hugetlb_cgroup_file_init
    cgroup_add_cftypes
    kzalloc

    Signed-off-by: Jiang Liu
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • A few gremlins have recently crept in.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Sasha Levin recently reported a lockdep problem resulting from the new
    attribute propagation introduced by the kmemcg series. In short,
    slab_mutex will be taken from within the sysfs attribute store function.
    This creates a dependency that will later be taken in the reverse order
    when a cache is destroyed, since destruction occurs with slab_mutex held
    and then calls into the sysfs directory removal function.

    In this patch, I propose to adopt a strategy close to what
    __kmem_cache_create does before calling sysfs_slab_add, and release the
    lock before the call to sysfs_slab_remove. This is pretty much the last
    operation in the kmem_cache_shutdown() path, so we could do better by
    splitting this out and moving this call alone to later on. That will fit
    nicely once sysfs handling is consistent between all caches, but would
    look weird now.
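
    The shape of the change in the destruction path is roughly the following
    (a sketch of the described strategy, not the verbatim hunk):

    mutex_unlock(&slab_mutex);      /* sysfs removal must not nest inside slab_mutex */
    sysfs_slab_remove(s);
    mutex_lock(&slab_mutex);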

    Lockdep info:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
    -------------------------------------------------------
    trinity-child13/6961 is trying to acquire lock:
    (s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:
    -> #1 (slab_mutex){+.+.+.}:
    lock_acquire+0x1aa/0x240
    __mutex_lock_common+0x59/0x5a0
    mutex_lock_nested+0x3f/0x50
    slab_attr_store+0xde/0x110
    sysfs_write_file+0xfa/0x150
    vfs_write+0xb0/0x180
    sys_pwrite64+0x60/0xb0
    tracesys+0xe1/0xe6
    -> #0 (s_active#43){++++.+}:
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(slab_mutex);
    lock(s_active#43);
    lock(slab_mutex);
    lock(s_active#43);

    *** DEADLOCK ***

    2 locks held by trinity-child13/6961:
    #0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0
    #1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0

    stack backtrace:
    Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
    Call Trace:
    print_circular_bug+0x1fb/0x20c
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17

    Signed-off-by: Glauber Costa
    Reported-by: Sasha Levin
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This patch clarifies two aspects of cache attribute propagation.

    First, the expected context for the for_each_memcg_cache macro in
    memcontrol.h. The usages already in the codebase are safe. In mm/slub.c,
    it is trivially safe because the lock is acquired right before the loop.
    In mm/slab.c, it is less so: the lock is acquired by an outer function a
    few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
    indeed safe.
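
    For the mm/slab.c case, the added assertion would be essentially:

    VM_BUG_ON(!mutex_is_locked(&slab_mutex));  /* the walk relies on slab_mutex */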

    A comment is also added to detail why we are returning the value of the
    parent cache and ignoring the children's when we propagate the attributes.

    Signed-off-by: Glauber Costa
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • SLUB allows us to tune a particular cache behavior with sysfs-based
    tunables. When creating a new memcg cache copy, we'd like to preserve any
    tunables the parent cache already had.

    This can be done by tapping into the store attribute function provided by
    the allocator. We of course don't need to mess with read-only fields.
    Since the attributes can have multiple types and are stored internally by
    sysfs, the best strategy is to issue a ->show() in the root cache, and
    then ->store() in the memcg cache.

    The drawback of that is that sysfs can allocate up to a page of buffering
    for show(), which we are likely not to need, but also can't guarantee. To
    avoid always allocating a page for that, we can update the caches at store
    time with the maximum attribute size ever stored to the root cache. We
    will then get a buffer big enough to hold it. The corollary to this is
    that if no stores happened, nothing will be propagated.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.
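
    A self-contained illustration of the ->show()/->store() round trip
    described above (the structs are stand-ins, not the SLUB attribute
    machinery):

    #include <stdio.h>

    struct cache { int order; };                   /* stand-in for one tunable */

    struct attr_ops {
            int (*show)(struct cache *c, char *buf);
            int (*store)(struct cache *c, const char *buf, int len);
    };

    static int order_show(struct cache *c, char *buf)
    {
            return sprintf(buf, "%d", c->order);
    }

    static int order_store(struct cache *c, const char *buf, int len)
    {
            (void)len;
            return sscanf(buf, "%d", &c->order);
    }

    int main(void)
    {
            struct cache root = { .order = 3 }, memcg_copy = { .order = 0 };
            struct attr_ops ops = { order_show, order_store };
            char buf[64];

            /* propagate: read the tunable from the root cache, write it to the copy */
            int len = ops.show(&root, buf);
            ops.store(&memcg_copy, buf, len);

            printf("memcg copy order = %d\n", memcg_copy.order);   /* 3 */
            return 0;
    }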

    [akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • SLAB allows us to tune a particular cache behavior with tunables. When
    creating a new memcg cache copy, we'd like to preserve any tunables the
    parent cache already had.

    This could be done by an explicit call to do_tune_cpucache() after the
    cache is created. But this is not very convenient now that the caches are
    created from common code, since this function is SLAB-specific.

    Another method of doing that is taking advantage of the fact that
    do_tune_cpucache() is always called from enable_cpucache(), which is
    called at cache initialization. We can just preset the values, and then
    things work as expected.

    It can also happen that a root cache has its tunables updated during
    normal system operation. In this case, we will propagate the change to
    all caches that are already active.

    This change requires us to move the assignment of root_cache in
    memcg_params a bit earlier. We need this to be already set - which
    memcg_kmem_register_cache will do - when we reach __kmem_cache_create().

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • When we create caches in memcgs, we need to display their usage
    information somewhere. We'll adopt a scheme similar to /proc/meminfo,
    with aggregate totals shown in the global file, and per-group information
    stored in the group itself.

    For the time being, only reads are allowed in the per-group cache.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This means that when we destroy a memcg cache that happened to be empty,
    it may take a long time to go away: removing the memcg reference won't
    destroy it - because there are pending references - and the empty pages
    will stay there until a shrinker is called upon for any reason.

    In this patch, we will call kmem_cache_shrink() for all dead caches that
    cannot be destroyed because of remaining pages. After shrinking, it is
    possible that the cache can be freed. If this is not the case, we'll
    schedule a lazy worker to keep trying.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa