24 Feb, 2013

40 commits

  • We were deferring the kmemcg static branch increment to a later time,
    due to a nasty dependency between the cpu_hotplug lock, taken by the
    jump label update, and the cgroup_lock.

    Now we no longer take the cgroup lock, and we can save ourselves the
    trouble.
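
    A minimal sketch of the idea (the wrapper name memcg_activate_kmem_accounting
    is hypothetical; the static key and helper are the real kernel primitives):

    /* declared by the kmemcg code */
    extern struct static_key memcg_kmem_enabled_key;

    static void memcg_activate_kmem_accounting(struct mem_cgroup *memcg)
    {
            /*
             * Without cgroup_lock held here, bumping the jump label no longer
             * risks the cpu_hotplug lock vs. cgroup_lock ordering, so the
             * increment can happen immediately instead of being deferred.
             */
            static_key_slow_inc(&memcg_kmem_enabled_key);
    }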

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • After the preparation work done in earlier patches, the cgroup_lock can
    be trivially replaced with a memcg-specific lock. This is an automatic
    translation at every site where the values involved were queried.

    The sites where values are written, however, used to run naturally under
    cgroup_lock. This is the case, for instance, in the css_online
    callback. For those, we now need to take the memcg lock explicitly.

    With this, all the calls to cgroup_lock outside cgroup core are gone.
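
    A rough sketch of the shape of the change in a write-side handler (the
    lock name memcg_mutex and the memcg_has_children() helper are illustrative
    stand-ins for what the patch introduces):

    static DEFINE_MUTEX(memcg_mutex);   /* hypothetical name for the memcg-specific lock */

    static int mem_cgroup_hierarchy_write(struct cgroup *cont,
                                          struct cftype *cft, u64 val)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
            int ret = 0;

            mutex_lock(&memcg_mutex);       /* was: cgroup_lock() */
            if (memcg_has_children(memcg))  /* writes still require "no children" */
                    ret = -EBUSY;
            else
                    memcg->use_hierarchy = val;
            mutex_unlock(&memcg_mutex);     /* was: cgroup_unlock() */
            return ret;
    }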

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Currently, we use cgroups' provided list of children to verify if it is
    safe to proceed with any value change that is dependent on the cgroup
    being empty.

    This is less than ideal, because it enforces a dependency over cgroup
    core that we would be better off without. The solution proposed here is
    to iterate over the child cgroups and if any is found that is already
    online, we bounce and return: we don't really care how many children we
    have, only if we have any.

    This is also made to be hierarchy aware. IOW, cgroups with hierarchy
    disabled, while they still exist, will be considered for the purpose of
    this interface as having no children.
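
    A sketch of such a check, assuming the cgroup child iterator available at
    the time (the helper names are illustrative):

    static inline bool __memcg_has_children(struct mem_cgroup *memcg)
    {
            struct cgroup *pos;

            /*
             * Caller holds the memcg-specific lock introduced earlier, which
             * keeps the child list stable. Bounce at the first child found;
             * we only care whether there is any.
             */
            cgroup_for_each_child(pos, memcg->css.cgroup)
                    return true;
            return false;
    }

    static inline bool memcg_has_children(struct mem_cgroup *memcg)
    {
            /* with use_hierarchy disabled, children are invisible to this interface */
            return memcg->use_hierarchy && __memcg_has_children(memcg);
    }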

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This patch is a preparatory work for later locking rework to get rid of
    big cgroup lock from memory controller code.

    The memory controller uses some tunables to adjust its operation. Those
    tunables are inherited from parent to children upon child
    initialization. For most of them, the value cannot be changed once the
    parent has any children.

    cgroup core splits initialization in two phases: css_alloc and css_online.
    After css_alloc, the memory allocation and basic initialization are done.
    But the new group is not yet visible anywhere, not even for cgroup core
    code. It is only somewhere between css_alloc and css_online that it is
    inserted into the internal children lists. Copying tunable values in
    css_alloc will lead to inconsistent values: the children will copy the old
    parent values, which can change between the copy and the moment in which
    the group is linked to any data structure that can indicate the presence
    of children.

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • In memcg, we use the cgroup_lock basically to synchronize against
    attaching new children to a cgroup. We do this because we rely on
    cgroup core to provide us with this information.

    We need to guarantee that upon child creation, our tunables are
    consistent. For those, the calls to cgroup_lock() all live in handlers
    like mem_cgroup_hierarchy_write(), where we change a tunable in the
    group that is hierarchy-related. For instance, the use_hierarchy flag
    cannot be changed if the cgroup already has children.

    Furthermore, those values are propagated from the parent to the child
    when a new child is created. So if we don't lock like this, we can end
    up with the following situation:

    A                                     B
    memcg_css_alloc()                     mem_cgroup_hierarchy_write()
      copy use hierarchy from parent        change use hierarchy in parent
    finish creation.

    This is mainly because during creation we are still not fully connected
    to the css tree, so all the iterators and such that we could use will
    fail to show that the group has children.

    My observation is that all of creation can proceed in parallel with
    those tasks, except value assignment. So what this patch series does is
    to first move all value assignment that is dependent on parent values
    from css_alloc to css_online, where the iterators all work, and then we
    lock only the value assignment. This will guarantee that parent and
    children always have consistent values. Together with an online test,
    which can be derived from the observation that the refcount of an online
    memcg can be made to be always positive, we should be able to
    synchronize our side without the cgroup lock.

    This patch:

    Currently, we rely on the cgroup_lock() to prevent changes to
    move_charge_at_immigrate during task migration. However, this is only
    needed because the current strategy keeps checking this value throughout
    the whole process. Since all we need is serialization, one needs only
    to guarantee that whatever decision we made in the beginning of a
    specific migration is respected throughout the process.

    We can achieve this by just saving it in mc. By doing this, no kind of
    locking is needed.
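
    The shape of the fix, sketched (the field name in the move-charge context
    "mc" is an approximation, not necessarily the exact name in the patch):

    /* global move-charge state used during a single task migration */
    static struct move_charge_struct {
            /* ... existing fields ... */
            unsigned long immigrate_flags;  /* snapshot taken when migration starts */
    } mc;

    static int mem_cgroup_can_attach(struct cgroup *cgroup,
                                     struct cgroup_taskset *tset)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);

            /*
             * Snapshot the tunable once; later stages consult
             * mc.immigrate_flags instead of re-reading
             * memcg->move_charge_at_immigrate, so no cgroup_lock is needed
             * to keep the value stable mid-migration.
             */
            mc.immigrate_flags = memcg->move_charge_at_immigrate;
            /* ... rest of can_attach ... */
            return 0;
    }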

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • In order to maintain all the memcg bookkeeping, we need per-node
    descriptors, which will in turn contain a per-zone descriptor.

    Because we want to statically allocate those, this array ends up being
    very big. Part of the reason is that we allocate something large enough
    to hold MAX_NUMNODES, the compile time constant that holds the maximum
    number of nodes we would ever consider.

    However, we can do better in some cases if the firmware helps us. This
    is true for modern x86 machines; coincidentally, one of the architectures
    in which MAX_NUMNODES tends to be very big.

    By using the firmware-provided maximum number of nodes instead of
    MAX_NUMNODES, we can reduce the memory footprint of struct memcg
    considerably. In the extreme case in which we have only one node, this
    reduces the size of the structure from ~64k to ~2k. This is
    particularly important because it means that we will no longer resort to
    the vmalloc area for the struct memcg on defconfigs. We also have
    enough room for an extra node and still stay outside the vmalloc area.

    One also has to keep in mind that with the industry's ability to fit
    more processors in a die as fast as the FED prints money, a nodes = 2
    configuration is already respectably big.
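
    A sketch of the sizing trick (member and helper names are illustrative;
    nr_node_ids is the firmware-derived possible-node count):

    struct mem_cgroup {
            /* ... fixed-size members ... */
            /*
             * Sized at runtime from the firmware-reported node count rather
             * than MAX_NUMNODES; must remain the last member.
             */
            struct mem_cgroup_per_node *nodeinfo[0];
    };

    static size_t memcg_size(void)
    {
            return sizeof(struct mem_cgroup) +
                    nr_node_ids * sizeof(struct mem_cgroup_per_node *);
    }

    /* allocations then use memcg_size() instead of a MAX_NUMNODES-sized array */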

    [akpm@linux-foundation.org: add check for invalid nid, remove inline]
    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Greg Thelen
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Answering the question "how much space remains in the page->flags" is
    time-consuming. mminit_loglevel can help answer the question but it
    does not take last_nid information into account. This patch corrects it
    and, while there, also corrects the messages related to page flag usage,
    pgshifts and node/zone id. When applied, the relevant output looks
    something like this, though it will depend on the kernel configuration:

    mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
    mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
    mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
    mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
    mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Andrew Morton pointed out that page_xchg_last_nid() and
    reset_page_last_nid() were "getting nuttily large" and asked that it be
    investigated.

    reset_page_last_nid() is on the page free path and it would be
    unfortunate to make that path more expensive than it needs to be. Due
    to the internal use of page_xchg_last_nid() it is already too expensive
    but fortunately, it should also be impossible for the page->flags to be
    updated in parallel when we call reset_page_last_nid(). Instead of
    uninlining the function, it uses a simpler implementation that assumes no
    parallel updates and should now be sufficiently short for inlining.

    page_xchg_last_nid() is called in paths that are already quite expensive
    (splitting huge page, fault handling, migration) and it is reasonable to
    uninline. There was not really a good place to place the function but
    mm/mmzone.c was the closest fit IMO.

    This patch saved 128 bytes of text in the vmlinux file for the kernel
    configuration I used for testing automatic NUMA balancing.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg swap accounting is currently enabled by enable_swap_cgroup when
    the root cgroup is created. mem_cgroup_init acts as a memcg subsystem
    initializer which sounds like a much better place for enable_swap_cgroup
    as well. We already register memsw files from there so it makes a lot
    of sense to merge those two into a single enable_swap_cgroup function.

    This patch doesn't introduce any semantic changes.

    Signed-off-by: Michal Hocko
    Cc: Zhouping Liu
    Cc: Kamezawa Hiroyuki
    Cc: David Rientjes
    Cc: Li Zefan
    Cc: CAI Qian
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Zhouping Liu has reported that memsw files are exported even though swap
    accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior
    has been introduced by commit af36f906c0f4 ("memcg: always create memsw
    files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
    the file to return EOPNOTSUPP. Although EOPNOTSUPP should make it clear
    that memsw operations are not supported in the given configuration, it is
    fair to say that this behavior could be quite confusing.

    Let's tear memsw files out of default cgroup files and add them only if
    the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
    swapaccount=1 boot parameter). We can hook into mem_cgroup_init which
    is called when the memcg subsystem is initialized and which happens
    after boot command line is processed.

    Signed-off-by: Michal Hocko
    Reported-by: Zhouping Liu
    Tested-by: Zhouping Liu
    Cc: Kamezawa Hiroyuki
    Cc: David Rientjes
    Cc: Li Zefan
    Cc: CAI Qian
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When calculating amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.
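
    A sketch of the adjustment in the dirtyable-memory calculation (the
    surrounding function is abridged; the key step is converting
    min_free_kbytes from kilobytes to pages before removing it from the pool):

    static unsigned long global_dirtyable_memory(void)
    {
            unsigned long x;

            x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
            x -= min(x, dirty_balance_reserve);

            /*
             * min_free_kbytes is kept for the allocator, not for dirty
             * pages: convert it from kilobytes to pages and take it out.
             */
            x -= min(x, (unsigned long)(min_free_kbytes >> (PAGE_SHIFT - 10)));

            if (!vm_highmem_is_dirtyable)
                    x -= min(x, highmem_dirtyable_memory(x));

            return x + 1;   /* make sure it is never zero */
    }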

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     
  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renamed only anon_vma_lock(); this one renames
    anon_vma_unlock() to anon_vma_unlock_write() as well.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it is
    even slightly slower than swapping to 2 such SSDs). The main contention
    comes from swap_info_get(). This patch tries to close the gap by adding a
    new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory, it's possible that get_swap_page() finds no swap pages while
    free swap pages actually exist, but that doesn't sound like a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading the
    flags is OK with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always take
    the former first to avoid deadlock.

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity get_swap_page() still holds swap_lock. But in practice,
    swap_lock isn't heavily contended in my test with this patch (or I can
    say there are other, much heavier bottlenecks like TLB flush). And BTW,
    it looks like get_swap_page() doesn't really need the lock: we never free
    swap_info[] and we check the SWP_WRITEOK flag. The only risk without the
    lock is that we could swap out to some low-priority swap device, but we
    can quickly recover after several rounds of swapping, so it doesn't sound
    like a big deal to me. But I'd prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.
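
    A sketch of the resulting locking rule (the wrapper function is
    hypothetical and abridged; swap_info_struct gains its own spinlock, and
    the global swap_lock is always taken first when both are needed):

    /* illustrative only: how a flag update on partition "type" takes the locks */
    static void mark_swap_writeok(int type)
    {
            struct swap_info_struct *p = swap_info[type];

            spin_lock(&swap_lock);          /* global: swap_list, priorities, ... */
            spin_lock(&p->lock);            /* per-partition: scan_swap_map() data */

            p->flags |= SWP_WRITEOK;        /* changing flags needs both locks held */

            spin_unlock(&p->lock);
            spin_unlock(&swap_lock);        /* readers of p->flags may hold either lock */
    }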

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the array.

    In my test with 3 SSDs, this increases swapout throughput by 20%.
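
    A sketch of the mapping from swap entry to its address_space (close to
    what the patch introduces; the lookup helper is illustrative):

    /* one address_space (and thus one radix tree + tree_lock) per swap partition */
    extern struct address_space swapper_spaces[MAX_SWAPFILES];

    #define swap_address_space(entry) \
            (&swapper_spaces[swp_type(entry)])

    static struct address_space *lookup_swap_mapping(swp_entry_t entry)
    {
            /* the swap entry type indexes the per-partition array */
            return swap_address_space(entry);
    }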

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • According to akpm, this saves 1/2k text and makes things simple for the
    next patch.

    Numbers from Minchan:

    add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
    function old new delta
    page_mapping - 48 +48
    do_task_stat 2292 2308 +16
    page_remove_rmap 240 248 +8
    load_elf_binary 4500 4508 +8
    update_queue 532 536 +4
    scsi_probe_and_add_lun 2892 2896 +4
    lookup_fast 644 648 +4
    vcs_read 1040 1036 -4
    __ip_route_output_key 1904 1900 -4
    ip_route_input_noref 2508 2500 -8
    shmem_file_aio_read 784 772 -12
    __isolate_lru_page 272 256 -16
    shmem_replace_page 708 688 -20
    mark_buffer_dirty 228 208 -20
    __set_page_dirty_buffers 240 220 -20
    __remove_mapping 276 256 -20
    update_mmu_cache 500 476 -24
    set_page_dirty_balance 92 68 -24
    set_page_dirty 172 148 -24
    page_evictable 88 64 -24
    page_cache_pipe_buf_steal 248 224 -24
    clear_page_dirty_for_io 340 316 -24
    test_set_page_writeback 400 372 -28
    test_clear_page_writeback 516 488 -28
    invalidate_inode_page 156 128 -28
    page_mkclean 432 400 -32
    flush_dcache_page 360 328 -32
    __set_page_dirty_nobuffers 324 280 -44
    shrink_page_list 2412 2356 -56

    Signed-off-by: Shaohua Li
    Suggested-by: Andrew Morton
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
    THP before migrating") Hugh Dickins noted that the control flow for
    transhuge migration was difficult to follow. Unconditionally calling
    put_page() in numamigrate_isolate_page() made the failure paths of both
    migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
    complex than they should be. Further, he was extremely wary of an
    unlock_page() ever happening after a put_page(), even if the
    put_page() should never be the final put_page.

    Hugh implemented the following cleanup to simplify the path by calling
    putback_lru_page() inside numamigrate_isolate_page() if it failed to
    isolate and always calling unlock_page() within
    migrate_misplaced_transhuge_page().

    There is no functional change after this patch is applied but the code
    is easier to follow and unlock_page() always happens before put_page().

    [mgorman@suse.de: changelog only]
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This is a preparation patch for moving page->_last_nid into page->flags
    that moves page flag layout information to a separate header. This
    patch is necessary because otherwise there would be a circular
    dependency between mm_types.h and mm.h.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The current definition of count_vm_numa_events() is wrong for
    !CONFIG_NUMA_BALANCING, as the following would miss the side-effect:

    count_vm_numa_events(NUMA_FOO, bar++);

    There are no such users of count_vm_numa_events() but this patch fixes
    it as it is a potential pitfall. Ideally both would be converted to
    static inline but NUMA_PTE_UPDATES is not defined if
    !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
    static inline would be similarly clumsy.
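
    A sketch of the pitfall and the fix for the !CONFIG_NUMA_BALANCING stub
    (the exact macro body in the patch may differ slightly):

    #ifdef CONFIG_NUMA_BALANCING
    #define count_vm_numa_events(x, y) count_vm_events(x, y)
    #else
    /* old stub swallowed "y", so count_vm_numa_events(NUMA_FOO, bar++)
     * silently dropped the bar++ side-effect:
     *
     *      #define count_vm_numa_events(x, y) do {} while (0)
     *
     * fixed stub: still counts nothing, but evaluates "y" exactly once */
    #define count_vm_numa_events(x, y) do { (void)(y); } while (0)
    #endif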

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
    one base page is being migrated when in fact it can also be checking
    THP.

    The consequence is that a migration may be attempted when a target
    node is nearly full, only to fail later. It's unlikely to be user-visible
    but it should be fixed. While we are there, migrate_balanced_pgdat()
    should treat nr_migrate_pages as an unsigned long as it is treated as a
    watermark.

    Signed-off-by: Mel Gorman
    Suggested-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/me/be/ and clarify the comment a bit when we're changing it anyway.

    Signed-off-by: Mel Gorman
    Suggested-by: Simon Jeons
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If a storage interface or a USB network interface (the iSCSI case) exists
    in the current configuration, memory allocation with GFP_KERNEL during
    usb_device_reset() might trigger I/O on the storage interface itself and
    cause a deadlock: 'us->dev_mutex' is held in .pre_reset(), so the storage
    interface can't do I/O while the reset is triggered by another interface,
    and the error handling can't be completed if the reset is triggered by
    the storage interface itself (error handling path).

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Reviewed-by: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
    force I/O-less memory allocation during the runtime_resume/runtime_suspend
    callbacks of devices with the 'memalloc_noio' flag set.
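
    A sketch of how the callback invocation is wrapped (abridged from the
    runtime PM core; the helper names match the series, the surrounding code
    is simplified):

    static int rpm_callback(int (*cb)(struct device *), struct device *dev)
    {
            int retval;

            if (dev->power.memalloc_noio) {
                    unsigned int noio_flag;

                    /*
                     * Force GFP_NOIO behaviour for any allocation done
                     * inside the runtime_suspend/runtime_resume callback.
                     */
                    noio_flag = memalloc_noio_save();
                    retval = cb(dev);
                    memalloc_noio_restore(noio_flag);
            } else {
                    retval = cb(dev);
            }
            return retval;
    }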

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • A deadlock might be caused by allocating memory with GFP_KERNEL in the
    runtime_resume and runtime_suspend callbacks of network devices in the
    iSCSI situation, so mark network devices and their ancestors as
    'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio().
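
    A minimal usage sketch (the wrapper is hypothetical; exactly where the
    call sits in the netdev registration/unregistration paths is simplified):

    static void mark_netdev_noio(struct net_device *ndev, bool enable)
    {
            /*
             * Flag the device and its ancestors; their runtime PM callbacks
             * will then run with I/O-less memory allocation enforced.
             */
            pm_runtime_set_memalloc_noio(&ndev->dev, enable);
    }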

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced pm_runtime_set_memalloc_noio() to block devices so
    that the PM core will teach mm not to allocate memory with GFP_IOFS when
    calling the runtime_resume and runtime_suspend callbacks of block
    devices and their ancestors.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
    core teach mm not to allocate memory with the GFP_KERNEL flag, in order
    to avoid a probable deadlock.

    As explained in the comment, any GFP_KERNEL allocation inside
    runtime_resume() or runtime_suspend() on any one of the devices in the
    path from a block or network device to the root device in the device tree
    may cause a deadlock; the introduced pm_runtime_set_memalloc_noio() sets
    or clears the flag on the devices in that path recursively.
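
    A sketch of the flag and of the propagation idea (simplified: the real
    helper also avoids clearing the flag on an ancestor that another
    block/network descendant still needs; that refinement is omitted here):

    /* new bit in struct dev_pm_info (illustrative):
     *      unsigned int    memalloc_noio:1;    force GFP_NOIO in runtime PM callbacks
     */

    /* simplified propagation: walk from the device up to the root */
    void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
    {
            while (dev) {
                    spin_lock_irq(&dev->power.lock);
                    dev->power.memalloc_noio = enable;
                    spin_unlock_irq(&dev->power.lock);
                    dev = dev->parent;
            }
    }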

    Signed-off-by: Ming Lei
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • This patch introduces the PF_MEMALLOC_NOIO process flag (in the 'flags'
    field of 'struct task_struct'), so that the flag can be set by a task to
    avoid doing I/O inside memory allocation in that task's context.

    The patch tries to solve a deadlock problem caused by block devices; the
    problem may happen at least in the situations below:

    - during block device runtime resume: if memory allocation with
    GFP_KERNEL is called inside the runtime resume callback of any one of its
    ancestors (or the block device itself), the deadlock may be triggered
    inside the memory allocation, since it might not complete until the block
    device becomes active and the involved page I/O finishes. The situation
    was first pointed out by Alan Stern. It is not a good approach to
    convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
    subsystems may be involved (for example, PCI, USB and SCSI may be
    involved for a USB mass storage device, and network devices are involved
    too in the iSCSI case).

    - during block device runtime suspend, because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a USB mass storage device, a USB bus reset
    will be put on the device, so there shouldn't be any memory allocation
    with GFP_KERNEL during the USB bus reset, otherwise a deadlock similar to
    the above may be triggered. Unfortunately, any USB device may include a
    mass storage interface in theory, so it would require all USB interface
    drivers to handle the situation. In fact, most USB drivers don't know
    how to handle a bus reset on the device and don't provide .pre_reset()
    and .post_reset() callbacks at all, so the USB core has to unbind and
    rebind the driver for these devices. So it is still not practical to
    resort to GFP_NOIO for solving the problem.

    The introduced solution can also be used by the block subsystem or block
    drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing an
    actual I/O transfer.

    It is not a good idea to convert all these GFP_KERNEL allocations in the
    affected path into GFP_NOIO, because the functions doing the allocations
    may be implemented as library code called in many other contexts.

    In fact, memalloc_noio_flags() can convert some of the current static
    GFP_NOIO allocations back into GFP_KERNEL in other, non-affected
    contexts: at least almost all GFP_NOIO in the USB subsystem can be
    converted into GFP_KERNEL after applying this approach, making GFP_NOIO
    allocation generally happen only in runtime resume/bus reset/block I/O
    transfer contexts.

    [1], several GFP_KERNEL allocation examples in runtime resume path

    - pci subsystem
    acpi_os_allocate
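
    The core of the mechanism, sketched (close to the helpers the series adds
    to sched.h; treat details as illustrative):

    /* strip __GFP_IO/__GFP_FS when the task asked for I/O-less allocations */
    static inline gfp_t memalloc_noio_flags(gfp_t flags)
    {
            if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                    flags &= ~(__GFP_IO | __GFP_FS);
            return flags;
    }

    static inline unsigned int memalloc_noio_save(void)
    {
            unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

            current->flags |= PF_MEMALLOC_NOIO;
            return flags;
    }

    static inline void memalloc_noio_restore(unsigned int flags)
    {
            current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
    }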

    Signed-off-by: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for the throttling and
    additional heuristics in balance_pgdat(). So, let's remove them and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • num_poisoned_pages counts the number of pages isolated by memory
    errors. But for THP, only one subpage is isolated because the memory
    error handler splits it, so it's wrong to add (1 << compound_trans_order).

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently soft_offline_page() is hard to maintain because it has many
    return points and goto statements. All of this mess comes from
    get_any_page().

    This function should only get the page refcount, as the name implies, but
    it does some page-isolating actions like SetPageHWPoison() and dequeuing
    a hugepage. This patch corrects it and introduces some internal
    subroutines to make the soft offlining code more readable and maintainable.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • There are too many return points randomly intermingled with some "goto
    done" return points. So adjust the function structure, one for the
    success path, the other for the failure path. Use atomic_long_inc
    instead of atomic_long_add.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • When doing

    $ echo paddr > /sys/devices/system/memory/soft_offline_page

    to offline a *free* page, the value of mce_bad_pages is incremented and
    the HWPoison flag is set on the page, but the page is still managed by
    the buddy allocator.

    $ cat /proc/meminfo | grep HardwareCorrupted

    shows the value.

    If we offline the same page again, the value of mce_bad_pages is
    incremented *again*, which means the value is now incorrect. Assume the
    page is still free during this short time:

    soft_offline_page()
        get_any_page()
            "else if (is_free_buddy_page(p))" branch return 0
        "goto done";
        "atomic_long_add(1, &mce_bad_pages);"

    This patch:

    Move the poisoned-page check to the beginning of the function in order
    to fix the error.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Tested-by: Naoya Horiguchi
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Several functions test MIGRATE_ISOLATE and some of them are on hot paths,
    but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION
    (i.e. CMA, memory-hotplug and memory-failure), which is not a common
    config option. So let's not add unnecessary overhead and code when we
    don't enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Function put_page_bootmem() is used to free pages allocated by bootmem
    allocator, so it should increase totalram_pages when freeing pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now all users of "number of pages managed by the buddy system" have been
    converted to use zone->managed_pages, so set zone->present_pages to what
    it should be:

    present_pages = spanned_pages - absent_pages;
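
    For reference, the relationship between the three zone page counters in
    the terms this series uses (a summary, not a quote from the patch):

    /*
     * spanned_pages = zone_end_pfn - zone_start_pfn   (range incl. holes)
     * present_pages = spanned_pages - absent_pages    (physical pages)
     * managed_pages = present_pages - reserved_pages  (pages handed to the
     *                                                  buddy allocator)
     */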

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages if
    what the user really wants is number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • …emblock_overlaps_region().

    The definition of struct movablecore_map is protected by
    CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
    is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
    movablecore_map in memblock_overlaps_region().

    Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tang Chen
     
  • We now provide an option for users who don't want to specify physical
    memory addresses on the kernel command line.

    /*
     * For movablemem_map=acpi:
     *
     * SRAT:            |_____| |_____| |_________| |_________| ......
     * node id:            0       1        1            2
     * hotpluggable:       n       y        y            n
     * movablemem_map:             |_____| |_________|
     *
     * Using movablemem_map, we can prevent memblock from allocating memory
     * on ZONE_MOVABLE at boot time.
     */

    So the user just specifies movablemem_map=acpi, and the kernel will use
    the hotpluggable info in SRAT to determine which memory ranges should be
    set as ZONE_MOVABLE.

    If all the memory ranges in SRAT are hotpluggable, then no memory can be
    used by the kernel. But before parsing SRAT, memblock has already
    reserved some memory ranges for other purposes, such as the kernel image.
    We cannot prevent the kernel from using this memory, so we need to
    exclude these ranges even if the memory is hotpluggable.

    Furthermore, there could be several memory ranges in the single node
    which the kernel resides in. We may skip one range that has memory
    reserved by memblock, but if the rest of the memory is too small, then
    the kernel will fail to boot. So, make the whole node which the kernel
    resides in un-hotpluggable; then the kernel has enough memory to use.

    NOTE: Using this option will degrade NUMA performance, because the
    whole node will be set as ZONE_MOVABLE and the kernel cannot use memory
    on it. If users don't want to lose NUMA performance, they should just
    not use it.

    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: use strcmp()]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen