09 Aug, 2013

12 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, the
    cgroup API, including the iterators, has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments, making this a relatively acceptable opportunity for
    this type of change.

    The conversions are mostly straightforward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, an "if (pos == origin) continue;" check is
    added. Some conversions gain extra reference get/put operations
    around origin handling as a result of consolidating origin handling
    with the rest. While the extra ref operations aren't strictly
    necessary, this shouldn't cause any noticeable difference. A sketch
    of the resulting pattern follows.
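
    As a rough illustration, the new pattern might look like the sketch
    below; css_for_each_descendant_pre() is the iterator from this
    series, while @root_css and do_something() are placeholders.

        struct cgroup_subsys_state *pos;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, root_css) {
                /* the origin is visited too now; skip it explicitly
                 * where the old behavior is still wanted */
                if (pos == root_css)
                        continue;
                do_something(pos);
        }
        rcu_read_unlock();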

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically, but it also helps clean up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cftype->[un]register_event() is among the remaining couple of interfaces
    which still use struct cgroup. Convert it to cgroup_subsys_state.
    The conversion is mostly mechanical and removes the last users of
    mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed.

    v2: indentation update as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    This patch converts task iterators to deal with css instead of cgroup.
    Note that under the unified hierarchy, different sets of tasks will
    be considered to belong to a given cgroup depending on the subsystem
    in question, and making the iterators deal with css instead of
    cgroup provides them with enough information about the iteration.

    While at it, fix several function comment formats in cpuset.c.

    This patch doesn't introduce any behavior differences.
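
    As a hedged sketch of the resulting interface (names as used after
    this series: css_task_iter_start/next/end; process_task() is a
    placeholder), iterating the tasks of a css looks like:

        struct css_task_iter it;
        struct task_struct *task;

        css_task_iter_start(css, &it);
        while ((task = css_task_iter_next(&it)))
                process_task(task);
        css_task_iter_end(&it);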

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley

    Tejun Heo
     
  • Currently all cgroup_task_iter functions require @cgrp to be passed
    in, which is superfluous and increases the chance of usage error.
    Make cgroup_task_iter remember the cgroup being iterated and drop
    the @cgrp argument from the next and end functions.

    This patch doesn't introduce any behavior differences.
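
    Illustratively (a sketch of the call-site change, not verbatim
    kernel code; process_task() is a placeholder):

        /* before: @cgrp had to be passed at every step */
        cgroup_task_iter_start(cgrp, &it);
        while ((task = cgroup_task_iter_next(cgrp, &it)))
                process_task(task);
        cgroup_task_iter_end(cgrp, &it);

        /* after: the iterator remembers the cgroup */
        cgroup_task_iter_start(cgrp, &it);
        while ((task = cgroup_task_iter_next(&it)))
                process_task(task);
        cgroup_task_iter_end(&it);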

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup now has multiple iterators and it's quite confusing to have
    something which walks over tasks of a single cgroup named cgroup_iter.
    Let's rename it to cgroup_task_iter.

    While at it, reformat / update comments and replace the overview
    comment above the interface function decls with proper function
    comments. Such an overview can be useful, but function comments
    should be more than enough here.

    This is pure rename and doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question, as css is the intersection between a cgroup
    and a subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically, independently of the cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straightforward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.
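
    A hedged before/after sketch of the iterator change (@ss_id and
    do_something() are placeholders, locking omitted):

        /* before: walk cgroups, then look up the subsystem state */
        struct cgroup *pos;

        cgroup_for_each_descendant_pre(pos, cgrp)
                do_something(cgroup_css(pos, ss_id));

        /* after: walk the subsystem's css's directly */
        struct cgroup_subsys_state *pos;

        css_for_each_descendant_pre(pos, css)
                do_something(pos);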

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones; a sketch of the resulting handler signature
    follows the list below.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.
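
    For illustration, a read_u64 style handler's signature changes
    roughly as below (foo_read_u64() is a made-up name):

        /* before */
        static u64 foo_read_u64(struct cgroup *cgrp, struct cftype *cft);

        /* after */
        static u64 foo_read_u64(struct cgroup_subsys_state *css,
                                struct cftype *cft);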

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, subsystems only care about their part in
    the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straightforward. A few
    noteworthy changes are

    * ->css_alloc() now takes the css of the parent cgroup rather than a
    pointer to the new cgroup, as the css for the new cgroup doesn't
    exist yet (see the sketch after this list). Knowing the parent css
    is enough for all the existing subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.
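
    Sketch of the ->css_alloc() change mentioned above (foo_ is a
    made-up subsystem name):

        /* before */
        static struct cgroup_subsys_state *
        foo_css_alloc(struct cgroup *cgrp);

        /* after: the new cgroup's css doesn't exist yet, so the
         * parent's css is passed in instead */
        static struct cgroup_subsys_state *
        foo_css_alloc(struct cgroup_subsys_state *parent_css);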

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL
    parent css as long as the target css is not at the top of the
    hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced
    with parent_ca(). The only difference between the two was the NULL
    test on cgroup->parent, which is now embedded in css_parent(),
    making the distinction moot. Note that eventually a css->parent
    field will be added to css and the NULL check in css_parent() will
    go away.

    This patch shouldn't cause any behavior differences.
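
    A minimal sketch of the resulting pattern, using a freezer-style
    accessor (css_freezer() per an earlier patch in this series):

        static struct freezer *parent_freezer(struct freezer *freezer)
        {
                return css_freezer(css_parent(&freezer->css));
        }

    css_parent() returns NULL for the root css, and the accessor passes
    the NULL through, so no separate root check is needed.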

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to
    cast from css to such a data structure or have an accessor function
    wrapping such a cast. As cgroup as a whole is moving towards using
    css as the main interface handle, add and update such accessors to
    ease dealing with css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controller-specific data structures have css as the first field,
    the cast doesn't involve any offsetting and the compiler can
    trivially optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    an accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.
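
    The accessor pattern is roughly the following (memcg shown as an
    example; the exact function bodies may differ):

        static inline struct mem_cgroup *
        mem_cgroup_from_css(struct cgroup_subsys_state *css)
        {
                /* css is the first field, so this is a plain cast
                 * and the NULL branch trivially optimizes away */
                return css ? container_of(css, struct mem_cgroup, css)
                           : NULL;
        }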

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    hugetlb_cgroup functions pass around struct hugetlb_cgroup instead of
    struct cgroup.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp a large portion of the cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

11 Jul, 2013

3 commits

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • zswap is a thin backend for frontswap that takes pages that are in the
    process of being swapped out and attempts to compress them and store
    them in a RAM-based memory pool. This can result in a significant I/O
    reduction on the swap device and, in the case where decompressing from
    RAM is faster than reading from the swap device, can also improve
    workload performance.

    It also has support for evicting swap pages that are currently
    compressed in zswap to the swap device on an LRU(ish) basis. This
    functionality makes zswap a true cache in that, once the cache is full,
    the oldest pages can be moved out of zswap to the swap device so newer
    pages can be compressed and stored in zswap.

    This patch adds the zswap driver to mm/
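
    For context, a frontswap backend hooks in roughly like this (a
    hedged sketch based on the frontswap API of this era):

        static struct frontswap_ops zswap_frontswap_ops = {
                .store = zswap_frontswap_store,
                .load = zswap_frontswap_load,
                .invalidate_page = zswap_frontswap_invalidate_page,
                .invalidate_area = zswap_frontswap_invalidate_area,
                .init = zswap_frontswap_init,
        };

        /* in the module init path */
        frontswap_register_ops(&zswap_frontswap_ops);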

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • zbud is a special purpose allocator for storing compressed pages. It
    is designed to store up to two compressed pages per physical page.
    While this design limits storage density, it has simple and
    deterministic reclaim properties that make it preferable to a higher
    density approach when reclaim will be used.

    zbud works by storing compressed pages, or "zpages", together in pairs
    in a single memory page called a "zbud page". The first buddy is "left
    justifed" at the beginning of the zbud page, and the last buddy is
    "right justified" at the end of the zbud page. The benefit is that if
    either buddy is freed, the freed buddy space, coalesced with whatever
    slack space that existed between the buddies, results in the largest
    possible free region within the zbud page.

    zbud also provides an attractive lower bound on density. The ratio of
    zpages to zbud pages cannot be less than 1. This ensures that zbud can
    never "do harm" by using more pages to store zpages than the
    uncompressed zpages would have used on their own.

    This implementation is a rewrite of the zbud allocator internally used
    by zcache in the drivers/staging tree. The rewrite was necessary to
    remove some of the zcache specific elements that were ingrained
    throughout and provide a generic allocation interface that can later be
    used by zsmalloc and others.

    This patch adds zbud to mm/ for later use by zswap.
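
    A hedged sketch of the generic allocation interface (signatures
    approximate for this version of the code; @ops, @len and
    @compressed_data are placeholders):

        struct zbud_pool *pool;
        unsigned long handle;

        pool = zbud_create_pool(GFP_KERNEL, &ops);
        if (zbud_alloc(pool, len, GFP_KERNEL, &handle) == 0) {
                void *addr = zbud_map(pool, handle);

                memcpy(addr, compressed_data, len);
                zbud_unmap(pool, handle);
        }
        /* later, when the stored zpage is no longer needed */
        zbud_free(pool, handle);
        zbud_destroy_pool(pool);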

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

10 Jul, 2013

25 commits

  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • online_pages() is called from memory_block_action() when a user requests
    to online a memory block via sysfs. This function needs to return a
    proper error value in case of error.

    Signed-off-by: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • min_free_kbytes is currently updated during memory hotplug (by
    init_per_zone_wmark_min), which is the right thing to do in most
    cases, but this could be unexpected if an admin increased the value
    to prevent allocation failures and the new min_free_kbytes would be
    decreased as a result of memory hotadd.

    This patch saves the user-defined value and allows updating
    min_free_kbytes only if it is higher than the saved one.

    A warning is printed when the new value is ignored.
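
    The resulting check is roughly (a sketch; user_min_free_kbytes is
    the saved user-defined value added by the patch):

        if (new_min_free_kbytes > user_min_free_kbytes) {
                min_free_kbytes = new_min_free_kbytes;
        } else {
                pr_warn("min_free_kbytes is not updated to %d because "
                        "user defined value %d is preferred\n",
                        new_min_free_kbytes, user_min_free_kbytes);
        }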

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Acked-by: Zhang Yanfei
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now memcg has the same life cycle as its corresponding cgroup, and a
    cgroup is freed via RCU and then mem_cgroup_css_free() will be called in
    a work function, so we can simply call __mem_cgroup_free() in
    mem_cgroup_css_free().

    This actually reverts commit 59927fb984d ("memcg: free mem_cgroup by RCU
    to fix oops").

    Signed-off-by: Li Zefan
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Now memcg has the same life cycle as its corresponding cgroup. Kill the
    useless refcnt.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The cgroup core guarantees it's always safe to access the parent.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put. A simple replacement
    will do.

    The historical reason that memcg has its own refcnt instead of
    always using css_get/put is that a cgroup couldn't be removed if
    there were still css refs, so css refs couldn't be used as
    long-lived references. The situation has changed: rmdir'ing a
    cgroup now succeeds regardless of css refs, but the cgroup won't be
    freed until the css refcount drops to 0.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put.

    We can't do a simple replacement, because here mem_cgroup_put() is
    called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't
    be called until css refcnt goes down to 0.

    Instead we increment the css refcnt in mem_cgroup_css_offline(), and
    then check if there are still kmem charges. If not, the css refcnt
    will be decremented immediately; otherwise the refcnt will be
    released after the last kmem allocation is uncharged.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Tejun Heo
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().

    There are two things being done in the current code:

    First, we acquire a css ref to make sure that the underlying cgroup
    does not go away. That is a short-lived reference, and it is put as
    soon as the cache is created.

    At this point, we acquire a long-lived per-cache memcg reference count
    to guarantee that the memcg will still be alive.

    so it is:

    enqueue: css_get
    create : memcg_get, css_put
    destroy: memcg_put

    So we only need to get rid of the memcg_get, change the memcg_put to
    css_put, and get rid of the now extra css_put.

    (This changelog is mostly written by Glauber)

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/css_put instead of mem_cgroup_get/put.

    Note, if at the same time someone is moving @current to a different
    cgroup and removing the old cgroup, css_tryget() may return false, and
    sock->sk_cgrp won't be initialized, which is fine.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
    This is not correct because only memcg_propagate_kmem takes an
    additional reference, while mem_cgroup_sockets_init is also allowed
    to fail (although no current implementation fails) but doesn't take
    any reference. This all suggests that it should be
    memcg_propagate_kmem that cleans up after itself, so this patch
    moves mem_cgroup_put over there.

    Unfortunately this is not that easy (as pointed out by Li Zefan),
    because memcg_kmem_mark_dead marks the group dead
    (KMEM_ACCOUNTED_DEAD) if it is marked active (KMEM_ACCOUNTED_ACTIVE),
    which is the case even if memcg_propagate_kmem fails, so the
    additional reference is dropped in that case in kmem_cgroup_destroy,
    which means that the reference would be dropped two times.

    The easiest way then would be to simply remove mem_cgroup_put from
    mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the
    right thing.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.8]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This reverts commit e4715f01be697a.

    mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already
    drops an additional reference from all parents, so the additional
    mem_cgroup_put(parent) potentially causes a use-after-free.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • It is counterintuitive at best that mmap'ing a hugetlbfs file with
    MAP_HUGETLB fails, while mmap'ing it without the flag will a)
    succeed and b) return huge pages.

    v2: use is_file_hugepages(), as suggested by Jianguo
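
    The v2 check amounts to something like the following sketch
    (placement and error handling in the mmap path are assumed):

        /* reject MAP_HUGETLB only when the file isn't a hugetlbfs file */
        if ((flags & MAP_HUGETLB) && !is_file_hugepages(file))
                return -EINVAL;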

    Signed-off-by: Joern Engel
    Cc: Jianguo Wu
    Signed-off-by: Linus Torvalds

    Jörn Engel
     
  • After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
    the scanning priority of kswapd changed.

    The priority now rises until it is scanning enough pages to meet the
    high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number
    of pages were encountered under writeback, but this value is scaled
    based on the priority. As kswapd now frequently scans with a higher
    priority, it is relatively easy to set ZONE_WRITEBACK. This patch
    removes the scaling and treats writeback pages similar to how it
    treats unqueued dirty pages and congested pages. The user-visible
    effect should be that kswapd will write back fewer pages from
    reclaim context.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim is not aborting to allow compaction to go ahead
    properly. do_try_to_free_pages is told to abort reclaim, which it
    happily ignores, and instead keeps raising priority until it reaches
    0 and starts shrinking file/anon equally. This patch corrects the
    situation by aborting reclaim when requested instead of raising
    priority.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Remove one redundant "nid" in the comment.

    Signed-off-by: Tang Chen
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When searching for a vmap area in the vmalloc space, we use (addr +
    size - 1) to check if the value is less than addr, i.e. as an
    overflow check; but we assign (addr + size) to vmap_area->va_end.

    So if we come across the below case:

    (addr + size - 1) : not overflow
    (addr + size) : overflow

    we will assign an overflow value (e.g. 0) to vmap_area->va_end, and
    this will trigger a BUG in __insert_vmap_area, causing a system
    panic.

    So using (addr + size) to check for the overflow is the correct
    behaviour, not (addr + size - 1); see the sketch below.
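
    In code form (a sketch; the overflow label is illustrative):

        /* before: the check itself can overflow and miss exactly the
         * case described above */
        if (addr + size - 1 < addr)
                goto overflow;

        /* after: check the same quantity that is stored in va_end */
        if (addr + size < addr)
                goto overflow;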

    Signed-off-by: Zhang Yanfei
    Reported-by: Ghennadi Procopciuc
    Tested-by: Daniel Baluta
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • These VM_ macros aren't used very often and three of them
    aren't used at all.

    Expand the ones that are used in-place, and remove all the now unused
    #define VM_ macros.

    VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added
    just before 2.4 and appear to have never been used.

    Signed-off-by: Joe Perches
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • With CONFIG_MEMORY_HOTREMOVE unset, there is a compile warning:

    mm/sparse.c:755: warning: `clear_hwpoisoned_pages' defined but not used

    Bisecting ended up pointing to 4edd7ceff ("mm, hotplug: avoid
    compiling memory hotremove functions when disabled").

    This is because the commit above put sparse_remove_one_section()
    within the protection of CONFIG_MEMORY_HOTREMOVE, but
    clear_hwpoisoned_pages(), whose only user is
    sparse_remove_one_section(), is not within that protection.

    So putting clear_hwpoisoned_pages() within CONFIG_MEMORY_HOTREMOVE
    fixes the warning (see the sketch below).
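
    That is, roughly (a sketch of the intended layout):

        #ifdef CONFIG_MEMORY_HOTREMOVE
        static void clear_hwpoisoned_pages(struct page *memmap,
                                           int nr_pages)
        {
                /* ... */
        }

        /* sparse_remove_one_section(), the only caller, already
         * lives inside this same #ifdef block */
        #endif /* CONFIG_MEMORY_HOTREMOVE */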

    Signed-off-by: Zhang Yanfei
    Cc: David Rientjes
    Acked-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is not used anywhere, and its name is easily confused
    with put_page() in mm/swap.c, so it's better to remove it.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • vfree() only needs schedule_work(&p->wq) if p->list was empty;
    otherwise vfree_deferred->wq is already pending, or it is running
    and hasn't done llist_del_all() yet.
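
    In code form, the fast path becomes roughly (a sketch; llist_add()
    returns true only when the list was previously empty):

        struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);

        if (llist_add((struct llist_node *)addr, &p->list))
                schedule_work(&p->wq);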

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
    the order passed. MAX_ORDER is typically 11 and pageblock_order is
    typically 9 on x86. Integer division truncates, so pageblock_order / 2
    is 4. For the first eight iterations, it's guaranteed that
    current_order >= pageblock_order / 2 if it even gets that far!

    So just remove the unlikely(), it's completely bogus.

    Signed-off-by: Zhang Yanfei
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The callers of build_zonelists_node always pass MAX_NR_ZONES -1 as the
    zone_type argument, so we can directly use the value in
    build_zonelists_node and remove the zone_type argument.

    Signed-off-by: Zhang Yanfei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The memory we use to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory
    can only be freed after the last user is gone. Our only way to know
    which is the last user is to hook into freeing time, but the fact
    that we still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa