07 Dec, 2020

1 commit

  • Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in,
    transfer packets) or interact with real hardware. Just initializing
    the EAL layer (which handles hugepage reservation and locking) seems
    to be enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
       - 49.97% worker_thread
          - 52.64% process_one_work
             - 62.08% css_killed_work_fn
                - hugetlb_cgroup_css_offline
                     41.52% _raw_spin_lock
                   - 2.82% _cond_resched
                        rcu_all_qs
                     2.66% PageHuge
                   - 0.57% schedule
                      - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation
    counts when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup for cgroups with reservation
    counts. Unfortunately, the code checking reservation counts was
    incorrectly added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed. The hstate index is not reinitialized each time through the
    do-while loop. Fix this as well.
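
    In sketch form, the fixed logic looks roughly like the following
    (paraphrased from mm/hugetlb_cgroup.c, not a verbatim hunk):

      static bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
      {
              int idx;

              for (idx = 0; idx < hugetlb_max_hstate; idx++) {
                      /* Look at faulted-page usage only; reservation
                       * counters intentionally stay on the zombie. */
                      if (page_counter_read(
                              hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
                              return true;
              }
              return false;
      }

      static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
      {
              struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
              struct hstate *h;
              struct page *page;
              int idx;

              do {
                      idx = 0;        /* the second fix: reset the hstate
                                       * index on every pass */
                      for_each_hstate(h) {
                              spin_lock(&hugetlb_lock);
                              list_for_each_entry(page,
                                              &h->hugepage_activelist, lru)
                                      hugetlb_cgroup_move_parent(idx,
                                                      h_cg, page);
                              spin_unlock(&hugetlb_lock);
                              idx++;
                      }
                      cond_resched();
              } while (hugetlb_cgroup_have_usage(h_cg));
      }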

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

22 Aug, 2020

1 commit

  • Replace a comma between expression statements by a semicolon.
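
    For illustration only (not the actual kernel hunk), a minimal
    standalone example of why the stray comma compiles yet is misleading:

      #include <stdio.h>

      int main(void)
      {
              int a = 0, b = 0;

              /* The comma operator fuses this into one statement; both
               * assignments still happen, so behavior is unchanged here,
               * but the trailing comma is easy to misread and fragile
               * if code is later inserted between the two lines. */
              a = 1,
              b = 2;

              printf("a=%d b=%d\n", a, b);    /* prints a=1 b=2 */
              return 0;
      }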

    Fixes: faced7e0806cf4 ("mm: hugetlb controller for cgroups v2")
    Signed-off-by: Xu Wang
    Signed-off-by: Andrew Morton
    Cc: Tejun Heo
    Cc: Giuseppe Scrivano
    Link: http://lkml.kernel.org/r/20200818064333.21759-1-vulab@iscas.ac.cn
    Signed-off-by: Linus Torvalds

    Xu Wang
     

08 Apr, 2020

1 commit

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
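
    The shape of the conversion, as a standalone sketch; in the kernel,
    fallthrough comes from <linux/compiler_attributes.h>, so the macro
    below is just a stand-in for illustration:

      #include <stdio.h>

      /* Minimal stand-in for the kernel's definition. */
      #define fallthrough __attribute__((__fallthrough__))

      static const char *classify(int n)
      {
              switch (n) {
              case 0:
                      fallthrough;    /* replaces the old fall-through comment */
              case 1:
                      return "small";
              default:
                      return "large";
              }
      }

      int main(void)
      {
              printf("%s\n", classify(0));    /* prints "small" */
              return 0;
      }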

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Apr, 2020

5 commits

  • For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
    in the resv_map entries, in file_region->reservation_counter.

    After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
    if successful, we pass on the hugetlb_cgroup info to a follow up
    region_add call. When a file_region entry is added to the resv_map via
    region_add, we put the pointer to that cgroup in
    file_region->reservation_counter. If charging doesn't succeed, we report
    the error to the caller, so that the kernel fails the reservation.

    On region_del, which is when the hugetlb memory is unreserved, we also
    uncharge the file_region->reservation_counter.
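
    Roughly where that pointer lives (paraphrased from
    include/linux/hugetlb.h; treat the field layout as a sketch):

      struct file_region {
              struct list_head link;
              long from;
              long to;
      #ifdef CONFIG_CGROUP_HUGETLB
              /* For shared mappings: the counter to uncharge when this
               * region is deleted, plus a css reference to keep the
               * cgroup alive until then. */
              struct page_counter *reservation_counter;
              struct cgroup_subsys_state *css;
      #endif
      };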

    [akpm@linux-foundation.org: forward declare struct file_region]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Normally the pointer to the cgroup to uncharge hangs off the struct page,
    and gets queried when it's time to free the page. With hugetlb_cgroup
    reservations, this is not possible, because a page may be reserved by
    one task and actually faulted in by another task.

    The best place to put the hugetlb_cgroup pointer to uncharge for
    reservations is in the resv_map. But, because the resv_map has different
    semantics for private and shared mappings, the code path to
    charge/uncharge shared and private mappings is different. This patch
    implements charging and uncharging for private mappings.

    For private mappings, the counter to uncharge is in
    resv_map->reservation_counter. On initializing the resv_map this is set
    to NULL. On reservation of a region in a private mapping, the task's
    hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
    resv_map->reservation_counter.

    On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
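
    The corresponding resv_map fields, again paraphrased as a sketch:

      struct resv_map {
              struct kref refs;
              spinlock_t lock;
              struct list_head regions;
              long adds_in_progress;
              struct list_head region_cache;
              long region_cache_count;
      #ifdef CONFIG_CGROUP_HUGETLB
              /* For private mappings: NULL at init; set when a region
               * is reserved, uncharged at hugetlb_vm_op_close(). */
              struct page_counter *reservation_counter;
              unsigned long pages_per_hpage;
              struct cgroup_subsys_state *css;
      #endif
      };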

    [akpm@linux-foundation.org: forward declare struct resv_map]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
    hugetlb reservations") mistakingly doesn't handle the migration of *both*
    the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.

    What should happen is that both cgroups should be queried from the old
    page, then both set to NULL on the old page, then both inserted into the
    new page.

    The mistake also creates the following warning:

    mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
    mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
    [-Wunused-but-set-variable]
    struct hugetlb_cgroup *h_cg;
    ^~~~

    The solution is to add the missing steps, namely setting the reservation
    hugetlb_cgroup to NULL on the old page, and setting the fault
    hugetlb_cgroup on the new page.
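
    In sketch form (paraphrased from hugetlb_cgroup_migrate(); the lines
    marked "was missing" are the ones this fix adds):

      spin_lock(&hugetlb_lock);
      h_cg = hugetlb_cgroup_from_page(oldhpage);
      h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
      set_hugetlb_cgroup(oldhpage, NULL);
      set_hugetlb_cgroup_rsvd(oldhpage, NULL);      /* was missing */

      /* Move both cgroup pointers to the new page. */
      set_hugetlb_cgroup(newhpage, h_cg);           /* was missing */
      set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
      spin_unlock(&hugetlb_lock);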

    Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Qian Cai
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
    or hugetlb reservation counter.

    Adds a new interface to uncharge a hugetlb_cgroup counter via
    hugetlb_cgroup_uncharge_counter.

    Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
    hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
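
    A hedged sketch of the interface shape; the helper names follow later
    mainline and the exact merged signatures may differ:

      /* One internal helper picks the counter via an 'rsvd' flag... */
      static int __hugetlb_cgroup_charge_cgroup(int idx,
                                                unsigned long nr_pages,
                                                struct hugetlb_cgroup **ptr,
                                                bool rsvd);

      /* ...with a thin wrapper for the reservation case: */
      int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
                                            struct hugetlb_cgroup **ptr)
      {
              return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, true);
      }

      /* New uncharge entry point for a bare page_counter: */
      void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
                                           unsigned long nr_pages)
      {
              if (hugetlb_cgroup_disabled() || !p)
                      return;
              page_counter_uncharge(p, nr_pages);
      }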

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • These counters will track hugetlb reservations rather than hugetlb memory
    faulted in. This patch only adds the counter; the following patches add
    the charging and uncharging of the counter.

    This is patch 1 of a 9-patch series.

    Problem:

    Currently tasks attempting to reserve more hugetlb memory than is
    available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
    Reservations [1]. However, if a task attempts to reserve more hugetlb
    memory than its hugetlb_cgroup limit allows, the kernel will allow the
    mmap/shmget call, but will SIGBUS the task when it attempts to fault in
    the excess memory.

    We have users hitting their hugetlb_cgroup limits and thus we've been
    looking at this failure mode. We'd like to improve this behavior such
    that users violating the hugetlb_cgroup limits get an error at mmap/shmget
    time, rather than getting SIGBUS'd when they try to fault the excess
    memory in. This gives the user an opportunity to fallback more gracefully
    to non-hugetlbfs memory for example.

    The underlying problem is that today's hugetlb_cgroup accounting happens
    at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
    enforcing the hugetlb_cgroup limit only happens at fault time, and the
    offending task gets SIGBUS'd.

    Proposed Solution:

    A new page counter named
    'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
    slightly different semantics than
    'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

    - While usage_in_bytes tracks all *faulted* hugetlb memory,
    rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
    memory faulted in without a prior reservation.

    - If a task attempts to reserve more memory than limit_in_bytes allows,
    the kernel will allow it to do so. But if a task attempts to reserve
    more memory than rsvd.limit_in_bytes, the kernel will fail this
    reservation.

    This proposal is implemented in this patch series, with tests to verify
    functionality and show the usage.
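
    To make the change concrete, a hedged userspace sketch (the mapping
    size and the 2MB limit file named in the comment are illustrative):

      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
              size_t len = 4UL << 20;         /* e.g. two 2MB hugepages */
              void *p;

              p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
              if (p == MAP_FAILED) {
                      /* With hugetlb.2MB.rsvd.limit_in_bytes set below
                       * 'len', the failure surfaces here, at mmap time. */
                      perror("mmap");
                      return 1;
              }

              ((char *)p)[0] = 1;     /* under the old fault-time accounting,
                                       * this is where SIGBUS would hit */
              munmap(p, len);
              return 0;
      }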

    Alternatives considered:

    1. A new cgroup, instead of only a new page_counter attached to the
    existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
    duplication with hugetlb_cgroup. Keeping hugetlb related page counters
    under hugetlb_cgroup seemed cleaner as well.

    2. Instead of adding a new counter, we considered adding a sysctl that
    modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
    accounting at reservation time rather than fault time. Adding a new
    page_counter seems better as userspace could, if it wants, choose to
    enforce different cgroups differently: one via limit_in_bytes, and
    another via rsvd.limit_in_bytes. This could be very useful if you're
    transitioning how hugetlb memory is partitioned on your system one
    cgroup at a time, for example. Also, someone may find usage for both
    limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
    gives them the option to do so.

    Testing:
    - Added tests passing.
    - Used libhugetlbfs for regression testing.

    [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Shuah Khan
    Cc: Shakeel Butt
    Cc: Greg Thelen
    Cc: Sandipan Das
    Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     

30 Mar, 2020

1 commit

  • This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb
    controller for cgroups v2").

    Essentially that commit does a hugetlb_cgroup_from_counter assuming that
    page_counter_try_charge has initialized counter.

    But if the charge has failed, it will not have initialized counter, so
    hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
    causing kasan to complain.

    The solution is to simply use 'h_cg', instead of
    hugetlb_cgroup_from_counter(counter), since that is a reference to the
    hugetlb_cgroup anyway. After this change kasan ceases to complain.
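
    In sketch form (paraphrased, not the verbatim hunk):

      if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages,
                                   &counter)) {
              ret = -ENOMEM;
              /* was: hugetlb_event(hugetlb_cgroup_from_counter(counter,
               *                                    idx), idx, HUGETLB_MAX);
               * 'counter' may be uninitialized on failure, so use the
               * h_cg reference we already hold: */
              hugetlb_event(h_cg, idx, HUGETLB_MAX);
      }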

    Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2")
    Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Acked-by: Giuseppe Scrivano
    Acked-by: Tejun Heo
    Cc: Mike Kravetz
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     

17 Dec, 2019

1 commit

  • In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
    the lack of the hugetlb controller.

    When the controller is enabled, it exposes four new files for each
    hugetlb size on non-root cgroups:

    - hugetlb.<hugepagesize>.current
    - hugetlb.<hugepagesize>.max
    - hugetlb.<hugepagesize>.events
    - hugetlb.<hugepagesize>.events.local

    The differences with the legacy hierarchy are in the file names and
    using the value "max" instead of "-1" to disable a limit.

    The file .limit_in_bytes is renamed to .max.

    The file .usage_in_bytes is renamed to .current.

    .failcnt is not provided as a single file anymore, but its value can
    be read through the new flat-keyed files .events and .events.local,
    through the "max" key.

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Tejun Heo

    Giuseppe Scrivano
     

16 Nov, 2019

1 commit

  • An exiting task might belong to an offline cgroup. In this case an
    attempt to grab a cgroup reference from the task can end up with an
    infinite loop in hugetlb_cgroup_charge_cgroup(), because neither will
    the cgroup become online, nor will the task be migrated to a live
    cgroup.

    Fix this by switching over to css_tryget(). As css_tryget_online()
    can't guarantee that the cgroup won't go offline, in most cases the
    check doesn't make sense. In this particular case users of
    hugetlb_cgroup_charge_cgroup() are not affected by this change.

    A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
    css_tryget() instead of css_tryget_online() in task_get_css()").
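
    The resulting retry loop, sketched from hugetlb_cgroup_charge_cgroup():

      again:
              rcu_read_lock();
              h_cg = hugetlb_cgroup_from_task(current);
              if (!css_tryget(&h_cg->css)) {  /* was css_tryget_online() */
                      rcu_read_unlock();
                      goto again;
              }
              rcu_read_unlock();

    With css_tryget(), a charge against an offline-but-alive css now
    succeeds instead of spinning forever.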

    Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Sep, 2019

1 commit

  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
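
    The helper is a one-liner (paraphrased from include/linux/mm.h as
    introduced):

      static inline unsigned long compound_nr(struct page *page)
      {
              return 1UL << compound_order(page);
      }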

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Jun, 2018

1 commit

  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have the direct matching
    between memory cgroup knobs (low, high, max) and page_counters API.

    This is pure renaming, this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

21 May, 2016

1 commit

  • The page_counter rounds limits down to page size values. This makes
    sense, except in the case of hugetlb_cgroup where it's not possible to
    charge partial hugepages. If the hugetlb_cgroup margin is less than the
    hugepage size being charged, it will fail as expected.

    Round the hugetlb_cgroup limit down to hugepage size, since it is the
    effective limit of the cgroup.

    For consistency, round down PAGE_COUNTER_MAX as well when a
    hugetlb_cgroup is created: this prevents error reports when a user
    cannot restore the value to the kernel default.
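
    The two rounding sites, in sketch form (paraphrased; 'h' is the
    hstate being limited):

      /* In hugetlb_cgroup_write(), clamp the user-supplied limit: */
      nr_pages = round_down(nr_pages, 1 << huge_page_order(h));

      /* On cgroup creation, round the default down as well, so a user
       * can write the reported value back without an error: */
      limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(h));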

    Signed-off-by: David Rientjes
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Nov, 2015

1 commit

  • Hugh has pointed out that the compound_head() call can be unsafe in
    some contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head into the third double word
    block, in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail(), and if it is set, the remaining bits are a pointer to the
    head page.
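
    The encoding in helper form, close to what landed in
    include/linux/page-flags.h:

      static inline int PageTail(struct page *page)
      {
              return READ_ONCE(page->compound_head) & 1;
      }

      static inline struct page *compound_head(struct page *page)
      {
              unsigned long head = READ_ONCE(page->compound_head);

              /* Bit 0 set: this is a tail page and the rest of the word
               * is the head page pointer. A single READ_ONCE means the
               * check and the dereference use the same snapshot. */
              if (unlikely(head & 1))
                      return (struct page *)(head - 1);
              return page;
      }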

    The patch moves page->pmd_huge_pte out of this word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
    limitation, since there's now space in first tail page to store struct
    hugetlb_cgroup pointer. But that's out of scope of the patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of the word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and
    we could get a false positive from PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • page_counter_try_charge() currently returns 0 on success and -ENOMEM on
    failure, which is surprising behavior given the function name.

    Make it follow the expected pattern of try_stuff() functions that return a
    boolean true to indicate success, or false for failure.
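
    The new signature and the resulting caller pattern, in sketch form:

      bool page_counter_try_charge(struct page_counter *counter,
                                   unsigned long nr_pages,
                                   struct page_counter **fail);

      /* Callers now read the way the name suggests: */
      if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
              ret = -ENOMEM;  /* 'counter' points at the limiting counter */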

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

12 Feb, 2015

1 commit


11 Dec, 2014

1 commit


30 Aug, 2014

1 commit

  • spin_lock may be an empty struct for !SMP configurations, and so
    arch_spin_is_locked may unconditionally return 0 and trigger the
    VM_BUG_ON even when the lock is held.

    Replace spin_is_locked by lockdep_assert_held. We will not BUG anymore,
    but it is questionable whether crashing makes a lot of sense in the
    uncharge path anyway. Uncharge happens after the last page reference
    has been released, so nobody should touch the page, and the function
    doesn't update any shared state except for the res counter, which uses
    its own synchronization.
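
    The change itself is a one-line swap, in sketch form:

      /* before: false positive on !SMP, where the lock is an empty struct */
      VM_BUG_ON(!spin_is_locked(&hugetlb_lock));

      /* after: a no-op without lockdep, a WARN (not a crash) with it */
      lockdep_assert_held(&hugetlb_lock);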

    Signed-off-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2014

1 commit

  • Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource
    counter since it makes no sense to allow a partial page to be charged.

    As a result of the hugetlb cgroup using the resource counter, it is also
    aligned to PAGE_SIZE but makes no sense unless aligned to the size of
    the hugepage being limited.

    Align hugetlb cgroup limit to hugepage size.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Jul, 2014

1 commit

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

17 May, 2014

1 commit

  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and let the users
    dereference css->parent. While at it, explicitly mark fields of css
    which are public and immutable.

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

3 commits

  • cftype->trigger() is pointless. It's trivial to ignore the input
    buffer from a regular ->write() operation. Convert all ->trigger()
    users to ->write() and remove ->trigger().

    This patch doesn't introduce any visible behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.

    cftype->write_string() has no users left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     
  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they're about onliness.

    I thought about keeping the existing names as-are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

08 Feb, 2014

1 commit

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems, it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the followings.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
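
    The macro, paraphrased from include/linux/mmdebug.h as added
    (dump_page() took only the page argument at the time):

      #define VM_BUG_ON_PAGE(cond, page)                              \
              do {                                                    \
                      if (unlikely(cond)) {                           \
                              dump_page(page);  /* dump before dying */ \
                              BUG();                                  \
                      }                                               \
              } while (0)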

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

06 Dec, 2013

1 commit

  • In preparation of conversion to kernfs, cgroup file handling is being
    consolidated so that it can be easily mapped to the seq_file based
    interface of kernfs.

    All users of cftype->read() can be easily served, usually better, by
    seq_file and other methods. Update hugetlb_cgroup_read() to return
    u64 instead of printing itself and rename it to
    hugetlb_cgroup_read_u64().

    This patch doesn't make any visible behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner

    Tejun Heo
     

09 Aug, 2013

6 commits

  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward, but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or has an accessor function wrapping
    such cast. As cgroup as whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controllers specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    hugetlb_cgroup functions pass around struct hugetlb_cgroup instead of
    struct cgroup.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

19 Dec, 2012

1 commit

  • Build a kernel with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
    CONFIG_CGROUP_HUGETLB=y, then specify the hugepagesz=xx boot option;
    the system will fail to boot.

    This failure is caused by the following code path:

      setup_hugepagesz
        hugetlb_add_hstate
          hugetlb_cgroup_file_init
            cgroup_add_cftypes
              kzalloc    <-- slab is not available yet

    Signed-off-by: Jiang Liu
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

20 Nov, 2012

1 commit


06 Nov, 2012

2 commits

  • All ->pre_destroy() implementations return 0 now, which is the only
    allowed return value. Make it return void.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Vivek Goyal

    Tejun Heo
     
  • Now that pre_destroy callbacks are called from a context where no task
    can attach to the group and no child group can be added, there is no
    way for hugetlb_pre_destroy to fail.

    Signed-off-by: Michal Hocko
    Reviewed-by: Tejun Heo
    Reviewed-by: Glauber Costa
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Tejun Heo

    Michal Hocko
     

01 Aug, 2012

1 commit