07 Dec, 2020

1 commit

  • Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in,
    transfer packets) or interact with real hardware. Just initializing
    the EAL layer (which handles hugepage reservation and locking) seems
    to be enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
       - 49.97% worker_thread
          - 52.64% process_one_work
             - 62.08% css_killed_work_fn
                - hugetlb_cgroup_css_offline
                     41.52% _raw_spin_lock
                   - 2.82% _cond_resched
                        rcu_all_qs
                     2.66% PageHuge
                   - 0.57% schedule
                      - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation
    counts when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup for cgroups with reservation
    counts. Unfortunately, the code checking reservation counts was
    incorrectly added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed. The hstate index is not reinitialized each time through the
    do-while loop. Fix this as well.
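
    In sketch form, the fixed logic looks roughly like the following
    (paraphrased from mm/hugetlb_cgroup.c, not a verbatim hunk):

      static bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
      {
              int idx;

              for (idx = 0; idx < hugetlb_max_hstate; idx++) {
                      /* Look at faulted-page usage only; reservation
                       * counters intentionally stay on the zombie. */
                      if (page_counter_read(
                              hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
                              return true;
              }
              return false;
      }

      static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
      {
              struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
              struct hstate *h;
              struct page *page;
              int idx;

              do {
                      idx = 0;        /* the second fix: reset the hstate
                                       * index on every pass */
                      for_each_hstate(h) {
                              spin_lock(&hugetlb_lock);
                              list_for_each_entry(page,
                                              &h->hugepage_activelist, lru)
                                      hugetlb_cgroup_move_parent(idx,
                                                      h_cg, page);
                              spin_unlock(&hugetlb_lock);
                              idx++;
                      }
                      cond_resched();
              } while (hugetlb_cgroup_have_usage(h_cg));
      }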

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

22 Aug, 2020

1 commit

  • Replace a comma between expression statements by a semicolon.
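
    For illustration only (not the actual kernel hunk), a minimal
    standalone example of why the stray comma compiles yet is misleading:

      #include <stdio.h>

      int main(void)
      {
              int a = 0, b = 0;

              /* The comma operator fuses this into one statement; both
               * assignments still happen, so behavior is unchanged here,
               * but the trailing comma is easy to misread and fragile
               * if code is later inserted between the two lines. */
              a = 1,
              b = 2;

              printf("a=%d b=%d\n", a, b);    /* prints a=1 b=2 */
              return 0;
      }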

    Fixes: faced7e0806cf4 ("mm: hugetlb controller for cgroups v2")
    Signed-off-by: Xu Wang
    Signed-off-by: Andrew Morton
    Cc: Tejun Heo
    Cc: Giuseppe Scrivano
    Link: http://lkml.kernel.org/r/20200818064333.21759-1-vulab@iscas.ac.cn
    Signed-off-by: Linus Torvalds

    Xu Wang
     

08 Apr, 2020

1 commit

  • Convert the various /* fallthrough */ comments to the pseudo-keyword
    fallthrough;

    Done via script:
    https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/
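
    The shape of the conversion, as a standalone sketch; in the kernel,
    fallthrough comes from <linux/compiler_attributes.h>, so the macro
    below is just a stand-in for illustration:

      #include <stdio.h>

      /* Minimal stand-in for the kernel's definition. */
      #define fallthrough __attribute__((__fallthrough__))

      static const char *classify(int n)
      {
              switch (n) {
              case 0:
                      fallthrough;    /* replaces the old fall-through comment */
              case 1:
                      return "small";
              default:
                      return "large";
              }
      }

      int main(void)
      {
              printf("%s\n", classify(0));    /* prints "small" */
              return 0;
      }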

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Reviewed-by: Gustavo A. R. Silva
    Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     

03 Apr, 2020

5 commits

  • For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
    in the resv_map entries, in file_region->reservation_counter.

    After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
    if successful, we pass on the hugetlb_cgroup info to a follow up
    region_add call. When a file_region entry is added to the resv_map via
    region_add, we put the pointer to that cgroup in
    file_region->reservation_counter. If charging doesn't succeed, we report
    the error to the caller, so that the kernel fails the reservation.

    On region_del, which is when the hugetlb memory is unreserved, we also
    uncharge the file_region->reservation_counter.
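
    Roughly where that pointer lives (paraphrased from
    include/linux/hugetlb.h; treat the field layout as a sketch):

      struct file_region {
              struct list_head link;
              long from;
              long to;
      #ifdef CONFIG_CGROUP_HUGETLB
              /* For shared mappings: the counter to uncharge when this
               * region is deleted, plus a css reference to keep the
               * cgroup alive until then. */
              struct page_counter *reservation_counter;
              struct cgroup_subsys_state *css;
      #endif
      };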

    [akpm@linux-foundation.org: forward declare struct file_region]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Normally the pointer to the cgroup to uncharge hangs off the struct page,
    and gets queried when it's time to free the page. With hugetlb_cgroup
    reservations, this is not possible, because a page may be reserved by
    one task and actually faulted in by another task.

    The best place to put the hugetlb_cgroup pointer to uncharge for
    reservations is in the resv_map. But, because the resv_map has different
    semantics for private and shared mappings, the code path to
    charge/uncharge shared and private mappings is different. This patch
    implements charging and uncharging for private mappings.

    For private mappings, the counter to uncharge is in
    resv_map->reservation_counter. On initializing the resv_map this is set
    to NULL. On reservation of a region in a private mapping, the task's
    hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
    resv_map->reservation_counter.

    On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
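
    The corresponding resv_map fields, again paraphrased as a sketch:

      struct resv_map {
              struct kref refs;
              spinlock_t lock;
              struct list_head regions;
              long adds_in_progress;
              struct list_head region_cache;
              long region_cache_count;
      #ifdef CONFIG_CGROUP_HUGETLB
              /* For private mappings: NULL at init; set when a region
               * is reserved, uncharged at hugetlb_vm_op_close(). */
              struct page_counter *reservation_counter;
              unsigned long pages_per_hpage;
              struct cgroup_subsys_state *css;
      #endif
      };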

    [akpm@linux-foundation.org: forward declare struct resv_map]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
    hugetlb reservations") mistakingly doesn't handle the migration of *both*
    the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.

    What should happen is that both cgroups should be queried from the old
    page, then both set to NULL on the old page, then both inserted into the
    new page.

    The mistake also creates the following warning:

    mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
    mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
    [-Wunused-but-set-variable]
    struct hugetlb_cgroup *h_cg;
    ^~~~

    The solution is to add the missing steps, namely setting the reservation
    hugetlb_cgroup to NULL on the old page, and setting the fault
    hugetlb_cgroup on the new page.
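
    In sketch form (paraphrased from hugetlb_cgroup_migrate(); the lines
    marked "was missing" are the ones this fix adds):

      spin_lock(&hugetlb_lock);
      h_cg = hugetlb_cgroup_from_page(oldhpage);
      h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
      set_hugetlb_cgroup(oldhpage, NULL);
      set_hugetlb_cgroup_rsvd(oldhpage, NULL);      /* was missing */

      /* Move both cgroup pointers to the new page. */
      set_hugetlb_cgroup(newhpage, h_cg);           /* was missing */
      set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
      spin_unlock(&hugetlb_lock);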

    Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Qian Cai
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
    or hugetlb reservation counter.

    Adds a new interface to uncharge a hugetlb_cgroup counter via
    hugetlb_cgroup_uncharge_counter.

    Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
    hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
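
    A hedged sketch of the interface shape; the helper names follow later
    mainline and the exact merged signatures may differ:

      /* One internal helper picks the counter via an 'rsvd' flag... */
      static int __hugetlb_cgroup_charge_cgroup(int idx,
                                                unsigned long nr_pages,
                                                struct hugetlb_cgroup **ptr,
                                                bool rsvd);

      /* ...with a thin wrapper for the reservation case: */
      int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
                                            struct hugetlb_cgroup **ptr)
      {
              return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, true);
      }

      /* New uncharge entry point for a bare page_counter: */
      void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
                                           unsigned long nr_pages)
      {
              if (hugetlb_cgroup_disabled() || !p)
                      return;
              page_counter_uncharge(p, nr_pages);
      }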

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • These counters will track hugetlb reservations rather than hugetlb memory
    faulted in. This patch only adds the counter; the following patches add
    the charging and uncharging of the counter.

    This is patch 1 of a 9-patch series.

    Problem:

    Currently tasks attempting to reserve more hugetlb memory than is
    available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
    Reservations [1]. However, if a task attempts to reserve more hugetlb
    memory than its hugetlb_cgroup limit allows, the kernel will allow the
    mmap/shmget call, but will SIGBUS the task when it attempts to fault in
    the excess memory.

    We have users hitting their hugetlb_cgroup limits and thus we've been
    looking at this failure mode. We'd like to improve this behavior such
    that users violating the hugetlb_cgroup limits get an error at mmap/shmget
    time, rather than getting SIGBUS'd when they try to fault the excess
    memory in. This gives the user an opportunity to fallback more gracefully
    to non-hugetlbfs memory for example.

    The underlying problem is that today's hugetlb_cgroup accounting happens
    at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
    enforcing the hugetlb_cgroup limit only happens at fault time, and the
    offending task gets SIGBUS'd.

    Proposed Solution:

    A new page counter named
    'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
    slightly different semantics than
    'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

    - While usage_in_bytes tracks all *faulted* hugetlb memory,
    rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
    memory faulted in without a prior reservation.

    - If a task attempts to reserve more memory than limit_in_bytes allows,
    the kernel will allow it to do so. But if a task attempts to reserve
    more memory than rsvd.limit_in_bytes, the kernel will fail this
    reservation.

    This proposal is implemented in this patch series, with tests to verify
    functionality and show the usage.
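
    To make the change concrete, a hedged userspace sketch (the mapping
    size and the 2MB limit file named in the comment are illustrative):

      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
              size_t len = 4UL << 20;         /* e.g. two 2MB hugepages */
              void *p;

              p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
              if (p == MAP_FAILED) {
                      /* With hugetlb.2MB.rsvd.limit_in_bytes set below
                       * 'len', the failure surfaces here, at mmap time. */
                      perror("mmap");
                      return 1;
              }

              ((char *)p)[0] = 1;     /* under the old fault-time accounting,
                                       * this is where SIGBUS would hit */
              munmap(p, len);
              return 0;
      }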

    Alternatives considered:

    1. A new cgroup, instead of only a new page_counter attached to the
    existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
    duplication with hugetlb_cgroup. Keeping hugetlb related page counters
    under hugetlb_cgroup seemed cleaner as well.

    2. Instead of adding a new counter, we considered adding a sysctl that
    modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
    accounting at reservation time rather than fault time. Adding a new
    page_counter seems better as userspace could, if it wants, choose to
    enforce different cgroups differently: one via limit_in_bytes, and
    another via rsvd.limit_in_bytes. This could be very useful if you're
    transitioning how hugetlb memory is partitioned on your system one
    cgroup at a time, for example. Also, someone may find usage for both
    limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
    gives them the option to do so.

    Testing:
    - Added tests passing.
    - Used libhugetlbfs for regression testing.

    [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Shuah Khan
    Cc: Shakeel Butt
    Cc: Greg Thelen
    Cc: Sandipan Das
    Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     

30 Mar, 2020

1 commit

  • This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb
    controller for cgroups v2").

    Essentially that commit does a hugetlb_cgroup_from_counter assuming that
    page_counter_try_charge has initialized counter.

    But if the charge has failed, it will not have initialized counter, so
    hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
    causing kasan to complain.

    The solution is to simply use 'h_cg', instead of
    hugetlb_cgroup_from_counter(counter), since that is a reference to the
    hugetlb_cgroup anyway. After this change kasan ceases to complain.
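
    In sketch form (paraphrased, not the verbatim hunk):

      if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages,
                                   &counter)) {
              ret = -ENOMEM;
              /* was: hugetlb_event(hugetlb_cgroup_from_counter(counter,
               *                                    idx), idx, HUGETLB_MAX);
               * 'counter' may be uninitialized on failure, so use the
               * h_cg reference we already hold: */
              hugetlb_event(h_cg, idx, HUGETLB_MAX);
      }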

    Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2")
    Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Acked-by: Giuseppe Scrivano
    Acked-by: Tejun Heo
    Cc: Mike Kravetz
    Cc: David Rientjes
    Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     

17 Dec, 2019

1 commit

  • In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
    the lack of the hugetlb controller.

    When the controller is enabled, it exposes four new files for each
    hugetlb size on non-root cgroups:

    - hugetlb.<hugepagesize>.current
    - hugetlb.<hugepagesize>.max
    - hugetlb.<hugepagesize>.events
    - hugetlb.<hugepagesize>.events.local

    The differences with the legacy hierarchy are in the file names and
    using the value "max" instead of "-1" to disable a limit.

    The file .limit_in_bytes is renamed to .max.

    The file .usage_in_bytes is renamed to .current.

    .failcnt is not provided as a single file anymore, but its value can
    be read through the new flat-keyed files .events and .events.local,
    through the "max" key.

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Tejun Heo

    Giuseppe Scrivano
     

16 Nov, 2019

1 commit

  • An exiting task might belong to an offline cgroup. In this case an
    attempt to grab a cgroup reference from the task can end up with an
    infinite loop in hugetlb_cgroup_charge_cgroup(), because neither will
    the cgroup become online, nor will the task be migrated to a live
    cgroup.

    Fix this by switching over to css_tryget(). As css_tryget_online()
    can't guarantee that the cgroup won't go offline, in most cases the
    check doesn't make sense. In this particular case users of
    hugetlb_cgroup_charge_cgroup() are not affected by this change.

    A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
    css_tryget() instead of css_tryget_online() in task_get_css()").
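
    The resulting retry loop, sketched from hugetlb_cgroup_charge_cgroup():

      again:
              rcu_read_lock();
              h_cg = hugetlb_cgroup_from_task(current);
              if (!css_tryget(&h_cg->css)) {  /* was css_tryget_online() */
                      rcu_read_unlock();
                      goto again;
              }
              rcu_read_unlock();

    With css_tryget(), a charge against an offline-but-alive css now
    succeeds instead of spinning forever.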

    Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Acked-by: Tejun Heo
    Reviewed-by: Shakeel Butt
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Sep, 2019

1 commit

  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
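
    The helper is a one-liner (paraphrased from include/linux/mm.h as
    introduced):

      static inline unsigned long compound_nr(struct page *page)
      {
              return 1UL << compound_order(page);
      }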

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

08 Jun, 2018

1 commit

  • This patch renames struct page_counter fields:
    count -> usage
    limit -> max

    and the corresponding functions:
    page_counter_limit() -> page_counter_set_max()
    mem_cgroup_get_limit() -> mem_cgroup_get_max()
    mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
    memcg_update_kmem_limit() -> memcg_update_kmem_max()
    memcg_update_tcp_limit() -> memcg_update_tcp_max()

    The idea behind this renaming is to have the direct matching
    between memory cgroup knobs (low, high, max) and page_counters API.

    This is pure renaming, this patch doesn't bring any functional change.

    Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

21 May, 2016

1 commit

  • The page_counter rounds limits down to page size values. This makes
    sense, except in the case of hugetlb_cgroup where it's not possible to
    charge partial hugepages. If the hugetlb_cgroup margin is less than the
    hugepage size being charged, it will fail as expected.

    Round the hugetlb_cgroup limit down to hugepage size, since it is the
    effective limit of the cgroup.

    For consistency, round down PAGE_COUNTER_MAX as well when a
    hugetlb_cgroup is created: this prevents error reports when a user
    cannot restore the value to the kernel default.
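
    The two rounding sites, in sketch form (paraphrased; 'h' is the
    hstate being limited):

      /* In hugetlb_cgroup_write(), clamp the user-supplied limit: */
      nr_pages = round_down(nr_pages, 1 << huge_page_order(h));

      /* On cgroup creation, round the default down as well, so a user
       * can write the reported value back without an error: */
      limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(h));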

    Signed-off-by: David Rientjes
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Nov, 2015

1 commit

  • Hugh has pointed out that the compound_head() call can be unsafe in
    some contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head into the third double word
    block, in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail(), and if it is set, the remaining bits are a pointer to the
    head page.
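
    The encoding in helper form, close to what landed in
    include/linux/page-flags.h:

      static inline int PageTail(struct page *page)
      {
              return READ_ONCE(page->compound_head) & 1;
      }

      static inline struct page *compound_head(struct page *page)
      {
              unsigned long head = READ_ONCE(page->compound_head);

              /* Bit 0 set: this is a tail page and the rest of the word
               * is the head page pointer. A single READ_ONCE means the
               * check and the dereference use the same snapshot. */
              if (unlikely(head & 1))
                      return (struct page *)(head - 1);
              return page;
      }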

    The patch moves page->pmd_huge_pte out of this word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
    limitation, since there's now space in first tail page to store struct
    hugetlb_cgroup pointer. But that's out of scope of the patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of the word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and
    we could get a false positive from PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • page_counter_try_charge() currently returns 0 on success and -ENOMEM on
    failure, which is surprising behavior given the function name.

    Make it follow the expected pattern of try_stuff() functions that return a
    boolean true to indicate success, or false for failure.
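
    The new signature and the resulting caller pattern, in sketch form:

      bool page_counter_try_charge(struct page_counter *counter,
                                   unsigned long nr_pages,
                                   struct page_counter **fail);

      /* Callers now read the way the name suggests: */
      if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
              ret = -ENOMEM;  /* 'counter' points at the limiting counter */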

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

12 Feb, 2015

1 commit


11 Dec, 2014

1 commit


30 Aug, 2014

1 commit

  • spin_lock may be an empty struct for !SMP configurations, and so
    arch_spin_is_locked may unconditionally return 0 and trigger the
    VM_BUG_ON even when the lock is held.

    Replace spin_is_locked by lockdep_assert_held. We will not BUG anymore,
    but it is questionable whether crashing makes a lot of sense in the
    uncharge path anyway. Uncharge happens after the last page reference
    has been released, so nobody should touch the page, and the function
    doesn't update any shared state except for the res counter, which uses
    its own synchronization.
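
    The change itself is a one-line swap, in sketch form:

      /* before: false positive on !SMP, where the lock is an empty struct */
      VM_BUG_ON(!spin_is_locked(&hugetlb_lock));

      /* after: a no-op without lockdep, a WARN (not a crash) with it */
      lockdep_assert_held(&hugetlb_lock);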

    Signed-off-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2014

1 commit

  • Memcg aligns memory.limit_in_bytes to PAGE_SIZE as part of the resource
    counter since it makes no sense to allow a partial page to be charged.

    As a result of the hugetlb cgroup using the resource counter, it is also
    aligned to PAGE_SIZE but makes no sense unless aligned to the size of
    the hugepage being limited.

    Align hugetlb cgroup limit to hugepage size.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

15 Jul, 2014

1 commit

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

17 May, 2014

1 commit

  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and let the users
    dereference css->parent. While at it, explicitly mark fields of css
    which are public and immutable.

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

3 commits

  • cftype->trigger() is pointless. It's trivial to ignore the input
    buffer from a regular ->write() operation. Convert all ->trigger()
    users to ->write() and remove ->trigger().

    This patch doesn't introduce any visible behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.

    cftype->write_string() has no users left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     
  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they're about onliness.

    I thought about keeping the existing names as-are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

08 Feb, 2014

1 commit

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems, it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the followings.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
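
    The macro, paraphrased from include/linux/mmdebug.h as added
    (dump_page() took only the page argument at the time):

      #define VM_BUG_ON_PAGE(cond, page)                              \
              do {                                                    \
                      if (unlikely(cond)) {                           \
                              dump_page(page);  /* dump before dying */ \
                              BUG();                                  \
                      }                                               \
              } while (0)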

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

06 Dec, 2013

1 commit

  • In preparation of conversion to kernfs, cgroup file handling is being
    consolidated so that it can be easily mapped to the seq_file based
    interface of kernfs.

    All users of cftype->read() can be easily served, usually better, by
    seq_file and other methods. Update hugetlb_cgroup_read() to return
    u64 instead of printing itself and rename it to
    hugetlb_cgroup_read_u64().

    This patch doesn't make any visible behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner

    Tejun Heo
     

09 Aug, 2013

6 commits

  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward, but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or has an accessor function wrapping
    such cast. As cgroup as whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controllers specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    hugetlb_cgroup functions pass around struct hugetlb_cgroup instead of
    struct cgroup.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

19 Dec, 2012

1 commit

  • Build a kernel with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
    CONFIG_CGROUP_HUGETLB=y, then specify the hugepagesz=xx boot option;
    the system will fail to boot.

    This failure is caused by the following code path:

      setup_hugepagesz
        hugetlb_add_hstate
          hugetlb_cgroup_file_init
            cgroup_add_cftypes
              kzalloc    <-- slab is not available yet

    Signed-off-by: Jiang Liu
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

20 Nov, 2012

1 commit


06 Nov, 2012

2 commits

  • All ->pre_destroy() implementations return 0 now, which is the only
    allowed return value. Make it return void.

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Vivek Goyal

    Tejun Heo
     
  • Now that pre_destroy callbacks are called from a context where no task
    can attach to the group and no child group can be added, there is no
    way for hugetlb_pre_destroy to fail.

    Signed-off-by: Michal Hocko
    Reviewed-by: Tejun Heo
    Reviewed-by: Glauber Costa
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Tejun Heo

    Michal Hocko
     

01 Aug, 2012

1 commit