09 Aug, 2013

12 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, the
    cgroup API, including the iterators, has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments, making this a relatively acceptable opportunity for
    this type of change.

    The conversions are mostly straightforward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, an "if (pos == origin) continue;" check is
    added. Some conversions gain extra reference get/put operations
    around origin handling as a result of consolidating origin handling
    with the rest. While the extra ref operations aren't strictly
    necessary, this shouldn't cause any noticeable difference. A sketch
    of the resulting pattern follows.
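
    As a rough illustration, the new pattern might look like the sketch
    below; css_for_each_descendant_pre() is the iterator from this
    series, while @root_css and do_something() are placeholders.

        struct cgroup_subsys_state *pos;

        rcu_read_lock();
        css_for_each_descendant_pre(pos, root_css) {
                /* the origin is visited too now; skip it explicitly
                 * where the old behavior is still wanted */
                if (pos == root_css)
                        continue;
                do_something(pos);
        }
        rcu_read_unlock();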

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically, but it also helps clean up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cftype->[un]register_event() is among the remaining couple of interfaces
    which still use struct cgroup. Convert it to cgroup_subsys_state.
    The conversion is mostly mechanical and removes the last users of
    mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed.

    v2: indentation update as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    This patch converts task iterators to deal with css instead of cgroup.
    Note that under the unified hierarchy, different sets of tasks will
    be considered to belong to a given cgroup depending on the subsystem
    in question, and making the iterators deal with css instead of
    cgroup provides them with enough information about the iteration.

    While at it, fix several function comment formats in cpuset.c.

    This patch doesn't introduce any behavior differences.
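
    As a hedged sketch of the resulting interface (names as used after
    this series: css_task_iter_start/next/end; process_task() is a
    placeholder), iterating the tasks of a css looks like:

        struct css_task_iter it;
        struct task_struct *task;

        css_task_iter_start(css, &it);
        while ((task = css_task_iter_next(&it)))
                process_task(task);
        css_task_iter_end(&it);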

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley

    Tejun Heo
     
  • Currently all cgroup_task_iter functions require @cgrp to be passed
    in, which is superfluous and increases the chance of usage error.
    Make cgroup_task_iter remember the cgroup being iterated and drop
    the @cgrp argument from the next and end functions.

    This patch doesn't introduce any behavior differences.
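
    Illustratively (a sketch of the call-site change, not verbatim
    kernel code; process_task() is a placeholder):

        /* before: @cgrp had to be passed at every step */
        cgroup_task_iter_start(cgrp, &it);
        while ((task = cgroup_task_iter_next(cgrp, &it)))
                process_task(task);
        cgroup_task_iter_end(cgrp, &it);

        /* after: the iterator remembers the cgroup */
        cgroup_task_iter_start(cgrp, &it);
        while ((task = cgroup_task_iter_next(&it)))
                process_task(task);
        cgroup_task_iter_end(&it);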

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup now has multiple iterators and it's quite confusing to have
    something which walks over tasks of a single cgroup named cgroup_iter.
    Let's rename it to cgroup_task_iter.

    While at it, reformat / update comments and replace the overview
    comment above the interface function decls with proper function
    comments. Such an overview can be useful, but function comments
    should be more than enough here.

    This is pure rename and doesn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question, as css is the intersection between a cgroup
    and a subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically, independently of the cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straightforward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.
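
    A hedged before/after sketch of the iterator change (@ss_id and
    do_something() are placeholders, locking omitted):

        /* before: walk cgroups, then look up the subsystem state */
        struct cgroup *pos;

        cgroup_for_each_descendant_pre(pos, cgrp)
                do_something(cgroup_css(pos, ss_id));

        /* after: walk the subsystem's css's directly */
        struct cgroup_subsys_state *pos;

        css_for_each_descendant_pre(pos, css)
                do_something(pos);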

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones; a sketch of the resulting handler signature
    follows the list below.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.
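
    For illustration, a read_u64 style handler's signature changes
    roughly as below (foo_read_u64() is a made-up name):

        /* before */
        static u64 foo_read_u64(struct cgroup *cgrp, struct cftype *cft);

        /* after */
        static u64 foo_read_u64(struct cgroup_subsys_state *css,
                                struct cftype *cft);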

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, subsystems only care about their part in
    the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straightforward. A few
    noteworthy changes are

    * ->css_alloc() now takes the css of the parent cgroup rather than a
    pointer to the new cgroup, as the css for the new cgroup doesn't
    exist yet (see the sketch after this list). Knowing the parent css
    is enough for all the existing subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.
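
    Sketch of the ->css_alloc() change mentioned above (foo_ is a
    made-up subsystem name):

        /* before */
        static struct cgroup_subsys_state *
        foo_css_alloc(struct cgroup *cgrp);

        /* after: the new cgroup's css doesn't exist yet, so the
         * parent's css is passed in instead */
        static struct cgroup_subsys_state *
        foo_css_alloc(struct cgroup_subsys_state *parent_css);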

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL
    parent css as long as the target css is not at the top of the
    hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced
    with parent_ca(). The only difference between the two was the NULL
    test on cgroup->parent, which is now embedded in css_parent(),
    making the distinction moot. Note that eventually a css->parent
    field will be added to css and the NULL check in css_parent() will
    go away.

    This patch shouldn't cause any behavior differences.
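
    A minimal sketch of the resulting pattern, using a freezer-style
    accessor (css_freezer() per an earlier patch in this series):

        static struct freezer *parent_freezer(struct freezer *freezer)
        {
                return css_freezer(css_parent(&freezer->css));
        }

    css_parent() returns NULL for the root css, and the accessor passes
    the NULL through, so no separate root check is needed.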

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to
    cast from css to such a data structure or have an accessor function
    wrapping such a cast. As cgroup as a whole is moving towards using
    css as the main interface handle, add and update such accessors to
    ease dealing with css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controller-specific data structures have css as the first field,
    the cast doesn't involve any offsetting and the compiler can
    trivially optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    an accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.
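
    The accessor pattern is roughly the following (memcg shown as an
    example; the exact function bodies may differ):

        static inline struct mem_cgroup *
        mem_cgroup_from_css(struct cgroup_subsys_state *css)
        {
                /* css is the first field, so this is a plain cast
                 * and the NULL branch trivially optimizes away */
                return css ? container_of(css, struct mem_cgroup, css)
                           : NULL;
        }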

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    hugetlb_cgroup functions pass around struct hugetlb_cgroup instead of
    struct cgroup.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp a large portion of the cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

11 Jul, 2013

3 commits

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • zswap is a thin backend for frontswap that takes pages that are in the
    process of being swapped out and attempts to compress them and store
    them in a RAM-based memory pool. This can result in a significant I/O
    reduction on the swap device and, in the case where decompressing from
    RAM is faster than reading from the swap device, can also improve
    workload performance.

    It also has support for evicting swap pages that are currently
    compressed in zswap to the swap device on an LRU(ish) basis. This
    functionality makes zswap a true cache in that, once the cache is full,
    the oldest pages can be moved out of zswap to the swap device so newer
    pages can be compressed and stored in zswap.

    This patch adds the zswap driver to mm/
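
    For context, a frontswap backend hooks in roughly like this (a
    hedged sketch based on the frontswap API of this era):

        static struct frontswap_ops zswap_frontswap_ops = {
                .store = zswap_frontswap_store,
                .load = zswap_frontswap_load,
                .invalidate_page = zswap_frontswap_invalidate_page,
                .invalidate_area = zswap_frontswap_invalidate_area,
                .init = zswap_frontswap_init,
        };

        /* in the module init path */
        frontswap_register_ops(&zswap_frontswap_ops);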

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • zbud is a special purpose allocator for storing compressed pages. It
    is designed to store up to two compressed pages per physical page.
    While this design limits storage density, it has simple and
    deterministic reclaim properties that make it preferable to a higher
    density approach when reclaim will be used.

    zbud works by storing compressed pages, or "zpages", together in pairs
    in a single memory page called a "zbud page". The first buddy is "left
    justifed" at the beginning of the zbud page, and the last buddy is
    "right justified" at the end of the zbud page. The benefit is that if
    either buddy is freed, the freed buddy space, coalesced with whatever
    slack space that existed between the buddies, results in the largest
    possible free region within the zbud page.

    zbud also provides an attractive lower bound on density. The ratio of
    zpages to zbud pages cannot be less than 1. This ensures that zbud can
    never "do harm" by using more pages to store zpages than the
    uncompressed zpages would have used on their own.

    This implementation is a rewrite of the zbud allocator internally used
    by zcache in the drivers/staging tree. The rewrite was necessary to
    remove some of the zcache specific elements that were ingrained
    throughout and provide a generic allocation interface that can later be
    used by zsmalloc and others.

    This patch adds zbud to mm/ for later use by zswap.
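
    A hedged sketch of the generic allocation interface (signatures
    approximate for this version of the code; @ops, @len and
    @compressed_data are placeholders):

        struct zbud_pool *pool;
        unsigned long handle;

        pool = zbud_create_pool(GFP_KERNEL, &ops);
        if (zbud_alloc(pool, len, GFP_KERNEL, &handle) == 0) {
                void *addr = zbud_map(pool, handle);

                memcpy(addr, compressed_data, len);
                zbud_unmap(pool, handle);
        }
        /* later, when the stored zpage is no longer needed */
        zbud_free(pool, handle);
        zbud_destroy_pool(pool);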

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

10 Jul, 2013

25 commits

  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • online_pages() is called from memory_block_action() when a user requests
    to online a memory block via sysfs. This function needs to return a
    proper error value in case of error.

    Signed-off-by: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • min_free_kbytes is currently updated during memory hotplug (by
    init_per_zone_wmark_min), which is the right thing to do in most
    cases, but this could be unexpected if an admin increased the value
    to prevent allocation failures and the new min_free_kbytes would be
    decreased as a result of memory hotadd.

    This patch saves the user-defined value and allows updating
    min_free_kbytes only if it is higher than the saved one.

    A warning is printed when the new value is ignored.
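
    The resulting check is roughly (a sketch; user_min_free_kbytes is
    the saved user-defined value added by the patch):

        if (new_min_free_kbytes > user_min_free_kbytes) {
                min_free_kbytes = new_min_free_kbytes;
        } else {
                pr_warn("min_free_kbytes is not updated to %d because "
                        "user defined value %d is preferred\n",
                        new_min_free_kbytes, user_min_free_kbytes);
        }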

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Acked-by: Zhang Yanfei
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now memcg has the same life cycle as its corresponding cgroup, and a
    cgroup is freed via RCU and then mem_cgroup_css_free() will be called in
    a work function, so we can simply call __mem_cgroup_free() in
    mem_cgroup_css_free().

    This actually reverts commit 59927fb984d ("memcg: free mem_cgroup by RCU
    to fix oops").

    Signed-off-by: Li Zefan
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Now memcg has the same life cycle as its corresponding cgroup. Kill the
    useless refcnt.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The cgroup core guarantees it's always safe to access the parent.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put. A simple replacement
    will do.

    The historical reason that memcg has its own refcnt instead of
    always using css_get/put is that a cgroup couldn't be removed if
    there were still css refs, so css refs couldn't be used as
    long-lived references. The situation has changed: rmdir'ing a
    cgroup now succeeds regardless of css refs, but the cgroup won't be
    freed until the css refcount drops to 0.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put.

    We can't do a simple replacement, because here mem_cgroup_put() is
    called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't
    be called until css refcnt goes down to 0.

    Instead we increment the css refcnt in mem_cgroup_css_offline(), and
    then check if there are still kmem charges. If not, the css refcnt
    will be decremented immediately; otherwise the refcnt will be
    released after the last kmem allocation is uncharged.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Tejun Heo
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().

    There are two things being done in the current code:

    First, we acquire a css ref to make sure that the underlying cgroup
    does not go away. That is a short-lived reference, and it is put as
    soon as the cache is created.

    At this point, we acquire a long-lived per-cache memcg reference count
    to guarantee that the memcg will still be alive.

    so it is:

    enqueue: css_get
    create : memcg_get, css_put
    destroy: memcg_put

    So we only need to get rid of the memcg_get, change the memcg_put to
    css_put, and get rid of the now extra css_put.

    (This changelog is mostly written by Glauber)

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/css_put instead of mem_cgroup_get/put.

    Note, if at the same time someone is moving @current to a different
    cgroup and removing the old cgroup, css_tryget() may return false, and
    sock->sk_cgrp won't be initialized, which is fine.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
    This is not correct because only memcg_propagate_kmem takes an
    additional reference, while mem_cgroup_sockets_init is also allowed
    to fail (although no current implementation fails) but doesn't take
    any reference. This all suggests that it should be
    memcg_propagate_kmem that cleans up after itself, so this patch
    moves mem_cgroup_put over there.

    Unfortunately this is not that easy (as pointed out by Li Zefan),
    because memcg_kmem_mark_dead marks the group dead
    (KMEM_ACCOUNTED_DEAD) if it is marked active (KMEM_ACCOUNTED_ACTIVE),
    which is the case even if memcg_propagate_kmem fails, so the
    additional reference is dropped in that case in kmem_cgroup_destroy,
    which means that the reference would be dropped two times.

    The easiest way then would be to simply remove mem_cgroup_put from
    mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the
    right thing.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.8]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This reverts commit e4715f01be697a.

    mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already
    drops an additional reference from all parents, so the additional
    mem_cgroup_put(parent) potentially causes a use-after-free.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • It is counterintuitive at best that mmap'ing a hugetlbfs file with
    MAP_HUGETLB fails, while mmap'ing it without the flag will a)
    succeed and b) return huge pages.

    v2: use is_file_hugepages(), as suggested by Jianguo
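
    The v2 check amounts to something like the following sketch
    (placement and error handling in the mmap path are assumed):

        /* reject MAP_HUGETLB only when the file isn't a hugetlbfs file */
        if ((flags & MAP_HUGETLB) && !is_file_hugepages(file))
                return -EINVAL;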

    Signed-off-by: Joern Engel
    Cc: Jianguo Wu
    Signed-off-by: Linus Torvalds

    Jörn Engel
     
  • After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
    the scanning priority of kswapd changed.

    The priority now rises until it is scanning enough pages to meet the
    high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number
    of pages were encountered under writeback, but this value is scaled
    based on the priority. As kswapd now frequently scans with a higher
    priority, it is relatively easy to set ZONE_WRITEBACK. This patch
    removes the scaling and treats writeback pages similar to how it
    treats unqueued dirty pages and congested pages. The user-visible
    effect should be that kswapd will write back fewer pages from
    reclaim context.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim is not aborting to allow compaction to go ahead
    properly. do_try_to_free_pages is told to abort reclaim, which it
    happily ignores, and instead keeps raising priority until it reaches
    0 and starts shrinking file/anon equally. This patch corrects the
    situation by aborting reclaim when requested instead of raising
    priority.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Remove one redundant "nid" in the comment.

    Signed-off-by: Tang Chen
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When searching for a vmap area in the vmalloc space, we use (addr +
    size - 1) to check if the value is less than addr, i.e. as an
    overflow check; but we assign (addr + size) to vmap_area->va_end.

    So if we come across the below case:

    (addr + size - 1) : not overflow
    (addr + size) : overflow

    we will assign an overflow value (e.g. 0) to vmap_area->va_end, and
    this will trigger a BUG in __insert_vmap_area, causing a system
    panic.

    So using (addr + size) to check for the overflow is the correct
    behaviour, not (addr + size - 1); see the sketch below.
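
    In code form (a sketch; the overflow label is illustrative):

        /* before: the check itself can overflow and miss exactly the
         * case described above */
        if (addr + size - 1 < addr)
                goto overflow;

        /* after: check the same quantity that is stored in va_end */
        if (addr + size < addr)
                goto overflow;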

    Signed-off-by: Zhang Yanfei
    Reported-by: Ghennadi Procopciuc
    Tested-by: Daniel Baluta
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • These VM_ macros aren't used very often and three of them
    aren't used at all.

    Expand the ones that are used in-place, and remove all the now unused
    #define VM_ macros.

    VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added
    just before 2.4 and appear to have never been used.

    Signed-off-by: Joe Perches
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • With CONFIG_MEMORY_HOTREMOVE unset, there is a compile warning:

    mm/sparse.c:755: warning: `clear_hwpoisoned_pages' defined but not used

    Bisecting ended up pointing to 4edd7ceff ("mm, hotplug: avoid
    compiling memory hotremove functions when disabled").

    This is because the commit above put sparse_remove_one_section()
    within the protection of CONFIG_MEMORY_HOTREMOVE, but
    clear_hwpoisoned_pages(), whose only user is
    sparse_remove_one_section(), is not within that protection.

    So putting clear_hwpoisoned_pages() within CONFIG_MEMORY_HOTREMOVE
    fixes the warning (see the sketch below).
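
    That is, roughly (a sketch of the intended layout):

        #ifdef CONFIG_MEMORY_HOTREMOVE
        static void clear_hwpoisoned_pages(struct page *memmap,
                                           int nr_pages)
        {
                /* ... */
        }

        /* sparse_remove_one_section(), the only caller, already
         * lives inside this same #ifdef block */
        #endif /* CONFIG_MEMORY_HOTREMOVE */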

    Signed-off-by: Zhang Yanfei
    Cc: David Rientjes
    Acked-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is not used anywhere, and its name is easily confused
    with put_page() in mm/swap.c, so it's better to remove it.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • vfree() only needs schedule_work(&p->wq) if p->list was empty;
    otherwise vfree_deferred->wq is already pending, or it is running
    and hasn't done llist_del_all() yet.
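
    In code form, the fast path becomes roughly (a sketch; llist_add()
    returns true only when the list was previously empty):

        struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);

        if (llist_add((struct llist_node *)addr, &p->list))
                schedule_work(&p->wq);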

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
    the order passed. MAX_ORDER is typically 11 and pageblock_order is
    typically 9 on x86. Integer division truncates, so pageblock_order / 2
    is 4. For the first eight iterations, it's guaranteed that
    current_order >= pageblock_order / 2 if it even gets that far!

    So just remove the unlikely(), it's completely bogus.

    Signed-off-by: Zhang Yanfei
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The callers of build_zonelists_node always pass MAX_NR_ZONES -1 as the
    zone_type argument, so we can directly use the value in
    build_zonelists_node and remove the zone_type argument.

    Signed-off-by: Zhang Yanfei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The memory we use to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory
    can only be freed after the last user is gone. Our only way to know
    which is the last user is to hook into freeing time, but the fact
    that we still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa