27 May, 2011

8 commits

  • Two new stats in per-memcg memory.stat which track the number of page
    faults and the number of major page faults.

    "pgfault"
    "pgmajfault"

    They are different from the "pgpgin"/"pgpgout" stats, which count the
    number of pages charged/uncharged to the cgroup and say nothing about
    reading or writing pages to disk.

    It is valuable to track the two stats both for measuring the application's
    performance and for measuring the efficiency of the kernel page reclaim
    path. Counting page faults per process is useful, but we also need the
    aggregated value, since processes are monitored and controlled on a
    per-cgroup basis in memcg.

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0
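
    A minimal userspace sketch of such a check (not part of the patch; the
    /dev/cgroup mount point follows the examples above and may differ on other
    systems) that pulls the "pgfault"/"pgmajfault" lines out of a stat-style
    file so the memcg numbers can be compared with /proc/vmstat:

    #include <stdio.h>
    #include <string.h>

    /* return the value of "key" from a "name value" stat file, or -1 */
    static long long read_stat(const char *path, const char *key)
    {
        char name[64];
        long long val;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        while (fscanf(f, "%63s %lld", name, &val) == 2) {
            if (!strcmp(name, key)) {
                fclose(f);
                return val;
            }
        }
        fclose(f);
        return -1;
    }

    int main(void)
    {
        const char *keys[] = { "pgfault", "pgmajfault" };

        for (int i = 0; i < 2; i++)
            printf("%-12s vmstat=%lld memcg=%lld\n", keys[i],
                   read_stat("/proc/vmstat", keys[i]),
                   read_stat("/dev/cgroup/memory.stat", keys[i]));
        return 0;
    }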

    Performance test: run the page fault test (pft) with 16 threads, faulting
    in 15G of anon pages in a 16G container. No regression was noticed in
    "flt/cpu/s".

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb  Thr  CLine  User   System   Wall    flt/cpu/s  fault/wsec
    15  16   1      0.67s  233.41s  14.76s  16798.546  266356.260

    +-------------------------------------------------------------------------+
        N  Min        Max        Median     Avg        Stddev
    x  10  16682.962  17344.027  16913.524  16928.812  166.5362
    +  10  16695.568  16923.896  16820.604  16824.652   84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The new API exports numa_maps-style information on a per-memcg basis. It
    is a useful piece of information because it shows the per-memcg page
    distribution across real NUMA nodes.

    One of the use cases is evaluating application performance by combining
    this information with the CPU allocation of the application.

    The output of memory.numa_stat follows a format similar to that of
    numa_maps:

    total= N0= N1= ...
    file= N0= N1= ...
    anon= N0= N1= ...
    unevictable= N0= N1= ...

    And we have per-node:

    total = file + anon + unevictable

    $ cat /dev/cgroup/memory/memory.numa_stat
    total=250020 N0=87620 N1=52367 N2=45298 N3=64735
    file=225232 N0=83402 N1=46160 N2=40522 N3=55148
    anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
    unevictable=3735 N0=794 N1=0 N2=0 N3=2941
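
    As a quick sanity check of the per-node invariant in the sample above: for
    the whole memcg, 225232 + 21053 + 3735 = 250020, and the same holds per
    node, e.g. N0: 83402 + 3424 + 794 = 87620 and N3: 55148 + 6646 + 2941 =
    64735.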

    Signed-off-by: Ying Han
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The caller of the function has been renamed to zone_nr_lru_pages(), and
    this just fixes up the memcg code to match. The current name is easily
    mis-read as the zone's total number of pages.

    Signed-off-by: Ying Han
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • If the memcg reclaim code detects that the target memcg is below its
    limit, it exits and returns a guaranteed non-zero value so that the charge
    is retried.

    Nowadays, the charge side checks the memcg limit itself and does not rely
    on this non-zero return value trick.

    This patch removes it. The reclaim code will now always return the true
    number of pages it reclaimed on its own.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Presently, memory cgroup's direct reclaim frees memory from the current
    node. But this has some problems. Usually, when a set of threads works in
    a cooperative way, they tend to operate on the same node. So if they hit
    the limit under memcg they will reclaim memory from themselves, damaging
    the active working set.

    For example, assume a 2-node system with Node 0 and Node 1, and a memcg
    with a 1G limit. After some work, file cache remains and the usage is:

    Node 0: 1M
    Node 1: 998M

    If an application then runs on Node 0, it will eat its own foot before
    freeing unnecessary file caches.

    This patch adds round-robin for NUMA and adds equal pressure to each node.
    When using cpuset's spread memory feature, this will work very well.

    But yes, a better algorithm is needed.
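
    A self-contained userspace sketch of the round-robin idea (illustrative
    only, not the kernel implementation; the node count and state are made
    up): each pass picks the node after the last one scanned, so reclaim
    pressure is spread across nodes instead of always hitting the current
    node.

    #include <stdio.h>

    #define NR_NODES 2

    static int node_has_memory[NR_NODES] = { 1, 1 };
    static int last_scanned_node = -1;   /* per-memcg state in the real code */

    /* pick the next node that has memory, round-robin */
    static int select_victim_node(void)
    {
        int i, node;

        for (i = 1; i <= NR_NODES; i++) {
            node = (last_scanned_node + i) % NR_NODES;
            if (node_has_memory[node]) {
                last_scanned_node = node;
                return node;
            }
        }
        return 0;                        /* fall back to node 0 */
    }

    int main(void)
    {
        for (int pass = 0; pass < 6; pass++)
            printf("reclaim pass %d -> Node %d\n", pass, select_victim_node());
        return 0;
    }

    With the 2-node example above, successive passes alternate between Node 0
    and Node 1 rather than repeatedly reclaiming from Node 0.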

    [akpm@linux-foundation.org: comment editing]
    [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
    Signed-off-by: Ying Han
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node
    selects the same mz. This doesn't make much sense as we assign to the
    variable right in the next loop.

    Compiler will probably optimize this out but it is little bit confusing
    for the code reading.

    Signed-off-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The global kswapd scans per-zone LRU and reclaims pages regardless of the
    cgroup. It breaks memory isolation since one cgroup can end up reclaiming
    pages from another cgroup. Instead we should rely on memcg-aware target
    reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
    memory pressure.

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored. This patch is the first
    step to skip shrink_zone() if soft_limit reclaim does enough work.

    This is part of the effort to reduce reclaiming pages from the global LRU
    in memcg. The per-memcg background reclaim patchset further enhances
    per-cgroup targeted reclaim, for which I should have V4 posted shortly.

    Try running multiple memory-intensive workloads within separate memcgs.
    Watch the counters of soft_steal in memory.stat.

    $ cat /dev/cgroup/A/memory.stat | grep 'soft'
    soft_steal 240000
    soft_scan 240000
    total_soft_steal 240000
    total_soft_scan 240000

    This patch:

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored.

    We would like to skip shrink_zone() if soft_limit reclaim does enough
    work. Also, we need to balance memory pressure across per-memcg zones,
    like the logic in the VM core. This patch is the first step: we start by
    counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the
    global scan_control.
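
    A rough, self-contained sketch of the accounting idea (field and function
    names are simplified stand-ins, not the kernel code): soft limit reclaim
    runs first, its scanned/reclaimed counts feed into the scan control, and
    the per-zone shrink can then be skipped once the target is already met.

    #include <stdio.h>

    struct scan_control {
        unsigned long nr_to_reclaim;
        unsigned long nr_scanned;
        unsigned long nr_reclaimed;
    };

    /* stand-in for soft limit reclaim: report pages scanned and reclaimed */
    static unsigned long soft_limit_reclaim(unsigned long *scanned)
    {
        *scanned = 48;
        return 32;
    }

    static void shrink_zone(struct scan_control *sc)
    {
        sc->nr_scanned += 128;
        sc->nr_reclaimed += 16;
    }

    int main(void)
    {
        struct scan_control sc = { .nr_to_reclaim = 32 };
        unsigned long scanned;

        /* count soft limit reclaim into the global scan_control */
        sc.nr_reclaimed += soft_limit_reclaim(&scanned);
        sc.nr_scanned += scanned;

        /* a later patch can skip this when soft reclaim did enough */
        if (sc.nr_reclaimed < sc.nr_to_reclaim)
            shrink_zone(&sc);

        printf("scanned=%lu reclaimed=%lu\n", sc.nr_scanned, sc.nr_reclaimed);
        return 0;
    }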

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, replaced by these.
    All subsystems are modified for the new interface - of note is cpuset,
    which requires from/to nodemasks for attach to be globally scoped (though
    per-cpuset would work too) to persist from its pre_attach to attach_task
    and attach.

    This is a pre-patch for cgroup-procs-writable.patch.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     

25 May, 2011

1 commit

  • The noswapaccount parameter has been deprecated since 2.6.38 without any
    complaints from users so we can remove it. swapaccount=0|1 can be used
    instead.

    As we are removing the parameter we can also clean up swapaccount: it no
    longer has to accept an empty string (to match noswapaccount), so we can
    push '=' into the __setup macro rather than checking for "=1" resp. "=0"
    strings.

    Signed-off-by: Michal Hocko
    Cc: Hiroyuki Kamezawa
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 Mar, 2011

22 commits

  • fs/fuse/dev.c::fuse_try_move_page() does

    (1) remove a page by ->steal()
    (2) re-add the page to page cache
    (3) link the page to LRU if it was not on LRU at (1)

    This implies the page is _on_ the LRU when it's added to the radix-tree.
    So the page is added to the memory cgroup while it's on the LRU, because
    the LRU is lazy and no one flushes it.

    This is the same behavior as SwapCache and needs the same special care:
    - remove the page from the LRU before overwriting pc->mem_cgroup.
    - add the page to the LRU after overwriting pc->mem_cgroup.

    And we need to take care of the pagevec.

    If PageLRU(page) is set before we add the PCG_USED bit, the page will not
    be added to memcg's LRU (for a short period). So, regardless of the
    PageLRU(page) value before commit_charge(), we need to check PageLRU(page)
    after commit_charge().

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Miklos Szeredi
    Cc: Balbir Singh
    Reported-by: Daniel Poelzleithner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mm/memcontrol.c: In function 'mem_cgroup_force_empty':
    mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

    It's a false positive.

    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The statistic counters are in units of pages; there is no reason to make
    them 64-bit wide on 32-bit machines.

    Make them native words. Since they are signed, this leaves 31 bits on
    32-bit machines, which can represent roughly 8TB assuming a page size of
    4k.
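
    For concreteness: a signed native word on a 32-bit machine leaves 31 bits
    for the count, and 2^31 pages * 4 KiB/page = 2^43 bytes = 8 TiB.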

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For increasing and decreasing per-cpu cgroup usage counters it makes sense
    to use signed types, as single per-cpu values might go negative during
    updates. But this is not the case for only-ever-increasing event
    counters.

    All the counters have been signed 64-bit so far, which was enough to count
    events even with the sign bit wasted.

    This patch:
    - divides s64 counters into signed usage counters and unsigned
    monotonically increasing event counters.
    - converts unsigned event counters into 'unsigned long' rather than
    'u64'. This matches the type used by the /proc/vmstat event counters.

    The next patch narrows the signed usage counters type (on 32-bit CPUs,
    that is).

    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no clear pattern for when we pass a page count and when we pass a
    byte count that is a multiple of PAGE_SIZE.

    We never charge or uncharge subpage quantities, so convert it all to page
    counts.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never uncharge subpage quantities.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never keep subpage quantities in the per-cpu stock.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We have two charge cancelling functions: one takes a page count, the other
    a page size. The second one just divides the parameter by PAGE_SIZE and
    then calls the first one. This is trivial, no need for an extra function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The reclaim_param_lock is only taken around single reads and writes to
    integer variables and is thus superfluous. Drop it.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_cgroup_zoneinfo() will never return NULL for a charged page, remove
    the check for it in mem_cgroup_get_reclaim_stat_from_page().

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In struct page_cgroup, we have a full word for flags but only a few are
    reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer.

    This saves a full word for every page in a system with memory cgroups
    enabled.
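
    A self-contained sketch of the encoding trick (the macro names and the
    10-bit width are illustrative assumptions, not the kernel's values): the
    node id lives in the topmost bits of the flags word, and the low bits stay
    available for the actual flag bits.

    #include <assert.h>
    #include <stdio.h>

    #define NID_BITS   10
    #define NID_SHIFT  (sizeof(unsigned long) * 8 - NID_BITS)
    #define FLAGS_MASK ((1UL << NID_SHIFT) - 1)

    static unsigned long set_nid(unsigned long flags, unsigned int nid)
    {
        return (flags & FLAGS_MASK) | ((unsigned long)nid << NID_SHIFT);
    }

    static unsigned int get_nid(unsigned long flags)
    {
        return flags >> NID_SHIFT;
    }

    int main(void)
    {
        unsigned long flags = 0x5;             /* some low flag bits set */

        flags = set_nid(flags, 3);             /* page lives on node 3 */
        assert(get_nid(flags) == 3);
        assert((flags & FLAGS_MASK) == 0x5);   /* flag bits untouched */
        printf("flags=%#lx nid=%u\n", flags, get_nid(flags));
        return 0;
    }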

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The per-cgroup LRU lists string up 'struct page_cgroup's. To get from
    those structures to the page they represent, a lookup is required.
    Currently, the lookup is done through a direct pointer in struct
    page_cgroup, so a lot of functions down the callchain do this lookup by
    themselves instead of receiving the page pointer from their callers.

    The next patch removes this pointer, however, and the lookup is no longer
    that straight-forward. In preparation for that, this patch only leaves
    the non-optional lookups when coming directly from the LRU list and passes
    the page down the stack.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It is one logical function, no need to have it split up.

    Also, get rid of some checks from the inner function that ensured the
    sanity of the outer function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a whole struct page_cgroup to this function, let it
    take only what it really needs from it: the struct mem_cgroup and the
    page.

    This has the advantage that reading pc->mem_cgroup is now done at the same
    place where the ordering rules for this pointer are enforced and
    explained.

    It is also in preparation for removing the pc->page backpointer.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This patch series removes the direct page pointer from struct page_cgroup,
    which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu
    enable memcg by default, openSUSE apparently too).
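
    A rough back-of-the-envelope check of the 20% figure, assuming the 64-bit
    layout at the time was a flags word, a mem_cgroup pointer, a page pointer
    and an lru list_head: that is 5 words (40 bytes) per page, so dropping the
    page pointer saves 8/40 = 20%.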

    The node id or section number is encoded in the remaining free bits of
    pc->flags which allows calculating the corresponding page without the
    extra pointer.

    I ran what I think is a worst-case microbenchmark that just cats a large
    sparse file to /dev/null, because it means that walking the LRU list on
    behalf of per-cgroup reclaim and looking up pages from page_cgroups is
    happening constantly and at a high rate. But it made no measurable
    difference. A profile reported a 0.11% share of the new
    lookup_cgroup_page() function in this benchmark.

    This patch:

    All callsites check PCG_USED before passing pc->mem_cgroup, so the latter
    is never NULL.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add checks, when allocating or freeing a page, of whether the page is in
    use (i.e. charged) from the point of view of memcg.

    This check may be useful in debugging a problem; we did similar checks
    before commit 52d4b9ac ("memcg: allocate all page_cgroup at boot").

    This patch adds some overheads at allocating or freeing memory, so it's
    enabled only when CONFIG_DEBUG_VM is enabled.

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The page_cgroup array is set up before even fork is initialized. I
    seriously doubt that this code executes before the array is alloc'd.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No callsite ever passes a NULL pointer for a struct mem_cgroup * to the
    committing function. There is no need to check for it.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These definitions have been unused since commit 4b3bde4 ("memcg: remove
    the overhead associated with the root cgroup").

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since the introduction of transparent huge pages, checking whether memory
    cgroups are below their limits is no longer enough; what matters is the
    actual amount of chargeable space.

    To not have more than one limit-checking interface, replace
    memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
    single memory_cgroup_margin() that returns the chargeable space and leaves
    the comparison to the callsite.

    Soft limits are now checked the other way round, by using the already
    existing function that returns the amount by which soft limits are
    exceeded: res_counter_soft_limit_excess().

    Also remove all the corresponding functions on the res_counter side that
    are now no longer used.
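
    A tiny self-contained sketch of the idea (names are illustrative, not the
    kernel's): the margin is the remaining chargeable space, and the caller
    compares it against the size it actually wants to charge, which matters
    once huge pages are charged.

    #include <stdio.h>

    /* remaining chargeable space, in pages */
    static unsigned long margin(unsigned long limit, unsigned long usage)
    {
        return usage < limit ? limit - usage : 0;
    }

    int main(void)
    {
        unsigned long limit = 262144, usage = 261800;  /* 1G limit, in pages */
        unsigned long huge = 512;                      /* 2M THP in 4K pages */

        printf("under limit:    %s\n", usage < limit ? "yes" : "no");
        printf("room for a THP: %s\n",
               margin(limit, usage) >= huge ? "yes" : "no");
        return 0;
    }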

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Soft limit reclaim continues until the usage is below the current soft
    limit, but the documented semantics are actually that soft limit reclaim
    will push usage back until the soft limits are met again.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Remove the initialization of a variable in the caller of a memory cgroup
    function. It is actually the return value of the memcg function, but it
    is initialized in the caller.

    Some memory cgroup code uses the following style to carry the result of
    the start function through to the end function, to avoid races:

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr had to be initialized to NULL by the caller, which is
    ugly. This patch makes the _start function initialize *ptr instead.
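
    A small sketch of the pattern after the change (hypothetical names,
    following the pseudo-code above): the _start function clears *ptr itself,
    so the caller no longer has to.

    #include <stdio.h>

    struct mem_cgroup { int id; };

    static void mem_cgroup_start_A(struct mem_cgroup **ptr)
    {
        *ptr = NULL;                 /* initialized here, not by the caller */
        /* ... may point *ptr at a memcg on success ... */
    }

    static void mem_cgroup_end_A(struct mem_cgroup *ptr)
    {
        if (ptr)                     /* safe even when _start left it NULL */
            printf("finish with memcg %d\n", ptr->id);
    }

    int main(void)
    {
        struct mem_cgroup *memcg;    /* no "= NULL" needed in the caller */

        mem_cgroup_start_A(&memcg);
        /* Something very complicated can happen here. */
        mem_cgroup_end_A(memcg);
        return 0;
    }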

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Mar, 2011

3 commits

  • Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
    unconditionally split any transparent huge pages it runs into. In
    practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The rotate_reclaimable_page function moves just written out pages, which
    the VM wanted to reclaim, to the end of the inactive list. That way the
    VM will find those pages first next time it needs to free memory.

    This patch applies the rule in memcg. It can help prevent unnecessary
    eviction of a memcg's working pages.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

03 Feb, 2011

5 commits

  • Changes in e401f1761 ("memcg: modify accounting function for supporting
    THP better") added nr_pages to support multiple page sizes in
    memory_cgroup_charge_statistics.

    But counting the number of events needs abs(nr_pages) when increasing the
    counters. This patch fixes the event counting.
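
    For example (an illustration assuming the usual 4K base page size):
    uncharging a 2MB THP passes nr_pages = -512; the usage counters correctly
    go down by 512 pages, but the event counter must still go up, so it has to
    be increased by abs(nr_pages) = 512 rather than by the signed value.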

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Huge page coverage should obviously have less priority than the continued
    execution of a process.

    Never kill a process when charging it a huge page fails. Instead, give up
    after the first failed reclaim attempt and fall back to regular pages.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If reclaim after a failed charging was unsuccessful, the limits are
    checked again, just in case they settled by means of other tasks.

    This is all fine as long as every charge is of size PAGE_SIZE, because in
    that case, being below the limit means having at least PAGE_SIZE bytes
    available.

    But with transparent huge pages, we may end up in an endless loop where
    charging and reclaim fail, but we keep going because the limits are not
    yet exceeded, although not allowing for a huge page.

    Fix this up by explicitly checking for enough room, not just whether we
    are within limits.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The charging code can encounter a charge size that is bigger than a
    regular page in two situations: one is a batched charge to fill the
    per-cpu stocks, the other is a huge page charge.

    This code is distributed over two functions, however, and only the outer
    one is aware of huge pages. In case the charging fails, the inner
    function will tell the outer function to retry if the charge size is
    bigger than regular pages--assuming batched charging is the only case.
    And the outer function will retry forever charging a huge page.

    This patch makes sure the inner function can distinguish between batch
    charging and a single huge page charge. It will only signal another
    attempt if batch charging failed, and go into regular reclaim when it is
    called on behalf of a huge page.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • noswapaccount couldn't be used to control memsw for both the on and off
    cases, so we have added the swapaccount[=0|1] parameter. This way we can
    turn the feature off in two ways: noswapaccount resp. swapaccount=0. We
    have kept the original noswapaccount, but I think we should remove it
    after some time, as it just adds another command line parameter without
    any advantage, and the parameter-handling code is uglier if we want both
    parameters.

    Signed-off-by: Michal Hocko
    Requested-by: KAMEZAWA Hiroyuki
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko