24 Mar, 2011

40 commits

  • oldcs->mems_allowed is not modified during cpuset_attach(), so we don't
    have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it
    to cpuset_migrate_mm().

    Signed-off-by: Li Zefan
    Cc: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • It's not necessary to copy cpuset->mems_allowed to a buffer allocated by
    NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf().

    As spotted by Paul, a side effect is that this fixes a bug where the
    function could return -ENOMEM even though the caller does not expect a
    negative return value. Therefore, change the return value of
    cpuset_sprintf_cpulist() and cpuset_sprintf_memlist() from int to size_t.
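
    A minimal sketch of what the reworked helper might look like after this
    change (illustrative only, not necessarily the exact upstream code;
    callback_mutex protects the mask):

      static size_t cpuset_sprintf_memlist(char *page, struct cpuset *cs)
      {
              size_t count;

              mutex_lock(&callback_mutex);
              /* print the nodemask directly, no NODEMASK_ALLOC() copy needed */
              count = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
              mutex_unlock(&callback_mutex);

              return count;
      }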

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • When a memcg is oom and current has already received a SIGKILL, then give
    it access to memory reserves with a higher scheduling priority so that it
    may quickly exit and free its memory.

    This is identical to the global oom killer and is done even before
    checking for panic_on_oom: a pending SIGKILL here while panic_on_oom is
    selected is guaranteed to have come from userspace; the thread only needs
    access to memory reserves to exit, and thus we don't unnecessarily panic
    the machine while the kernel still has a last resort for freeing memory.
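
    Sketched, the check described above amounts to something like the
    following, placed early in the memcg OOM path (illustrative fragment, not
    necessarily verbatim upstream code):

      if (fatal_signal_pending(current)) {
              /* already dying: let it dip into memory reserves and exit fast */
              set_thread_flag(TIF_MEMDIE);
              return;
      }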

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • fs/fuse/dev.c::fuse_try_move_page() does

    (1) remove a page by ->steal()
    (2) re-add the page to page cache
    (3) link the page to LRU if it was not on LRU at (1)

    This implies the page is _on_ the LRU when it's added to the radix-tree.
    So the page is added to the memory cgroup while it's on the LRU, because
    the LRU is lazy and no one flushes it.

    This is the same behavior as SwapCache and needs special care:
    - remove the page from the LRU before overwriting pc->mem_cgroup.
    - add the page to the LRU after overwriting pc->mem_cgroup.

    And we need to take care of the pagevec.

    If PageLRU(page) is set before we add the PCG_USED bit, the page will not
    be added to the memcg's LRU (for a short period). So, regardless of the
    PageLRU(page) value before commit_charge(), we need to check PageLRU(page)
    after commit_charge().

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Miklos Szeredi
    Cc: Balbir Singh
    Reported-by: Daniel Poelzleithner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • KAMEZAWA Hiroyuki noted that free_pages_cgroup doesn't have to check for
    PageReserved because we never store the array on reserved pages (neither
    alloc_pages_exact nor vmalloc use those pages).

    So we can replace the check by a BUG_ON.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we are allocating a single page_cgroup array per memory section
    (stored in mem_section->base) when CONFIG_SPARSEMEM is selected. This is
    a correct but memory-inefficient solution because the allocated memory
    (unless we fall back to vmalloc) is not kmalloc friendly:

    - 32b - 16384 entries (20B per entry) fit into 327680B so the
      524288B slab cache is used
    - 32b with PAE - 131072 entries with 2621440B fit into 4194304B
    - 64b - 32768 entries (40B per entry) fit into the 2097152B slab cache

    This is ~37% wasted space per memory section, and it adds up across the
    whole of memory. On an x86_64 machine it is something like 6MB per 1GB of
    RAM.
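
    The numbers above can be reproduced with a quick back-of-the-envelope
    calculation (standalone sketch, not kernel code), using the 64-bit case of
    32768 entries of 40B per 128MB section:

      #include <stdio.h>

      int main(void)
      {
              unsigned long needed = 32768UL * 40; /* 1310720B actually used */
              unsigned long slab   = 2097152UL;    /* power-of-two slab object */

              printf("wasted per section: %lu bytes (%.1f%%)\n",
                     slab - needed, 100.0 * (slab - needed) / slab);
              /* 1GB of RAM is 8 sections of 128MB */
              printf("wasted per 1GB: %lu bytes\n", 8 * (slab - needed));
              return 0;
      }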

    We can reduce the internal fragmentation by using alloc_pages_exact which
    allocates PAGE_SIZE aligned blocks so we will get down to
    Cc: Dave Hansen
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mm/memcontrol.c: In function 'mem_cgroup_force_empty':
    mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function

    It's a false positive.

    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The statistic counters are in units of pages, there is no reason to make
    them 64-bit wide on 32-bit machines.

    Make them native words. Since they are signed, this leaves 31 bit on
    32-bit machines, which can represent roughly 8TB assuming a page size of
    4k.
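
    A quick sanity check of that figure (standalone arithmetic, not kernel
    code):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long pages = 1ULL << 31; /* positive range of a 32-bit signed word */
              unsigned long long bytes = pages << 12; /* 4k pages */

              printf("%llu TiB\n", bytes >> 40); /* prints 8 */
              return 0;
      }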

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For increasing and decreasing per-cpu cgroup usage counters it makes sense
    to use signed types, as single per-cpu values might go negative during
    updates. But this is not the case for only-ever-increasing event
    counters.

    All the counters have been signed 64-bit so far, which was enough to count
    events even with the sign bit wasted.

    This patch:
    - divides s64 counters into signed usage counters and unsigned
      monotonically increasing event counters.
    - converts unsigned event counters into 'unsigned long' rather than
      'u64'. This matches the type used by the /proc/vmstat event counters.

    The next patch narrows the signed usage counters type (on 32-bit CPUs,
    that is).
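
    A simplified sketch of the split introduced by this patch (names and array
    sizes are illustrative, not the actual memcg structures; s64 is the
    kernel's signed 64-bit type):

      enum { SKETCH_NR_USAGE = 2, SKETCH_NR_EVENTS = 2 };

      struct memcg_percpu_sketch {
              /* usage: per-cpu deltas may transiently go negative, keep signed */
              s64           count[SKETCH_NR_USAGE];
              /* events: only ever increase, same type as /proc/vmstat counters */
              unsigned long events[SKETCH_NR_EVENTS];
      };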

    Signed-off-by: Johannes Weiner
    Signed-off-by: Greg Thelen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no clear pattern when we pass a page count and when we pass a
    byte count that is a multiple of PAGE_SIZE.

    We never charge or uncharge subpage quantities, so convert it all to page
    counts.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never uncharge subpage quantities.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We never keep subpage quantities in the per-cpu stock.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We have two charge cancelling functions: one takes a page count, the other
    a page size. The second one just divides the parameter by PAGE_SIZE and
    then calls the first one. This is trivial, no need for an extra function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The reclaim_param_lock is only taken around single reads and writes to
    integer variables and is thus superfluous. Drop it.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_cgroup_zoneinfo() will never return NULL for a charged page, remove
    the check for it in mem_cgroup_get_reclaim_stat_from_page().

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In struct page_cgroup, we have a full word for flags but only a few are
    reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer.

    This saves a full word for every page in a system with memory cgroups
    enabled.
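
    A self-contained sketch of the encoding (bit positions and helper names
    are illustrative, not the actual page_cgroup code):

      #define SKETCH_FLAG_BITS 8 /* low bits stay reserved for the PCG_* flags */

      static void sketch_set_nid(unsigned long *flags, unsigned long nid)
      {
              *flags &= (1UL << SKETCH_FLAG_BITS) - 1; /* keep the flag bits */
              *flags |= nid << SKETCH_FLAG_BITS;       /* node id in the upper bits */
      }

      static unsigned long sketch_get_nid(unsigned long flags)
      {
              return flags >> SKETCH_FLAG_BITS;
      }

    With the node (or section) recoverable from pc->flags, the page itself can
    be recomputed from the page_cgroup's index in the per-node (or per-section)
    array, so the direct page pointer becomes unnecessary.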

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The per-cgroup LRU lists string together 'struct page_cgroup's. To get
    from those structures to the page they represent, a lookup is required.
    Currently, the lookup is done through a direct pointer in struct
    page_cgroup, so a lot of functions down the callchain do this lookup by
    themselves instead of receiving the page pointer from their callers.

    The next patch removes this pointer, however, and the lookup is no longer
    that straightforward. In preparation for that, this patch only leaves
    the non-optional lookups when coming directly from the LRU list and passes
    the page down the stack.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It is one logical function, no need to have it split up.

    Also, get rid of some checks from the inner function that ensured the
    sanity of the outer function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a whole struct page_cgroup to this function, let it
    take only what it really needs from it: the struct mem_cgroup and the
    page.

    This has the advantage that reading pc->mem_cgroup is now done at the same
    place where the ordering rules for this pointer are enforced and
    explained.

    It is also in preparation for removing the pc->page backpointer.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This patch series removes the direct page pointer from struct page_cgroup,
    which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu
    enable memcg by default, openSUSE apparently too).

    The node id or section number is encoded in the remaining free bits of
    pc->flags which allows calculating the corresponding page without the
    extra pointer.

    I ran what I think is a worst-case microbenchmark that just cats a large
    sparse file to /dev/null, because it means that walking the LRU list on
    behalf of per-cgroup reclaim and looking up pages from page_cgroups is
    happening constantly and at a high rate. But it made no measurable
    difference. A profile reported a 0.11% share of the new
    lookup_cgroup_page() function in this benchmark.

    This patch:

    All callsites check PCG_USED before passing pc->mem_cgroup, so the latter
    is never NULL.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Add checks, when allocating or freeing a page, of whether the page is used
    (i.e. charged) from the viewpoint of memcg.

    This check may be useful in debugging a problem, and we did similar checks
    before commit 52d4b9ac ("memcg: allocate all page_cgroup at boot").

    This patch adds some overhead when allocating or freeing memory, so the
    checks are enabled only when CONFIG_DEBUG_VM is enabled.

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The page_cgroup array is set up before even fork is initialized. I
    seriously doubt that this code executes before the array is alloc'd.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No callsite ever passes a NULL pointer for a struct mem_cgroup * to the
    committing function. There is no need to check for it.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These definitions have been unused since '4b3bde4 memcg: remove the
    overhead associated with the root cgroup'.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Since the introduction of transparent huge pages, checking whether memory
    cgroups are below their limits is no longer enough; the actual amount of
    chargeable space is what matters.

    To not have more than one limit-checking interface, replace
    memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
    single memory_cgroup_margin() that returns the chargeable space and leaves
    the comparison to the callsite.
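
    A sketch of the new semantics (simplified types, not the real res_counter
    code): instead of answering "am I under the limit?", return how much can
    still be charged and let the caller compare against what it needs, e.g. a
    whole huge page.

      struct counter_sketch {
              unsigned long long usage;
              unsigned long long limit;
      };

      /* chargeable space left; 0 when at or over the limit */
      static unsigned long long margin_sketch(const struct counter_sketch *c)
      {
              if (c->usage >= c->limit)
                      return 0;
              return c->limit - c->usage;
      }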

    Soft limits are now checked the other way round, by using the already
    existing function that returns the amount by which soft limits are
    exceeded: res_counter_soft_limit_excess().

    Also remove all the corresponding functions on the res_counter side that
    are now no longer used.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Soft limit reclaim continues until the usage is below the current soft
    limit, but the documented semantics are actually that soft limit reclaim
    will push usage back until the soft limits are met again.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Remove the initialization of a variable in the caller of a memory cgroup
    function. It is actually the return value of the memcg function, but it is
    initialized in the caller.

    Some memory cgroup code uses the following style to bring the result of
    the start function to the end function, in order to avoid races:

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr should be initialized to NULL by the caller, but that
    is ugly. This patch changes it so that *ptr is initialized by the _start
    function.
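
    A sketch of the pattern after the change, reusing the placeholder names
    from above (hypothetical code, not the actual memcg functions):

      static void mem_cgroup_start_A(struct mem_cgroup **ptr)
      {
              *ptr = NULL; /* initialized here, not by every caller */
              /* ... look up and pin the mem_cgroup, possibly leaving *ptr NULL ... */
      }

      static void mem_cgroup_end_A(struct mem_cgroup *ptr)
      {
              if (!ptr)
                      return; /* _start set nothing up */
              /* ... unpin ... */
      }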

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    res_counter_read_u64() reads a u64 value without taking the lock. That is
    dangerous in a 32-bit environment. Add locking.
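
    On a 32-bit machine a 64-bit load is performed as two 32-bit loads, so a
    concurrent update can be observed half old, half new. A hedged sketch of
    the locked read (simplified, not the exact res_counter.c change):

      static u64 read_u64_sketch(struct res_counter *counter, u64 *member)
      {
      #if BITS_PER_LONG == 32
              unsigned long flags;
              u64 ret;

              spin_lock_irqsave(&counter->lock, flags); /* serialize with writers */
              ret = *member;
              spin_unlock_irqrestore(&counter->lock, flags);
              return ret;
      #else
              return *member; /* a single 64-bit load is atomic here */
      #endif
      }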

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    The minix bit operations are only used by the minix filesystem and are
    useless to other modules, because the byte order of the inode and block
    bitmaps differs between architectures, as shown below:

    m68k:
    big-endian 16bit indexed bitmaps

    h8300, microblaze, s390, sparc, m68knommu:
    big-endian 32 or 64bit indexed bitmaps

    m32r, mips, sh, xtensa:
    big-endian 32 or 64bit indexed bitmaps for big-endian mode
    little-endian bitmaps for little-endian mode

    Others:
    little-endian bitmaps

    In order to move the minix bit operations from asm/bitops.h to
    architecture-independent code in the minix filesystem, this provides two
    config options.

    CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
    CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
    native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
    m32r, mips, sh, xtensa). The architectures which always use little-endian
    bitmaps do not select these options.

    Finally, we can remove minix bit operations from asm/bitops.h for all
    architectures.
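
    To see why the on-disk bitmaps are incompatible, compare a plain
    little-endian test_bit with the m68k-style big-endian 16-bit indexed one
    (standalone sketch, not the kernel implementations):

      /* little-endian bitmap: bit nr is bit nr%8 of byte nr/8 */
      static int test_bit_le_sketch(int nr, const unsigned char *addr)
      {
              return (addr[nr >> 3] >> (nr & 7)) & 1;
      }

      /* m68k minix bitmap: bit nr is bit nr%16 of big-endian 16-bit word nr/16 */
      static int test_bit_be16_sketch(int nr, const unsigned char *addr)
      {
              int byte = (nr >> 4) * 2 + 1 - ((nr >> 3) & 1); /* high byte first */

              return (addr[byte] >> (nr & 7)) & 1;
      }

    For nr = 0 the little-endian variant looks at byte 0 while the m68k
    variant looks at byte 1, so one generic helper cannot serve both without
    knowing which convention the filesystem uses.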

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Acked-by: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Andreas Schwab
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Michal Simek
    Cc: "David S. Miller"
    Cc: Hirokazu Takata
    Acked-by: Ralf Baechle
    Acked-by: Paul Mundt
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for moving the minix bit operations from asm/bitops.h to
    architecture-independent code in the minix filesystem, this removes the
    inline asm from minix_find_first_zero_bit() for m68k.

    Signed-off-by: Akinobu Mita
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Andreas Schwab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a result of the conversions, there are no users of the ext2 non-atomic
    bit operations except for the ext2 filesystem itself. Now we can put them
    into architecture-independent code in the ext2 filesystem, and remove them
    from asm/bitops.h for all architectures.

    Signed-off-by: Akinobu Mita
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.
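
    Roughly, calls like ext2_set_bit(nr, addr) become __set_bit_le(nr, addr)
    (the exact call sites vary per filesystem, so treat this as a sketch).
    What the little-endian variant pins down, as standalone code:

      /* bit nr is always bit nr%8 of byte nr/8, whatever the host endianness */
      static void set_bit_le_sketch(int nr, unsigned char *addr)
      {
              addr[nr >> 3] |= 1u << (nr & 7);
      }

    This matches the on-disk bitmap layout that ext2-style filesystems use, so
    the same code works on both little-endian and big-endian machines.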

    Signed-off-by: Akinobu Mita
    Cc: Alasdair Kergon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Cc: Evgeniy Dushistov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: "Theodore Ts'o"
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
    As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Jan Kara
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita