03 Apr, 2009

16 commits

  • The current mem_cgroup_cache_charge() is a bit complicated, especially
    in the case of shmem's swap-in.

    This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
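
    As a rough sketch of the resulting two-phase pattern (the exact signatures
    here are assumptions, not copied from the patch), the caller first reserves
    the charge and only commits it once the page is safely in place:

    struct mem_cgroup *memcg = NULL;

    if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &memcg))
            goto out_fail;                          /* reservation failed */
    /* ... install the page (pte / swap cache) as needed ... */
    mem_cgroup_commit_charge_swapin(page, memcg);   /* make the charge final */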

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • It has been pointed out that swap_cgroup's message at swapon() is
    unnecessary, because

    * it can be calculated very easily if all the necessary information is
    written in Kconfig, and

    * there is no need to annoy people at every swapon().

    Besides, memory usage per swp_entry is now reduced from 8 bytes (64-bit)
    to 2 bytes, which I think is reasonably small.

    Reported-by: Hugh Dickins
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Use the CSS ID for records in swap_cgroup. With this, on a 64-bit machine,
    the size of a swap_cgroup record goes down from 8 bytes to 2 bytes.

    This means that, when 2GB of swap is equipped (assuming a 4096-byte page size),

    the size of swap_cgroup goes from 2G/4k * 8 = 4MB
                                   to 2G/4k * 2 = 1MB.

    The reduction is large. Of course, there are trade-offs: the CSS ID lookup
    adds overhead to swap-in/swap-out/swap-free.

    But in general,
    - swap is a resource which users tend to avoid using.
    - If swap is never used, the swap_cgroup area is not used.
    - Traditional manuals say the size of swap should be proportional to the
      size of memory, and machine memory sizes keep increasing.

    I think reducing the size of swap_cgroup makes sense.
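
    A minimal sketch of the shrunken record, assuming the per-entry structure
    simply stores the CSS ID of the owning memcg:

    /* Assumed layout: one record per swp_entry, indexed by swap offset. */
    struct swap_cgroup {
            unsigned short id;      /* CSS ID of the memcg owning this entry */
    };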

    Note:
    - The ID->CSS lookup routine takes no locks; it runs under the RCU read side.
    - A memcg can become obsolete at rmdir() but is not freed while a refcount
      from swap_cgroup remains.

    Changelog v4->v5:
    - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
    Changelog ->v4:
    - fixed not configured case.
    - deleted unnecessary comments.
    - fixed NULL pointer bug.
    - fixed message in dmesg.

    [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_test.txt says at 4.1:

    This swap-in is one of the most complicated cases. In do_swap_page(),
    the following events occur when the pte is unchanged.

    (1) the page (SwapCache) is looked up.
    (2) lock_page()
    (3) try_charge_swapin()
    (4) reuse_swap_page() (may call delete_swap_cache())
    (5) commit_charge_swapin()
    (6) swap_free().

    Consider the following situations, for example.

    (A) The page has not been charged before (2) and reuse_swap_page()
    doesn't call delete_from_swap_cache().
    (B) The page has not been charged before (2) and reuse_swap_page()
    calls delete_from_swap_cache().
    (C) The page has been charged before (2) and reuse_swap_page() doesn't
    call delete_from_swap_cache().
    (D) The page has been charged before (2) and reuse_swap_page() calls
    delete_from_swap_cache().

    memory.usage/memsw.usage changes to this page/swp_entry will be:

    Case          (A)      (B)      (C)      (D)
    Event
    Before (2)    0/ 1     0/ 1     1/ 1     1/ 1
    ===========================================
    (3)          +1/+1    +1/+1    +1/+1    +1/+1
    (4)            -       0/ 0      -      -1/ 0
    (5)           0/-1     0/ 0    -1/-1     0/ 0
    (6)            -       0/-1      -       0/-1
    ===========================================
    Result        1/ 1     1/ 1     1/ 1     1/ 1

    In all cases, the charges for this page should end up as 1/ 1.

    In case (D), mem_cgroup_try_get_from_swapcache() returns NULL
    (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
    a charge to the memcg ("foo") to which "current" belongs.
    OTOH, "-1/0" at (4) and "0/-1" at (6) mean uncharges from the memcg ("baa")
    to which the page had been charged.

    So, if "foo" and "baa" are different (for example because of a task move),
    this charge is effectively moved from "baa" to "foo".

    I think this is unexpected behavior.

    This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
    to return the memcg to which the swapcache page has been charged if the
    PCG_USED bit is set.
    IIUC, checking the PCG_USED bit of a swapcache page is safe under the page lock.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it can be
    removed; KAMEZAWA-san suggested this.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch tries to fix OOM killer problems caused by hierarchy.
    Currently, memcg itself has an OOM kill function (in oom_kill.c) and tries
    to kill a task in the memcg.

    But when hierarchy is used, this is broken and the correct task cannot
    be killed. For example, in the following cgroup tree

    /groupA/   hierarchy=1, limit=1G
        01     no limit
        02     no limit

    the memory usage of all tasks under /groupA, /groupA/01 and /groupA/02 is
    limited to groupA's 1GB, but the OOM killer only kills tasks in groupA itself.

    This patch makes the bad process be selected from all tasks under the
    hierarchy. BTW, currently, oom_jiffies is updated only for groupA in the
    above case; the oom_jiffies of the whole tree should be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • As pointed out, shrinking a memcg's limit should return -EBUSY after a
    reasonable number of retries. This patch tries to fix the current behavior
    of shrink_usage.

    Before looking into the "shrink should return -EBUSY" problem, we should fix
    the hierarchical reclaim code. It compares the current usage and the current
    limit, but that only makes sense when the kernel reclaims memory because a
    limit was hit. This is also a problem.

    What this patch does:

    1. Add a new argument "shrink" to hierarchical reclaim. If "shrink==true",
    hierarchical reclaim returns immediately and the caller checks whether the
    kernel should shrink more or not.
    (While shrinking memory, usage is always smaller than limit, so a
    usage < limit check is useless.)

    2. To adjust to the above change, two changes in "shrink"'s retry path:
    2-a. retry_count depends on the number of children, because the kernel
    visits the children under the hierarchy one by one.
    2-b. Rather than checking the return value of hierarchical reclaim's
    progress, compare usage-before-shrink with usage-after-shrink.
    If usage-before-shrink
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Clean up the memory.stat file routine and show "total" hierarchical stats.

    This patch
    - renames get_all_zonestat to get_local_zonestat.
    - removes the old mem_cgroup_stat_desc, which is only for per-cpu stats.
    - adds mcs_stat to cover both per-cpu and per-lru stats.
    - adds "total" stats for the hierarchy (*)
    - adds a callback system to scan all memcgs under a root.
    == "total" entries are added.
    [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
    cache 0
    rss 0
    pgpgin 0
    pgpgout 0
    inactive_anon 0
    active_anon 0
    inactive_file 0
    active_file 0
    unevictable 0
    hierarchical_memory_limit 50331648
    hierarchical_memsw_limit 9223372036854775807
    total_cache 65536
    total_rss 192512
    total_pgpgin 218
    total_pgpgout 155
    total_inactive_anon 0
    total_active_anon 135168
    total_inactive_file 61440
    total_active_file 4096
    total_unevictable 0
    ==
    (*) Maybe users could calculate the hierarchical stats with their own
    program in userland, but if it can be done in a clean way in the kernel,
    it's worth showing, I think.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Assign a CSS ID to each memcg and use css_get_next() for scanning the
    hierarchy.

    Assume the following tree:

    group_A (ID=3)
        /01 (ID=4)
            /0A (ID=7)
        /02 (ID=10)
    group_B (ID=5)

    and a task in group_A/01/0A hits the limit at group_A.

    Reclaim will be done in the following order (round-robin):
    group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
    -> group_A -> .....

    The round-robin is by ID. The last visited cgroup is recorded, and reclaim
    restarts from it the next time it runs.
    (A smarter algorithm could be implemented.)

    No cgroup_mutex or hierarchy_mutex is required.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In the following situation, with the memory subsystem,

    /groupA   use_hierarchy==1
        /01   some tasks
        /02   some tasks
        /03   some tasks
        /04   empty

    when tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 frequently fails with -EBUSY because of temporary
    refcounts taken by the kernel.

    In general, a cgroup can be rmdir'd if it has no child groups and
    no tasks. Frequent failures of rmdir() are not useful to users
    (and in most cases the reason for -EBUSY is unknown to them).

    This patch modifies the above behavior by
    - retrying if the css refcount is held by someone.
    - adding a return value to pre_destroy() that allows a subsystem to
      say "we're really busy!"

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • It is a fairly common operation to have a pointer to a work and to need a
    pointer to the delayed work it is contained in. In particular, all
    delayed works which want to rearm themselves will have to do that. So it
    would seem fair to offer a helper function for this operation.
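
    The helper boils down to a container_of() wrapper; a minimal sketch:

    static inline struct delayed_work *to_delayed_work(struct work_struct *work)
    {
            return container_of(work, struct delayed_work, work);
    }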

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jean Delvare
    Acked-by: Ingo Molnar
    Cc: "David S. Miller"
    Cc: Herbert Xu
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • The calculation of the value nr in do_xip_mapping_read is incorrect. If
    the copy requires more than one iteration in the do-while loop, the copied
    variable will be non-zero. The maximum length that may be passed to the
    call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied, but the
    check only compares against (nr > len).
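
    In other words, the per-iteration clamp must account for what has already
    been copied; a hedged sketch of the corrected check, using the variable
    names from the description above:

    /* nr: how much this iteration may hand to copy_to_user() */
    if (nr > len - copied)
            nr = len - copied;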

    This bug is the cause for the heap corruption Carsten has been chasing
    for so long:

    *** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x200000b9b44]
    /lib64/libc.so.6(cfree+0x8e)[0x200000bdade]
    /bin/bash(free_buffered_stream+0x32)[0x80050e4e]
    /bin/bash(close_buffered_stream+0x1c)[0x80050ea4]
    /bin/bash(unset_bash_input+0x2a)[0x8001c366]
    /bin/bash(make_child+0x1d4)[0x8004115c]
    /bin/bash[0x8002fc3c]
    /bin/bash(execute_command_internal+0x656)[0x8003048e]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(execute_command_internal+0x79a)[0x800305d2]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(reader_loop+0x270)[0x8001efe0]
    /bin/bash(main+0x1328)[0x8001e960]
    /lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8]
    /bin/bash(clearerr+0x5e)[0x8001c092]

    With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2
    "ext2/xip: refuse to change xip flag during remount with busy inodes" can
    be removed again.

    Cc: Carsten Otte
    Cc: Nick Piggin
    Cc: Jared Hulbert
    Cc:
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Even though vmstat_work is marked deferrable, there are still benefits to
    aligning it. For certain applications we want to keep OS jitter as low as
    possible and aligning timers and work so they occur together can reduce
    their overall impact.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back to fork.c.

    (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This fixes a build failure with generic debug pagealloc:

    mm/debug-pagealloc.c: In function 'set_page_poison':
    mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'clear_page_poison':
    mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'page_poison':
    mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: At top level:
    mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
    include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
    mm/debug-pagealloc.c: In function 'kernel_map_pages':
    mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

    The failure is fixed by:

    - adding debug_flags to struct page
    - defining the DEBUG_PAGEALLOC config option for all architectures

    Signed-off-by: Akinobu Mita
    Reported-by: Alexander Beregalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

01 Apr, 2009

24 commits

  • Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
    swap loads benefit, and a catastrophic interaction between SLUB and some
    flash storage is avoided.

    shmem_writepage() has always been peculiar in making no attempt to write:
    it has just transferred a shmem page from file cache to swap cache, then
    let that page make its way around the LRU again before being written and
    freed.

    The idea was that people use tmpfs because they want those pages to stay
    in RAM; so although we give it an overflow to swap, we should resist
    writing too soon, giving those pages a second chance before they can be
    reclaimed.

    That was always questionable, and I've toyed with this patch for years;
    but never had a clear justification to depart from the original design.

    It became more questionable in 2.6.28, when the split LRU patches classed
    shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
    itself gives them more resistance to reclaim than normal file pages. I
    prepared this patch for 2.6.29, but the merge window arrived before I'd
    completed gathering statistics to justify sending it in.

    Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
    habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
    tests five times slower than SLAB or SLQB - other machines slower too, but
    nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

    slub_max_order=0 brings sanity to all, but heavy swapping is too far from
    normal to justify such a tuning. The crucial factor on that laptop turns
    out to be that I'm using an SD card for swap. What happens is this:

    By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
    fs inodes), so creating tmpfs files under memory pressure brings lumpy
    reclaim into play. One subpage of the order is chosen from the bottom of
    the LRU as usual, then the other three picked out from their random
    positions on the LRUs.

    In a tmpfs load, many of these pages will be ones which already passed
    through shmem_writepage, so already have swap allocated. And though their
    offsets on swap were probably allocated sequentially, now that the pages
    are picked off at random, their swap offsets are scattered.

    But the flash storage on the SD card is very sensitive to having its
    writes merged: once swap is written at scattered offsets, performance
    falls apart. Rotating disk seeks increase too, but less disastrously.

    So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
    out to swap as soon as their swap has been allocated.

    It's surely possible to devise an artificial load which runs faster the
    old way, one whose sizing is such that the tmpfs pages on their second
    pass are the ones that are wanted again, and other pages not.

    But I've not yet found such a load: on all machines, under the loads I've
    tried, immediate swap_writepage speeds up shmem swapping: especially when
    using the SLUB allocator (and more effectively than slub_max_order=0), but
    also with the others; and it also reduces the variance between runs. How
    much faster varies widely: a factor of five is rare, 5% is common.

    One load which might have suffered: imagine a swapping shmem load in a
    limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
    swapcache was not charged, and such a load would have run quickest with
    the shmem swapcache never written to swap. But now swapcache is charged,
    so even this load benefits from shmem_writepage directly to swap.

    Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
    it's silly because that will never get called; but refactoring shmem.c
    sensibly according to CONFIG_SWAP will be a separate task.

    Signed-off-by: Hugh Dickins
    Acked-by: Pekka Enberg
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • try_to_free_pages() is used for the direct reclaim of up to
    SWAP_CLUSTER_MAX pages when watermarks are low. The caller of
    alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
    be used, but this is not passed to try_to_free_pages(). This can lead to
    unnecessary reclaim of pages that are unusable by the caller and, in the
    worst case, to allocation failure because progress was not made where it
    is needed.

    This patch passes the nodemask used for alloc_pages_nodemask() to
    try_to_free_pages().
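
    A sketch of the resulting call shape (the exact signature is an assumption
    based on the description):

    /* The allocator's nodemask is now threaded through to direct reclaim. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask, nodemask_t *nodemask);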

    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When a shrinker has a negative number of objects to delete, the symbol
    name of the shrinker should be printed, not shrink_slab. This also makes
    the error message slightly more informative.

    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical
    reason it shouldn't be available, and it can be used for ramfs.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The mlock() facility does not exist for NOMMU since all mappings are
    effectively locked anyway, so we don't make the bits available when
    they're not useful.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • x86 has debug_kmap_atomic_prot(), which is an error-checking function for
    kmap_atomic. It is useful for the other architectures, although it needs
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    This patch exposes it to the other architectures.

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.
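
    Sketched from the description (treat the exact signature as an assumption),
    the hook changes roughly like this:

    /* old: int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page); */
    /* new: takes the fault descriptor and returns VM_FAULT_xxx flags           */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);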

    This makes it possible to return much more detailed error information to
    the VM (and can also provide more information, e.g. the virtual_address,
    to the driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

    On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
    way from the user's point of view:

    # echo 500 >/proc/sys/vm/dirty_writeback_centisecs
    # cat /proc/sys/vm/dirty_writeback_centisecs
    499

    So, we have 5000 jiffies converted to only 499 clock ticks and reported
    back.

    TICK_NSEC = 999848
    ACTHZ = 256039
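
    A rough back-of-the-envelope calculation (assuming the conversion multiplies
    by the real tick length in nanoseconds and truncates): 5000 jiffies *
    999848 ns = 4,999,240,000 ns = 499.924 centiseconds, which truncates to 499.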

    Keeping the in-kernel variable in the units passed from userspace would fix
    the issue, of course, but this probably won't be right for every sysctl.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • CONFIG_DEBUG_PAGEALLOC is now supported by x86, powerpc, sparc64, and
    s390. This patch implements it for the rest of the architectures by
    filling the pages with poison byte patterns after free_pages() and
    verifying the poison patterns before alloc_pages().

    This generic version cannot detect invalid page accesses immediately, but
    an invalid read may cause a bad dereference through poisoned memory, and an
    invalid write can be detected after a (possibly long) delay, when the
    poison pattern is next verified.
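
    A minimal sketch of the idea (the poison byte value and helper names here
    are assumptions, and highmem is ignored for brevity):

    #define PAGE_POISON 0xaa        /* assumed poison byte pattern */

    static void poison_page(struct page *page)
    {
            void *addr = page_address(page);        /* lowmem assumed */
            memset(addr, PAGE_POISON, PAGE_SIZE);
    }

    static bool page_is_poisoned(struct page *page)
    {
            unsigned char *addr = page_address(page);
            size_t i;

            for (i = 0; i < PAGE_SIZE; i++)
                    if (addr[i] != PAGE_POISON)
                            return false;   /* someone wrote here after free */
            return true;
    }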

    Signed-off-by: Akinobu Mita
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • I noticed there are many places doing a copy_from_user() that follows a
    kmalloc():

    dst = kmalloc(len, GFP_KERNEL);
    if (!dst)
            return -ENOMEM;
    if (copy_from_user(dst, src, len)) {
            kfree(dst);
            return -EFAULT;
    }

    memdup_user() is a wrapper of the above code. With this new function, we
    don't have to write 'len' twice, which can lead to typos/mistakes. It
    also produces smaller code and kernel text.
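
    Roughly, the helper wraps that pattern and reports failure via ERR_PTR()
    (a sketch; the real implementation may differ in allocation details):

    void *memdup_user(const void __user *src, size_t len)
    {
            void *p;

            p = kmalloc(len, GFP_KERNEL);
            if (!p)
                    return ERR_PTR(-ENOMEM);

            if (copy_from_user(p, src, len)) {
                    kfree(p);
                    return ERR_PTR(-EFAULT);
            }

            return p;
    }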

    A quick grep shows 250+ places where memdup_user() *may* be used. I'll
    prepare a patchset to do this conversion.

    Signed-off-by: Li Zefan
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • chg is unsigned, so it cannot be less than 0.

    Also, since region_chg returns long, let vma_needs_reservation() forward
    this to alloc_huge_page(), and store it as long as well; all callers cast
    it to long anyway.

    Signed-off-by: Roel Kluin
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • pagevec_swap_free() is now unused.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The pagevec_swap_free() at the end of shrink_active_list() was introduced
    in 68a22394 "vmscan: free swap space on swap-in/activation" when
    shrink_active_list() was still rotating referenced active pages.

    In 7e9cd48 "vmscan: fix pagecache reclaim referenced bit check" this was
    changed, the rotating removed but the pagevec_swap_free() after the
    rotation loop was forgotten, applying now to the pagevec of the
    deactivation loop instead.

    Now swap space is freed for deactivated pages. And only for those that
    happen to be on the pagevec after the deactivation loop.

    Complete 7e9cd48 and remove the rest of the swap freeing.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In shrink_active_list() after the deactivation loop, we strip buffer heads
    from the potentially remaining pages in the pagevec.

    Currently, this drops the zone's lru lock for stripping, only to reacquire
    it again afterwards to update statistics.

    It is not necessary to strip the pages before updating the stats, so move
    the whole thing out of the protected region and save the extra locking.

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     
  • During page allocation, there are two stages of direct reclaim that are
    applied to each zone in the preferred list. The first stage, using
    zone_reclaim(), reclaims unmapped file-backed pages and slab pages if over
    defined limits, as these are cheaper to reclaim. The caller specifies the
    order of the target allocation, but the scan control is not being correctly
    initialised.

    The impact is that the correct number of pages is being reclaimed but
    lumpy reclaim is not being applied. This increases the chances that a full
    direct reclaim via try_to_free_pages() is required.

    This patch initialises the order field of the scan control as requested by
    the caller.

    [mel@csn.ul.ie: rewrote changelog]
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • At first glance, mark_page_accessed() in follow_page() seems a bit strange.
    It seems pte_mkyoung() would be more consistent with other kernel code.

    However, it is intentional. The original commit log said:

    ------------------------------------------------
    commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
    Author: akpm
    Date: Fri Aug 15 07:24:59 2003 +0000

    [PATCH] Use mark_page_accessed() in follow_page()

    Touching a page via follow_page() counts as a reference so we should be
    either setting the referenced bit in the pte or running mark_page_accessed().

    Altering the pte is tricky because we haven't implemented an atomic
    pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
    aging state: it can move the page onto the active list.

    BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
    ------------------------------------------------

    The atomicity issue is still true nowadays. Adding a comment helps readers
    understand the intention of the code.

    [akpm@linux-foundation.org: clarify text]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shrink_inactive_list() scans in sc->swap_cluster_max chunks until it hits
    the scan limit it was passed.

    shrink_inactive_list()
    {
            do {
                    isolate_pages(swap_cluster_max)
                    shrink_page_list()
            } while (nr_scanned < max_scan);
    }

    This assumes that swap_cluster_max is not bigger than the scan limit
    because the latter is checked only after at least one iteration.

    In shrink_all_memory() sc->swap_cluster_max is initialized to the overall
    reclaim goal in the beginning but not decreased while reclaim is making
    progress which leads to subsequent calls to shrink_inactive_list()
    reclaiming way too much in the one iteration that is done unconditionally.

    Always set sc->swap_cluster_max to the proper goal before doing
    shrink_all_zones()
    shrink_list()
    shrink_inactive_list().

    While the current shrink_all_memory() happily reclaims more than actually
    requested, this patch fixes it to never exceed the goal:

    unpatched
    wanted=10000 reclaimed=13356
    wanted=10000 reclaimed=19711
    wanted=10000 reclaimed=10289
    wanted=10000 reclaimed=17306
    wanted=10000 reclaimed=10700
    wanted=10000 reclaimed=10004
    wanted=10000 reclaimed=13301
    wanted=10000 reclaimed=10976
    wanted=10000 reclaimed=10605
    wanted=10000 reclaimed=10088
    wanted=10000 reclaimed=15000

    patched
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9599
    wanted=10000 reclaimed=8476
    wanted=10000 reclaimed=8326
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9919
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9624
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=8500 reclaimed=8092
    wanted=316 reclaimed=316

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Acked-by: Nigel Cunningham
    Acked-by: "Rafael J. Wysocki"
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit a79311c14eae4bb946a97af25f3e1b17d625985d "vmscan: bail out of
    direct reclaim after swap_cluster_max pages" moved the nr_reclaimed
    counter into the scan control to accumulate the number of all reclaimed
    pages in a reclaim invocation.

    shrink_all_memory() can use the same mechanism; this increases code
    consistency and readability.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: MinChan Kim
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Johannes Weiner
    Cc: "Rafael J. Wysocki"
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • Commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't
    mark_page_accessed in fault path) only removed the mark_page_accessed()
    call in filemap_fault().

    Therefore, swap-backed pages and file-backed pages have inconsistent
    behavior; mark_page_accessed() should be removed from do_swap_page() as well.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Impact: cleanup

    In almost all cases, for_each_zone() is used together with populated_zone(),
    because almost no function needs memoryless-node information.
    Therefore, for_each_populated_zone() helps simplify the code.

    This patch has no functional change.
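
    A sketch of what such a wrapper can look like (the exact definition here is
    an assumption):

    /* Iterate only over zones that actually contain pages. */
    #define for_each_populated_zone(zone)                   \
            for_each_zone(zone)                             \
                    if (!populated_zone(zone))              \
                            ;       /* skip empty zone */   \
                    else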

    [akpm@linux-foundation.org: small cleanup]
    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • sc.may_swap does not only influence reclaiming of anon pages but pages
    mapped into pagetables in general, which also includes mapped file pages.

    In shrink_page_list():

    if (!sc->may_swap && page_mapped(page))
            goto keep_locked;

    For anon pages, this makes sense as they are always mapped and reclaiming
    them always requires swapping.

    But mapped file pages are skipped here as well and it has nothing to do
    with swapping.

    The real effect of the knob is whether mapped pages are unmapped and
    reclaimed or not. Rename it to `may_unmap' to have its name match its
    actual meaning more precisely.

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no need to call int_sqrt() if the argument is 0.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • vmap's dirty_list is unused. It's meant for optimizing flushing, but Nick
    hasn't written that code yet, so we don't need it until it's actually
    needed.

    This patch removes vmap_block's dirty_list and codes related to it.

    Signed-off-by: MinChan Kim
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim