12 Sep, 2013

2 commits

  • Pavel reported that in case if vma area get unmapped and then mapped (or
    expanded) in-place, the soft dirty tracker won't be able to recognize this
    situation since it works on pte level and ptes are get zapped on unmap,
    loosing soft dirty bit of course.

    So to resolve this situation we need to track actions on vma level, there
    VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded)
    we set this bit, and keep it here until application calls for clearing
    soft dirty bit.

    Thus when user space application track memory changes now it can detect if
    vma area is renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Explicitly mention/recommend using the libhugetlbfs test cases when
    changing related kernel code. Developers that are unaware of the project
    can easily miss this and introduce potential regressions that may or may
    not be caught by community review.

    Also do some cleanups that make the document visually easier to view at a
    first glance.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Jul, 2013

1 commit

  • Add the documentation file for the zswap functionality

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

10 Jul, 2013

1 commit

  • Transparent huge zero page is used during the page fault instead of in
    khugepaged.

    # ls /sys/kernel/mm/transparent_hugepage/
    defrag enabled khugepaged use_zero_page
    # ls /sys/kernel/mm/transparent_hugepage/khugepaged/
    alloc_sleep_millisecs defrag full_scans max_ptes_none pages_collapsed pages_to_scan scan_sleep_millisecs

    This patch corrects the documentation just like the codes done.

    Signed-off-by: Wanpeng Li
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

2 commits

  • In order to reuse bits from pagemap entries gracefully, we leave the
    entries as is but on pagemap open emit a warning in dmesg, that bits
    55-60 are about to change in a couple of releases. Next, if a user
    issues soft-dirty clear command via the clear_refs file (it was disabled
    before v3.9) we assume that he's aware of the new pagemap format, note
    that fact and report the bits in pagemap in the new manner.

    The "migration strategy" looks like this then:

    1. existing users are not affected -- they don't touch soft-dirty feature, thus
    see old bits in pagemap, but are warned and have time to fix themselves
    2. those who use soft-dirty know about new pagemap format
    3. some time soon we get rid of any signs of page-shift in pagemap as well as
    this trick with clear-soft-dirty affecting pagemap format.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The soft-dirty is a bit on a PTE which helps to track which pages a task
    writes to. In order to do this tracking one should

    1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
    2. Wait some time.
    3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is. Thus, after this, when the task tries to modify a
    page at some virtual address the #PF occurs and the kernel sets the
    soft-dirty bit on the respective PTE.

    Note, that although all the task's address space is marked as r/o after
    the soft-dirty bits clear, the #PF-s that occur after that are processed
    fast. This is so, since the pages are still mapped to physical memory,
    and thus all the kernel does is finds this fact out and puts back
    writable, dirty and soft-dirty bits on the PTE.

    Another thing to note, is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

28 May, 2013

1 commit


30 Apr, 2013

1 commit

  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root-cant-log-in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interfce (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on the whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because mode
    forces us to ensure we can fulfill all of the requested memory allocations--
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutley safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcomitt 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, reclaimable slab pages into consideration. In other words,
    these are all pages available, then why isn't overcommit suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the admin to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5432/5432 no yes yes
    guess yes 4 5444/5444 1 yes yes
    guess no 1 5302/5449 no yes yes
    guess no 4 - crash no no

    never yes 1 5460/5460 1 yes yes
    never yes 4 5460/5460 1 yes yes
    never no 1 5218/5432 no no yes
    never no 4 5203/5448 no no yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5419/5419 no - yes 8MB yes
    guess yes 4 5436/5436 1 - yes 8MB yes
    guess no 1 5440/5440 * - yes 8MB yes
    guess no 4 - crash - no 8MB no

    * process would successfully mlock, then the oom killer would pick it

    never yes 1 5446/5446 no 10MB yes 20MB yes
    never yes 4 5456/5456 no 10MB yes 20MB yes
    never no 1 5387/5429 no 128MB no 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely

    never no 1 5359/5448 no 10MB no 10MB barely

    never no 1 5323/5428 no 0MB no 10MB barely
    never no 1 5332/5428 no 0MB no 50MB yes
    never no 1 5293/5429 no 0MB no 90MB yes

    never no 1 5001/5427 no 230MB yes 338MB yes
    never no 4* 4998/5424 no 230MB yes 338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     

24 Feb, 2013

2 commits

  • Added slightly more detail to the Documentation of merge_across_nodes, a
    few comments in areas indicated by review, and renamed get_ksm_page()'s
    argument from "locked" to "lock_it". No functional change.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
    Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
    we had with that, fully enabling KSM page migration on the way.

    (A different kind of KSM/NUMA issue which I've certainly not begun to
    address here: when KSM pages are unmerged, there's usually no sense in
    preferring to allocate the new pages local to the caller's node.)

    This patch:

    Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
    which control merging pages across different numa nodes. When it is set
    to zero only pages from the same node are merged, otherwise pages from
    all nodes can be merged together (default behavior).

    Typical use-case could be a lot of KVM guests on NUMA machine and cpus
    from more distant nodes would have significant increase of access
    latency to the merged ksm page. Sysfs knob was choosen for higher
    variability when some users still prefers higher amount of saved
    physical memory regardless of access latency.

    Every numa node has its own stable & unstable trees because of faster
    searching and inserting. Changing of merge_across_nodes value is
    possible only when there are not any ksm shared pages in system.

    I've tested this patch on numa machines with 2, 4 and 8 nodes and
    measured speed of memory access inside of KVM guests with memory pinned
    to one of nodes with this benchmark:

    http://pholasek.fedorapeople.org/alloc_pg.c

    Population standard deviations of access times in percentage of average
    were following:

    merge_across_nodes=1
    2 nodes 1.4%
    4 nodes 1.6%
    8 nodes 1.7%

    merge_across_nodes=0
    2 nodes 1%
    4 nodes 0.32%
    8 nodes 0.018%

    RFC: https://lkml.org/lkml/2011/11/30/91
    v1: https://lkml.org/lkml/2012/1/23/46
    v2: https://lkml.org/lkml/2012/6/29/105
    v3: https://lkml.org/lkml/2012/9/14/550
    v4: https://lkml.org/lkml/2012/9/23/137
    v5: https://lkml.org/lkml/2012/12/10/540
    v6: https://lkml.org/lkml/2012/12/23/154
    v7: https://lkml.org/lkml/2012/12/27/225

    Hugh notes that this patch brings two problems, whose solution needs
    further support in mm/ksm.c, which follows in subsequent patches:

    1) switching merge_across_nodes after running KSM is liable to oops
    on stale nodes still left over from the previous stable tree;

    2) memory hotremove may migrate KSM pages, but there is no provision
    here for !merge_across_nodes to migrate nodes to the proper tree.

    Signed-off-by: Petr Holasek
    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Holasek
     

14 Dec, 2012

1 commit

  • Merge misc VM changes from Andrew Morton:
    "The rest of most-of-MM. The other MM bits await a slab merge.

    This patch includes the addition of a huge zero_page. Not a
    performance boost but it an save large amounts of physical memory in
    some situations.

    Also a bunch of Fujitsu engineers are working on memory hotplug.
    Which, as it turns out, was badly broken. About half of their patches
    are included here; the remainder are 3.8 material."

    However, this merge disables CONFIG_MOVABLE_NODE, which was totally
    broken. We don't add new features with "default y", nor do we add
    Kconfig questions that are incomprehensible to most people without any
    help text. Does the feature even make sense without compaction or
    memory hotplug?

    * akpm: (54 commits)
    mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
    mm/memory.c: remove unused code from do_wp_page()
    asm-generic, mm: pgtable: consolidate zero page helpers
    mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
    hwpoison, hugetlbfs: fix RSS-counter warning
    hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
    mm: protect against concurrent vma expansion
    memcg: do not check for mm in __mem_cgroup_count_vm_event
    tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
    mm: provide more accurate estimation of pages occupied by memmap
    fs/buffer.c: remove redundant initialization in alloc_page_buffers()
    fs/buffer.c: do not inline exported function
    writeback: fix a typo in comment
    mm: introduce new field "managed_pages" to struct zone
    mm, oom: remove statically defined arch functions of same name
    mm, oom: remove redundant sleep in pagefault oom handler
    mm, oom: cleanup pagefault oom handler
    memory_hotplug: allow online/offline memory to result movable node
    numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
    mm, memcg: avoid unnecessary function call when memcg is disabled
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • By default kernel tries to use huge zero page on read page fault. It's
    possible to disable huge zero page by writing 0 or enable it back by
    writing 1:

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which where dropped due
    race with other allocation. Note, it doesn't count every map
    of the huge zero page, only its allocation.

    hzp_alloc_failed is incremented if kernel fails to allocate huge zero
    page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have vma on the stack. We provides
    split_huge_page_pmd_mm() for few cases when we have mm, but not vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

19 Nov, 2012

1 commit


09 Oct, 2012

2 commits

  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
    currently it lost original meaning but still has some effects:

    | effect | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes reserved_vm counter from mm_struct. Seems like nobody
    cares about it, it does not exported into userspace directly, it only
    reduces total_vm showed in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

22 Aug, 2012

1 commit

  • Commit f0f57b2b1488 ("mm: move hugepage test examples to
    tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and
    hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it
    didn't update hugetlbpage.txt

    Signed-off-by: Zhouping Liu
    Acked-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhouping Liu
     

23 Jul, 2012

1 commit


05 Jun, 2012

1 commit

  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

02 Jun, 2012

1 commit

  • Pull slab updates from Pekka Enberg:
    "Mainly a bunch of SLUB fixes from Joonsoo Kim"

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slub: use __SetPageSlab function to set PG_slab flag
    slub: fix a memory leak in get_partial_node()
    slub: remove unused argument of init_kmem_cache_node()
    slub: fix a possible memory leak
    Documentations: Fix slabinfo.c directory in vm/slub.txt
    slub: fix incorrect return type of get_any_partial()

    Linus Torvalds
     

01 Jun, 2012

1 commit

  • This is an implementation of Andrew's proposal to extend the pagemap file
    bits to report what is missing about tasks' working set.

    The problem with the working set detection is multilateral. In the criu
    (checkpoint/restore) project we dump the tasks' memory into image files
    and to do it properly we need to detect which pages inside mappings are
    really in use. The mincore syscall I though could help with this did not.
    First, it doesn't report swapped pages, thus we cannot find out which
    parts of anonymous mappings to dump. Next, it does report pages from page
    cache as present even if they are not mapped, and it doesn't make that has
    not been cow-ed.

    Note, that issue with swap pages is critical -- we must dump swap pages to
    image file. But the issues with file pages are optimization -- we can
    take all file pages to image, this would be correct, but if we know that a
    page is not mapped or not cow-ed, we can remove them from dump file. The
    dump would still be self-consistent, though significantly smaller in size
    (up to 10 times smaller on real apps).

    Andrew noticed, that the proc pagemap file solved 2 of 3 above issues --
    it reports whether a page is present or swapped and it doesn't report not
    mapped page cache pages. But, it doesn't distinguish cow-ed file pages
    from not cow-ed.

    I would like to make the last unused bit in this file to report whether the
    page mapped into respective pte is PageAnon or not.

    [comment stolen from Pavel Emelyanov's v1 patch]

    Signed-off-by: Konstantin Khlebnikov
    Cc: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

30 May, 2012

1 commit


15 May, 2012

2 commits

  • Sounds so much more natural.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • This patch 4of4 adds configuration and documentation files including a FAQ.

    [v14: updated docs/FAQ to use zcache and RAMster as examples]
    [v10: no change]
    [v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
    [v8: rebase to 3.0-rc4]
    [v7: rebase to 3.0-rc3]
    [v6: rebase to 3.0-rc1]
    [v5: change config default to n]
    [v4: rebase to 2.6.39]
    Signed-off-by: Dan Magenheimer
    Acked-by: Jan Beulich
    Acked-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

10 May, 2012

1 commit


29 Mar, 2012

2 commits

  • hugepage-mmap.c, hugepage-shm.c and map_hugetlb.c in Documentation/vm are
    simple pass/fail tests, It's better to promote them to
    tools/testing/selftests.

    Thanks suggestion of Andrew Morton about this. They all need firstly
    setting up proper nr_hugepages and hugepage-mmap need to mount hugetlbfs.
    So I add a shell script run_vmtests to do such work which will call the
    three test programs and check the return value of them.

    Changes to original code including below:
    a. add run_vmtests script
    b. return error when read_bytes mismatch with writed bytes.
    c. coding style fixes: do not use assignment in if condition

    [akpm@linux-foundation.org: build the targets before trying to execute them]
    [akpm@linux-foundation.org: Documentation/vm/ no longer has a Makefile. Fixes "make clean"]
    Signed-off-by: Dave Young
    Cc: Wu Fengguang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • tools/ is the better place for vm tools which are used by many people.
    Moving them to tools also make them open to more users instead of hide in
    Documentation folder.

    This patch moves page-types.c to tools/vm/page-types.c. Also add a
    Makefile in tools/vm and fix two coding style problems: a) change const
    arrary to 'const char * const', b) change a space to tab for indent.

    Signed-off-by: Dave Young
    Acked-by: Wu Fengguang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

23 Mar, 2012

1 commit

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     

22 Mar, 2012

1 commit

  • page-types, which is a common user of pagemap, gets aware of thp with this
    patch. This helps system admins and kernel hackers know about how thp
    works. Here is a sample output of page-types over a thp:

    $ page-types -p --raw --list

    voffset offset len flags
    ...
    7f9d40200 3f8400 1 ___U_lA____Ma_bH______t____________
    7f9d40201 3f8401 1ff ________________T_____t____________

    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000410000 511 1 ________________T_____t____________ compound_tail,thp
    0x000000000040d868 1 0 ___U_lA____Ma_bH______t____________ uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp

    Signed-off-by: Naoya Horiguchi
    Acked-by: Wu Fengguang
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

20 Mar, 2012

1 commit


07 Mar, 2012

1 commit


10 Feb, 2012

2 commits


24 Jan, 2012

2 commits

  • [v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]

    Signed-off-by: Dan Magenheimer
    Signed-off-by: Konrad Wilk
    Cc: Jan Beulich
    Acked-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton

    Dan Magenheimer
     
  • Per akpm suggestions alter the use of the term flush to be
    invalidate. The next patch will do this across all MM.

    This change is completely cosmetic.

    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

    Signed-off-by: Dan Magenheimer
    Cc: Kamezawa Hiroyuki
    Cc: Jan Beulich
    Reviewed-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v10: Fixed fs: move code out of buffer.c conflict change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

13 Jan, 2012

1 commit


28 Nov, 2011

1 commit


26 Oct, 2011

1 commit