11 Jan, 2012

1 commit

  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    | anon | |
    +----------+ |
    | | |
    | | -- dirty limit new -- flusher new
    | file | | |
    | | | |
    | | -- dirty limit old -- flusher old
    | | |
    +----------+ --- reclaim
    | reserved |
    +----------+
    | kernel |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Nov, 2011

1 commit

  • Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally,
    macro isn't recommended as it's type-unsafe and making debugging harder as
    symbol cannot be passed throught to the debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Sep, 2011

1 commit

  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Jul, 2011

3 commits

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to memory.stat file. But it doesn't
    because we considered we needed to make a concensus for such new APIs.

    This patch is a trial to add memory.scan_stat. This shows
    - the number of scanned pages(total, anon, file)
    - the number of rotated pages(total, anon, file)
    - the number of freed pages(total, anon, file)
    - the number of elaplsed time (including sleep/pause time)

    for both of direct/soft reclaim.

    The biggest difference with oringinal Ying's one is that this file
    can be reset by some write, as

    # echo 0 ...../memory.scan_stat

    Example of output is here. This is a result after make -j 6 kernel
    under 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    total_xxxx is for hierarchy management.

    This will be useful for further memcg developments and need to be
    developped before we do some complicated rework on LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord into scan_control struct.
    sc->nr_scanned at el is not designed for exporting information. For
    example, nr_scanned is reset frequentrly and incremented +2 at scanning
    mapped pages.

    To avoid complexity, I added a new param in scan_control which is for
    exporting scanning score.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed by
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages()
    and swappiness is passed by argument. It's propagated by scan_control.

    get_swappiness() is a static function but some planned updates will need
    to get swappiness from files other than memcontrol.c This patch exports
    get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the
    argument of swapiness from try_to_free... and drop swappiness from
    scan_control. only memcg uses it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jun, 2011

1 commit

  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Jun, 2011

1 commit

  • Currently, memcg reclaim can disable swap token even if the swap token mm
    doesn't belong in its memory cgroup. It's slightly risky. If an admin
    creates very small mem-cgroup and silly guy runs contentious heavy memory
    pressure workload, every tasks are going to lose swap token and then
    system may become unresponsive. That's bad.

    This patch adds 'memcg' parameter into disable_swap_token(). and if the
    parameter doesn't match swap token, VM doesn't disable it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 May, 2011

1 commit

  • The global kswapd scans per-zone LRU and reclaims pages regardless of the
    cgroup. It breaks memory isolation since one cgroup can end up reclaiming
    pages from another cgroup. Instead we should rely on memcg-aware target
    reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
    memory pressure.

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored. This patch is the first
    step to skip shrink_zone() if soft_limit reclaim does enough work.

    This is part of the effort which tries to reduce reclaiming pages in global
    LRU in memcg. The per-memcg background reclaim patchset further enhances the
    per-cgroup targetting reclaim, which I should have V4 posted shortly.

    Try running multiple memory intensive workloads within seperate memcgs. Watch
    the counters of soft_steal in memory.stat.

    $ cat /dev/cgroup/A/memory.stat | grep 'soft'
    soft_steal 240000
    soft_scan 240000
    total_soft_steal 240000
    total_soft_scan 240000

    This patch:

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored.

    We would like to skip shrink_zone() if soft_limit reclaim does enough
    work. Also, we need to make the memory pressure balanced across per-memcg
    zones, like the logic vm-core. This patch is the first step where we
    start with counting the nr_scanned and nr_reclaimed from soft_limit
    reclaim into the global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

2 commits

  • When reclaiming for order-0 pages, kswapd requires that all zones be
    balanced. Each cycle through balance_pgdat() does background ageing on
    all zones if necessary and applies equal pressure on the inactive zone
    unless a lot of pages are free already.

    A "lot of free pages" is defined as a "balance gap" above the high
    watermark which is currently 7*high_watermark. Historically this was
    reasonable as min_free_kbytes was small. However, on systems using huge
    pages, it is recommended that min_free_kbytes is higher and it is tuned
    with hugeadm --set-recommended-min_free_kbytes. With the introduction of
    transparent huge page support, this recommended value is also applied. On
    X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
    expect around 68M of memory to be free. The Normal zone is approximately
    35000 pages so under even normal memory pressure such as copying a large
    file, it gets exhausted quickly. As it is getting exhausted, kswapd
    applies pressure equally to all zones, including the DMA32 zone. DMA32 is
    approximately 700,000 pages with a high watermark of around 23,000 pages.
    In this situation, kswapd will reclaim around (23000*8 where 8 is the high
    watermark + balance gap of 7 * high watermark) pages or 718M of pages
    before the zone is ignored. What the user sees is that free memory far
    higher than it should be.

    To avoid an excessive number of pages being reclaimed from the larger
    zones, explicitely defines the "balance gap" to be either 1% of the zone
    or the low watermark for the zone, whichever is smaller. While kswapd
    will check all zones to apply pressure, it'll ignore zones that meets the
    (high_wmark + balance_gap) watermark.

    To test this, 80G were copied from a partition and the amount of memory
    being used was recorded. A comparison of a patch and unpatched kernel can
    be seen at
    http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
    and shows that kswapd is not reclaiming as much memory with the patch
    applied.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Shaohua Li
    Cc: "Chen, Tim C"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Recently, there are reported problem about thrashing.
    (http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup
    workloads(ex, nightly rsync). That's because the workload makes just
    use-once pages and touches pages twice. It promotes the page into active
    list so that it results in working set page eviction.

    Some app developer want to support POSIX_FADV_NOREUSE. But other OSes
    don't support it, either.
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

    By other approach, app developers use POSIX_FADV_DONTNEED. But it has a
    problem. If kernel meets page is writing during invalidate_mapping_pages,
    it can't work. It makes for application programmer to use it since they
    always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
    make sure the pages could be discardable. At last, they can't use
    deferred write of kernel so that they could see performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)

    In fact, invalidation is very big hint to reclaimer. It means we don't
    use the page any more. So let's move the writing page into inactive
    list's head if we can't truncate it right now.

    Why I move page to head of lru on this patch, Dirty/Writeback page would
    be flushed sooner or later. It can prevent writeout of pageout which is
    less effective than flusher's writeout.

    Originally, I reused lru_demote of Peter with some change so added his
    Signed-off-by.

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So lets kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

1 commit

  • Lately I've been working to make KVM use hugepages transparently without
    the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
    see removed:

    1) hugepages have to be swappable or the guest physical memory remains
    locked in RAM and can't be paged out to swap

    2) if a hugepage allocation fails, regular pages should be allocated
    instead and mixed in the same vma without any failure and without
    userland noticing

    3) if some task quits and more hugepages become available in the
    buddy, guest physical memory backed by regular pages should be
    relocated on hugepages automatically in regions under
    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
    kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
    not null)

    4) avoidance of reservation and maximization of use of hugepages whenever
    possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
    1 machine with 1 database with 1 database cache with 1 database cache size
    known at boot time. It's definitely not feasible with a virtualization
    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
    with an unknown size of each virtual machine with an unknown amount of
    pagecache that could be potentially useful in the host for guest not using
    O_DIRECT (aka cache=off).

    hugepages in the virtualization hypervisor (and also in the guest!) are
    much more important than in a regular host not using virtualization,
    becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
    to 19 in case only the hypervisor uses transparent hugepages, and they
    decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
    linux hypervisor and the linux guest both uses this patch (though the
    guest will limit the addition speedup to anonymous regions only for
    now...). Even more important is that the tlb miss handler is much slower
    on a NPT/EPT guest than for a regular shadow paging or no-virtualization
    scenario. So maximizing the amount of virtual memory cached by the TLB
    pays off significantly more with NPT/EPT than without (even if there would
    be no significant speedup in the tlb-miss runtime).

    The first (and more tedious) part of this work requires allowing the VM to
    handle anonymous hugepages mixed with regular pages transparently on
    regular anonymous vmas. This is what this patch tries to achieve in the
    least intrusive possible way. We want hugepages and hugetlb to be used in
    a way so that all applications can benefit without changes (as usual we
    leverage the KVM virtualization design: by improving the Linux VM at
    large, KVM gets the performance boost too).

    The most important design choice is: always fallback to 4k allocation if
    the hugepage allocation fails! This is the _very_ opposite of some large
    pagecache patches that failed with -EIO back then if a 64k (or similar)
    allocation failed...

    Second important decision (to reduce the impact of the feature on the
    existing pagetable handling code) is that at any time we can split an
    hugepage into 512 regular pages and it has to be done with an operation
    that can't fail. This way the reliability of the swapping isn't decreased
    (no need to allocate memory when we are short on memory to swap) and it's
    trivial to plug a split_huge_page* one-liner where needed without
    polluting the VM. Over time we can teach mprotect, mremap and friends to
    handle pmd_trans_huge natively without calling split_huge_page*. The fact
    it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
    (instead of the current void) we'd need to rollback the mprotect from the
    middle of it (ideally including undoing the split_vma) which would be a
    big change and in the very wrong direction (it'd likely be simpler not to
    call split_huge_page at all and to teach mprotect and friends to handle
    hugepages instead of rolling them back from the middle). In short the
    very value of split_huge_page is that it can't fail.

    The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
    incremental and it'll just be an "harmless" addition later if this initial
    part is agreed upon. It also should be noted that locking-wise replacing
    regular pages with hugepages is going to be very easy if compared to what
    I'm doing below in split_huge_page, as it will only happen when
    page_count(page) matches page_mapcount(page) if we can take the PG_lock
    and mmap_sem in write mode. collapse_huge_page will be a "best effort"
    that (unlike split_huge_page) can fail at the minimal sign of trouble and
    we can try again later. collapse_huge_page will be similar to how KSM
    works and the madvise(MADV_HUGEPAGE) will work similar to
    madvise(MADV_MERGEABLE).

    The default I like is that transparent hugepages are used at page fault
    time. This can be changed with
    /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
    to three values "always", "madvise", "never" which mean respectively that
    hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
    or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
    controls if the hugepage allocation should defrag memory aggressively
    "always", only inside "madvise" regions, or "never".

    The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
    put_page (from get_user_page users that can't use mmu notifier like
    O_DIRECT) that runs against a __split_huge_page_refcount instead was a
    pain to serialize in a way that would result always in a coherent page
    count for both tail and head. I think my locking solution with a
    compound_lock taken only after the page_first is valid and is still a
    PageHead should be safe but it surely needs review from SMP race point of
    view. In short there is no current existing way to serialize the O_DIRECT
    final put_page against split_huge_page_refcount so I had to invent a new
    one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
    returns so...). And I didn't want to impact all gup/gup_fast users for
    now, maybe if we change the gup interface substantially we can avoid this
    locking, I admit I didn't think too much about it because changing the gup
    unpinning interface would be invasive.

    If we ignored O_DIRECT we could stick to the existing compound refcounting
    code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
    (and any other mmu notifier user) would call it without FOLL_GET (and if
    FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
    current task mmu notifier list yet). But O_DIRECT is fundamental for
    decent performance of virtualized I/O on fast storage so we can't avoid it
    to solve the race of put_page against split_huge_page_refcount to achieve
    a complete hugepage feature for KVM.

    Swap and oom works fine (well just like with regular pages ;). MMU
    notifier is handled transparently too, with the exception of the young bit
    on the pmd, that didn't have a range check but I think KVM will be fine
    because the whole point of hugepages is that EPT/NPT will also use a huge
    pmd when they notice gup returns pages with PageCompound set, so they
    won't care of a range and there's just the pmd young bit to check in that
    case.

    NOTE: in some cases if the L2 cache is small, this may slowdown and waste
    memory during COWs because 4M of memory are accessed in a single fault
    instead of 8k (the payoff is that after COW the program can run faster).
    So we might want to switch the copy_huge_page (and clear_huge_page too) to
    not temporal stores. I also extensively researched ways to avoid this
    cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
    up to 1M (I can send those patches that fully implemented prefault) but I
    concluded they're not worth it and they add an huge additional complexity
    and they remove all tlb benefits until the full hugepage has been faulted
    in, to save a little bit of memory and some cache during app startup, but
    they still don't improve substantially the cache-trashing during startup
    if the prefault happens in >4k chunks. One reason is that those 4k pte
    entries copied are still mapped on a perfectly cache-colored hugepage, so
    the trashing is the worst one can generate in those copies (cow of 4k page
    copies aren't so well colored so they trashes less, but again this results
    in software running faster after the page fault). Those prefault patches
    allowed things like a pte where post-cow pages were local 4k regular anon
    pages and the not-yet-cowed pte entries were pointing in the middle of
    some hugepage mapped read-only. If it doesn't payoff substantially with
    todays hardware it will payoff even less in the future with larger l2
    caches, and the prefault logic would blot the VM a lot. If one is
    emebdded transparent_hugepage can be disabled during boot with sysfs or
    with the boot commandline parameter transparent_hugepage=0 (or
    transparent_hugepage=2 to restrict hugepages inside madvise regions) that
    will ensure not a single hugepage is allocated at boot time. It is simple
    enough to just disable transparent hugepage globally and let transparent
    hugepages be allocated selectively by applications in the MADV_HUGEPAGE
    region (both at page fault time, and if enabled with the
    collapse_huge_page too through the kernel daemon).

    This patch supports only hugepages mapped in the pmd, archs that have
    smaller hugepages will not fit in this patch alone. Also some archs like
    power have certain tlb limits that prevents mixing different page size in
    the same regions so they will not fit in this framework that requires
    "graceful fallback" to basic PAGE_SIZE in case of physical memory
    fragmentation. hugetlbfs remains a perfect fit for those because its
    software limits happen to match the hardware limits. hugetlbfs also
    remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
    to be found not fragmented after a certain system uptime and that would be
    very expensive to defragment with relocation, so requiring reservation.
    hugetlbfs is the "reservation way", the point of transparent hugepages is
    not to have any reservation at all and maximizing the use of cache and
    hugepages at all times automatically.

    Some performance result:

    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
    ages3
    memset page fault 1566023
    memset tlb miss 453854
    memset second tlb miss 453321
    random access tlb miss 41635
    random access second tlb miss 41658
    vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
    memset page fault 1566471
    memset tlb miss 453375
    memset second tlb miss 453320
    random access tlb miss 41636
    random access second tlb miss 41637
    vmx andrea # ./largepages3
    memset page fault 1566642
    memset tlb miss 453417
    memset second tlb miss 453313
    random access tlb miss 41630
    random access second tlb miss 41647
    vmx andrea # ./largepages3
    memset page fault 1566872
    memset tlb miss 453418
    memset second tlb miss 453315
    random access tlb miss 41618
    random access second tlb miss 41659
    vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
    vmx andrea # ./largepages3
    memset page fault 2182476
    memset tlb miss 460305
    memset second tlb miss 460179
    random access tlb miss 44483
    random access second tlb miss 44186
    vmx andrea # ./largepages3
    memset page fault 2182791
    memset tlb miss 460742
    memset second tlb miss 459962
    random access tlb miss 43981
    random access second tlb miss 43988

    ============
    #include
    #include
    #include
    #include

    #define SIZE (3UL*1024*1024*1024)

    int main()
    {
    char *p = malloc(SIZE), *p2;
    struct timeval before, after;

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset page fault %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    memset(p, 0, SIZE);
    gettimeofday(&after, NULL);
    printf("memset second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    gettimeofday(&before, NULL);
    for (p2 = p; p2 < p+SIZE; p2 += 4096)
    *p2 = 0;
    gettimeofday(&after, NULL);
    printf("random access second tlb miss %Lu\n",
    (after.tv_sec-before.tv_sec)*1000000UL +
    after.tv_usec-before.tv_usec);

    return 0;
    }
    ============

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

1 commit


10 Sep, 2010

2 commits

  • Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
    show a shift since I last tested in December: in part because of firmware
    updates, in part because of the necessary move from barriers to awaiting
    completion at the block layer. While discard at swapon still shows as
    slightly beneficial on both, discarding 1MB swap cluster when allocating
    is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

    Surrender: discard as presently implemented is more hindrance than help
    for swap; but might prove useful on other devices, or with improvements.
    So continue to do the discard at swapon, but make discard while swapping
    conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
    only the lower 16 bits of int flags).

    We can add a --discard or -d to swapon(8), and a "discard" to swap in
    /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Nigel Cunningham
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Please revert 2.6.36-rc commit d2997b1042ec150616c1963b5e5e919ffd0b0ebf
    "hibernation: freeze swap at hibernation". It complicated matters by
    adding a second swap allocation path, just for hibernation; without in any
    way fixing the issue that it was intended to address - page reclaim after
    fixing the hibernation image might free swap from a page already imaged as
    swapcache, letting its swap be reallocated to store a different page of
    the image: resulting in data corruption if the imaged page were freed as
    clean then swapped back in. Pages freed to si->swap_map were still in
    danger of being reallocated by the alternative allocation path.

    I guess it inadvertently fixed slow SSD swap allocation for hibernation,
    as reported by Nigel Cunningham: by missing out the discards that occur on
    the usual swap allocation path; but that was unintentional, and needs a
    separate fix.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Aug, 2010

1 commit


10 Aug, 2010

1 commit

  • When taking a memory snapshot in hibernate_snapshot(), all (directly
    called) memory allocations use GFP_ATOMIC. Hence swap misusage during
    hibernation never occurs.

    But from a pessimistic point of view, there is no guarantee that no page
    allcation has __GFP_WAIT. It is better to have a global indication "we
    enter hibernation, don't use swap!".

    This patch tries to freeze new-swap-allocation during hibernation. (All
    user processes are frozenm so swapin is not a concern).

    This way, no updates will happen to swap_map[] between
    hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free()
    is called. We can be assured that swap corruption will not occur.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Ondrej Zary
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 May, 2010

1 commit

  • This patch adds support for moving charge of file pages, which include
    normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
    bit 1 of /memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages(and swaps) in the range
    mmapped by the task will be moved even if the task hasn't done page fault,
    i.e. they might not be the task's "RSS", but other task's "RSS" that maps
    the same file. And mapcount of the page is ignored(the page can be moved
    even if page_mapcount(page) > 1). So, conditions that the page/swap
    should be met to be moved is that it must be in the range mmapped by the
    target task and it must be charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

25 May, 2010

3 commits

  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
    Memory compaction needs access to these modes for isolating pages for
    migration. This patch exports them.

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer.
    This is regression of caused by commit 9ff473b9a7 ("vmscan: evict
    streaming IO first"). Wow, It is 2 years old patch!

    Currently, tmpfs file cache is inserted active list at first. This means
    that the insertion doesn't only increase numbers of pages in anon LRU, but
    it also reduces anon scanning ratio. Therefore, vmscan will get totally
    confused. It scans almost only file LRU even though the system has plenty
    unused tmpfs pages.

    Historically, lru_cache_add_active_anon() was used for two reasons.
    1) Intend to priotize shmem page rather than regular file cache.
    2) Intend to avoid reclaim priority inversion of used once pages.

    But we've lost both motivation because (1) Now we have separate anon and
    file LRU list. then, to insert active list doesn't help such priotize.
    (2) In past, one pte access bit will cause page activation. then to
    insert inactive list with pte access bit mean higher priority than to
    insert active list. Its priority inversion may lead to uninteded lru
    chun. but it was already solved by commit 645747462 (vmscan: detect
    mapped file pages used only once). (Thanks Hannes, you are great!)

    Thus, now we can use lru_cache_add_anon() instead.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Henrique de Moraes Holschuh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

19 May, 2010

1 commit

  • Added SWP_BLKDEV flag to distinguish block and regular file backed
    swap devices. We could also check if a swap is entire block device,
    rather than a file, by:
    S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode)
    but, I think, simply checking this flag is more convenient.

    Signed-off-by: Nitin Gupta
    Acked-by: Linus Torvalds
    Acked-by: Nigel Cunningham
    Acked-by: Pekka Enberg
    Reviewed-by: Minchan Kim
    Signed-off-by: Greg Kroah-Hartman

    Nitin Gupta
     

13 Mar, 2010

1 commit

  • This patch is another core part of this move-charge-at-task-migration
    feature. It enables moving charges of anonymous swaps.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is on swap cache.
    - swap_lock: if the entry is not on swap cache.

    This works well in usual swap-in/out activity.

    But this behavior make the feature of moving swap charge check many
    conditions to exchange swap_cgroup's record safely.

    So I changed modification of swap_cgroup's recored(swap_cgroup_record())
    to use xchg, and define a new function to cmpxchg swap_cgroup's record.

    This patch also enables moving charge of non pte_present but not uncharged
    swap caches, which can be exist on swap-out path, by getting the target
    pages via find_get_page() as do_mincore() does.

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

16 Dec, 2009

10 commits

  • Seems that page_io.c doesn't really need to know that page_private(page)
    is the swp_entry 'val'. Rework map_swap_page() to do what its name says
    and map a page to a page offset in the swap space.

    The only other caller of map_swap_page() is internal to mm/swapfile.c and
    it does want to map a swap entry to the 'sector'. So rename
    map_swap_page() to map_swap_entry(), make it 'static' and and implement
    map_swap_page() as a wrapper around that.

    Signed-off-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Reorder (and comment) the fields of swap_info_struct, to make better
    use of its cachelines: it's good for swap_duplicate() in particular
    if unsigned int max and swap_map are near the start.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While we're fiddling with the swap_map values, let's assign a particular
    value to shmem/tmpfs swap pages: their swap counts are never incremented,
    and it helps swapoff's try_to_unuse() a little if it can immediately
    distinguish those pages from process pages.

    Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
    we might as well use that 0xbf value for SWAP_MAP_SHMEM.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swap is duplicated (reference count incremented by one) whenever the same
    swap page is inserted into another mm (when forking finds a swap entry in
    place of a pte, or when reclaim unmaps a pte to insert the swap entry).

    swap_info_struct's vmalloc'ed swap_map is the array of these reference
    counts: but what happens when the unsigned short (or unsigned char since
    the preceding patch) is full? (and its high bit is kept for a cache flag)

    We then lose track of it, never freeing, leaving it in use until swapoff:
    at which point we _hope_ that a single pass will have found all instances,
    assume there are no more, and will lose user data if we're wrong.

    Swapping of KSM pages has not yet been enabled; but it is implemented,
    and makes it very easy for a user to overflow the maximum swap count:
    possible with ordinary process pages, but unlikely, even when pid_max
    has been raised from PID_MAX_DEFAULT.

    This patch implements swap count continuations: when the count overflows,
    a continuation page is allocated and linked to the original vmalloc'ed
    map page, and this used to hold the continuation counts for that entry
    and its neighbours. These continuation pages are seldom referenced:
    the common paths all work on the original swap_map, only referring to
    a continuation page when the low "digit" of a count is incremented or
    decremented through SWAP_MAP_MAX.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
    chars: it's still very unusual to reach a swap count of 126, and the
    next patch allows it to be extended indefinitely.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Though swap_count() is useful, I'm finding that swap_has_cache() and
    encode_swapmap() obscure what happens in the swap_map entry, just at
    those points where I need to understand it. Remove them, and pass
    more usable "usage" values to scan_swap_map(), swap_entry_free() and
    __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Make better use of the space by folding first swap_extent into its
    swap_info_struct, instead of just the list_head: swap partitions need
    only that one, and for others it's used as a circular list anyway.

    [jirislaby@gmail.com: fix crash on double swapon]
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
    to reserve an array of about 30 of them in bss, when most people will
    want only one. Change swap_info[] to an array of pointers.

    That does need a "type" field in the structure: pack it as a char with
    next type and short prio (aha, char is unsigned by default on PowerPC).
    Use the (admittedly peculiar) name "type" throughout for this index.

    /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
    care with barriers when adding a new item to the array (never removed).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap_info_struct is mostly private to mm/swapfile.c, with only
    one other in-tree user: get_swap_bio(). Adjust its interface to
    map_swap_page(), so that we can then remove get_swap_info_struct().

    But there is a popular user out-of-tree, TuxOnIce: so leave the
    declaration of swap_info_struct in linux/swap.h.

    Signed-off-by: Hugh Dickins
    Cc: Nigel Cunningham
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
    there are no present pages left.

    In such a situation, kswapd must also be stopped since it has nothing left
    to do.

    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Yasunori Goto
    Cc: Mel Gorman
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Sep, 2009

3 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages and reclaims pages from it and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targetted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the size by
    which the soft limit is exceeded. The proportion has been empirically
    determined.

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • Memory migration uses special swap entry types to trigger special actions on
    page faults. Extend this mechanism to also support poisoned swap entries, to
    trigger poison handling on page faults. This allows follow-on patches to
    prevent processes from faulting in poisoned pages again.

    v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
    v3: Better overflow fix (Hidehiro Kawai)

    Signed-off-by: Andi Kleen

    Andi Kleen