16 Apr, 2015

1 commit

  • "deactivate_page" was created for file invalidation so it has too
    specific logic for file-backed pages. So, let's change the name of the
    function and date to a file-specific one and yield the generic name.
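
    A minimal before/after sketch (the new name is from the commit title):

        /* before: generic name, file-specific behaviour */
        void deactivate_page(struct page *page);

        /* after: file-specific name; the generic one is freed for reuse */
        void deactivate_file_page(struct page *page);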

    Signed-off-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Wang, Yalin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

14 Dec, 2014

1 commit

  • swp_entry_t being defined in include/linux/swap.h instead of
    include/linux/mm_types.h causes a cyclic include dependency later
    when include/linux/page_cgroup.h is included from the writeback
    path. Move the definition to include/linux/mm_types.h.

    While at it, reformat the comment above it.
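
    The moved definition, as it appears in include/linux/mm_types.h
    afterwards:

        /*
         * A swap entry has to fit into a "unsigned long", as the entry is
         * hidden in the "index" field of the swapper address space.
         */
        typedef struct {
                unsigned long val;
        } swp_entry_t;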

    Signed-off-by: Tejun Heo
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

10 Oct, 2014

2 commits

  • In a memcg with even just moderate cache pressure, success rates for
    transparent huge page allocations drop to zero, wasting a lot of effort
    that the allocator puts into assembling these pages.

    The reason for this is that the memcg reclaim code was never designed for
    higher-order charges. It reclaims in small batches until there is room
    for at least one page. Huge page charges only succeed when these batches
    add up over a series of huge faults, which is unlikely under any
    significant load involving order-0 allocations in the group.

    Remove that loop on the memcg side in favor of passing the actual reclaim
    goal to direct reclaim, which is already set up and optimized to meet
    higher-order goals efficiently.
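
    A sketch of the resulting charge-time reclaim call, assuming the
    post-series signature of try_to_free_mem_cgroup_pages() that takes the
    reclaim goal in pages:

        /* reclaim the full size of the charge in one call, instead of
         * looping in small batches until enough room appears */
        nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                    gfp_mask, may_swap);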

    This brings memcg's THP policy in line with the system policy: if the
    allocator painstakingly assembles a hugepage, memcg will at least make an
    honest effort to charge it. As a result, transparent hugepage allocation
    rates amid cache activity are drastically improved:

                                vanilla                patched
    pgalloc            4717530.80 ( +0.00%)   4451376.40 (  -5.64%)
    pgfault             491370.60 ( +0.00%)    225477.40 ( -54.11%)
    pgmajfault               2.00 ( +0.00%)         1.80 (  -6.67%)
    thp_fault_alloc          0.00 ( +0.00%)       531.60 (+100.00%)
    thp_fault_fallback     749.00 ( +0.00%)       217.40 ( -70.88%)

    [ Note: this may in turn increase memory consumption from internal
    fragmentation, which is an inherent risk of transparent hugepages.
    Some setups may have to adjust the memcg limits accordingly to
    accommodate this - or, if the machine is already packed to capacity,
    disable the transparent huge page feature. ]

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The deprecation warnings for the scan_unevictable interface are
    triggered by scripts doing `sysctl -a | grep something else'. This is
    annoying and not helpful.

    The interface has been defunct since 264e56d8247e ("mm: disable user
    interface to manually rescue unevictable pages"), which was in 2011, and
    there haven't been any reports of usecases for it, only reports that the
    deprecation warnings are annoying. It's unlikely that anybody is using
    this interface specifically at this point, so remove it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.
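
    A sketch of a charging site using the new transaction, with the
    signatures this series introduces (something_failed stands in for the
    caller's error path):

        struct mem_cgroup *memcg;
        int error;

        error = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
        if (error)
                return error;   /* reclaim failed, abort the fault */

        /* ... establish page->mapping and rmap ... */

        if (something_failed)
                mem_cgroup_cancel_charge(page, memcg);
        else
                /* false: the page is not on the LRU yet (no lrucare) */
                mem_cgroup_commit_charge(page, memcg, false);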

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

1 commit

  • Do we really need an exported alias for __SetPageReferenced()? Its
    callers had better know what they're doing, in which case the page
    would not already be marked referenced. Kill init_page_accessed() and
    just use __SetPageReferenced() inline.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2014

6 commits

  • Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.
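
    The change, roughly (the surrounding min() against the low watermark
    is omitted):

        /* before */
        balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
                KSWAPD_ZONE_BALANCE_GAP_RATIO;

        /* after: the same round-up division, with the intent spelled out */
        balance_gap = DIV_ROUND_UP(zone->managed_pages,
                KSWAPD_ZONE_BALANCE_GAP_RATIO);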

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and does some
    rephrasing.

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • aops->write_begin may allocate a new page and make it visible only to
    have mark_page_accessed called almost immediately after. Once the page
    is visible the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs, but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before
    the page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper pagecache_get_page() which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or
    not. The old API is preserved but is basically a thin wrapper around
    this core function.
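
    For illustration, one of the old APIs reduced to a thin wrapper (the
    parameter list is simplified here; the FGP_* flags are the ones this
    patch introduces):

        static inline struct page *
        find_or_create_page(struct address_space *mapping, pgoff_t index,
                            gfp_t gfp)
        {
                return pagecache_get_page(mapping, index,
                                FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
        }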

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() call has now changed, so in rare cases it's
    possible a page gets to the end of the LRU as PageReferenced whereas
    previously it might have been repromoted. This is expected to be rare
    but it's worth the filesystem people thinking about it in case they see
    a problem with the timing change. It is also the case that some
    filesystems may be marking pages accessed that previously did not, but
    it makes sense that filesystems have consistent behaviour in this
    regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are
    expected to be more stable. The exception is tmpfs where the normal
    case is for the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or
    NUMA artifacts. Throughput and wall times are presented for sync IO;
    only wall times are shown for async, as the granularity reported by dd
    and the variability make it unsuitable for comparison. As async results
    were variable due to writeback timings, I'm only reporting the maximum
    figures. The sync results were stable enough to make the mean and
    stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                            3.15.0-rc3          3.15.0-rc3
                               vanilla         accessed-v2
    ext3   Max elapsed   13.9900 ( 0.00%)   11.5900 ( 17.16%)
    tmpfs  Max elapsed    0.5100 ( 0.00%)    0.4900 (  3.92%)
    btrfs  Max elapsed   12.8100 ( 0.00%)   12.7800 (  0.23%)
    ext4   Max elapsed   18.6000 ( 0.00%)   13.3400 ( 28.28%)
    xfs    Max elapsed   12.5600 ( 0.00%)    2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    fs     samples  percentage  kernel image                       symbol
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, so make it one. Also make the likely case the "if"
    part of the block instead of the else, as this is preferred according
    to the optimisation manual.
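
    Roughly, in the free path:

        /* likely case first: most frees are of hot pages */
        if (likely(!cold))
                list_add(&page->lru, &pcp->lists[migratetype]);
        else
                list_add_tail(&page->lru, &pcp->lists[migratetype]);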

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually. All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the plist
    when it becomes full or not-full; previously it could not do so because
    swap_info_struct->lock is held when it changes from full <-> not-full,
    and the swap_lock protecting the main swap_active_head must be taken
    before any swap_info_struct->lock.
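
    A sketch of the full/not-full transitions under the new lock (field
    and list names as added by this patch):

        /* device has free entries again: offer it to get_swap_page() */
        spin_lock(&swap_avail_lock);
        plist_add(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);

        /* device became full: hide it from allocation */
        spin_lock(&swap_avail_lock);
        plist_del(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);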

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
    where different-priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.
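
    A sketch of that requeue in get_swap_page(), using the helper from the
    third patch:

        /* rotate this device behind others of the same priority, so
         * equal-priority swap devices are used round-robin */
        spin_lock(&swap_avail_lock);
        plist_requeue(&si->avail_list, &swap_avail_head);
        spin_unlock(&swap_avail_lock);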

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add() and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

04 Apr, 2014

2 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.
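
    Put together, the four points above map onto a node layout roughly
    like this (as in the kernel around this time):

        struct radix_tree_node {
                unsigned int    path;   /* offset in parent and height,
                                           packed into one word (1) */
                unsigned int    count;  /* entries, shadow entries counted
                                           in the unused upper bits (2) */
                union {
                        struct {
                                struct radix_tree_node *parent;
                                void *private_data; /* address_space
                                                       backpointer (3) */
                        };
                        struct rcu_head rcu_head; /* shares the 2 words */
                };
                struct list_head private_list;  /* shadow node LRU (4) */
                void __rcu      *slots[RADIX_TREE_MAP_SIZE];
                unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
        };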

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have been
    shown to benefit from caching in the past. We call the recently used
    list the "inactive list" and the frequently used list the "active
    list". But with detection tied to the size of the inactive list, this
    scheme ultimately was not significantly better than a FIFO policy and
    still thrashed cache based on eviction speed, rather than actual demand
    for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.
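
    A sketch of the eviction/refault pair (function names as in
    mm/workingset.c from this series):

        /* on eviction: snapshot the zone's aging counter into the
         * now-empty radix tree slot */
        shadow = workingset_eviction(mapping, page);
        __delete_from_page_cache(page, shadow);

        /* on refault: a small refault distance means the page was part
         * of the working set, so start it out active */
        if (workingset_refault(shadow))
                SetPageActive(page);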

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation for further improving the
    protection ability and reducing the minimum inactive list size below
    the current 50%.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Sep, 2013

1 commit

  • Make lru_add_drain_all() selectively interrupt only the cpus that have
    per-cpu free pages that can be drained.

    This is important in nohz mode where calling mlockall(), for example,
    otherwise will interrupt every core unnecessarily.

    This is important on workloads where nohz cores are handling 10 Gb traffic
    in userspace. Those CPUs do not enter the kernel and place pages into LRU
    pagevecs and they really, really don't want to be interrupted, or they
    drop packets on the floor.
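
    The core of the change, roughly: schedule the drain work only on CPUs
    whose pagevecs are non-empty:

        for_each_online_cpu(cpu) {
                if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
                    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
                    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
                    need_activate_page_drain(cpu)) {
                        struct work_struct *work =
                                &per_cpu(lru_add_drain_work, cpu);

                        INIT_WORK(work, lru_add_drain_per_cpu);
                        schedule_work_on(cpu, work);
                        cpumask_set_cpu(cpu, &has_work);
                }
        }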

    Signed-off-by: Chris Metcalf
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

12 Sep, 2013

4 commits

  • PageSwapCache() is always false when !CONFIG_SWAP, so the compiler can
    properly discard the related code. Therefore, we don't need the
    explicit #ifdef.
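
    The pattern relied on, sketched from include/linux/page-flags.h:

        #ifdef CONFIG_SWAP
        PAGEFLAG(SwapCache, swapcache)
        #else
        PAGEFLAG_FALSE(SwapCache)       /* PageSwapCache() is constant 0 */
        #endif

        /* callers need no #ifdef; the branch is dead code and dropped */
        if (PageSwapCache(page))
                try_to_free_swap(page);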

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Swap cluster allocation exists to get better request merging and so
    improve performance. But the cluster is shared globally; if multiple
    tasks are doing swap, this causes interleaved disk access. Multiple
    tasks swapping is quite common: for example, each numa node has a
    kswapd thread doing swap, plus multiple threads/processes doing direct
    page reclaim.

    The IO scheduler can't help much here, because the tasks don't send
    their swapout IO down to the block layer at the same time. The block
    layer does merge some IOs, but misses a lot of them, depending on how
    many tasks are doing swapout concurrently. In practice, I've seen a
    lot of small-size IO in swapout workloads.

    We make the cluster allocation per-cpu here. The interleaved disk
    access issue goes away: all tasks swap out to their own cluster, so
    swapout becomes sequential and can easily be merged into big-size IO.
    If one CPU can't get its per-cpu cluster (for example, there is no
    free cluster left in the swap device), it falls back to scanning
    swap_map, so the CPU can still continue to swap. We don't need to
    recycle free swap entries of other CPUs.
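
    The per-cpu state added by this patch, roughly:

        struct percpu_cluster {
                struct swap_cluster_info index; /* current cluster */
                unsigned int next;      /* likely next allocation offset */
        };

        /* in struct swap_info_struct: */
        struct percpu_cluster __percpu *percpu_cluster;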

    In my test (swap to a 2-disk raid0 partition), this improves around 10%
    swapout throughput, and request size is increased significantly.

    How this impacts swap readahead is uncertain, though. On one side,
    page reclaim always isolates and swaps several adjacent pages, which
    makes page reclaim write the pages sequentially and benefits
    readahead. On the other side, several CPUs writing pages interleaved
    means the pages don't live _sequentially_ but relatively _near_. In
    the per-cpu allocation case, if adjacent pages are written by
    different cpus, they will live relatively _far_ apart. So how this
    impacts swap readahead depends on how many pages page reclaim isolates
    and swaps at one time. If the number is big, this patch will benefit
    swap readahead. Of course, this is about the sequential access
    pattern. The patch has no impact for random access patterns, because
    the new cluster allocation algorithm is just for SSD.

    An alternative solution is organizing the swap layout to be per-mm
    instead of this per-cpu approach. In the per-mm layout, we allocate a
    disk range for each mm, so pages of one mm live adjacently in the swap
    disk. The per-mm layout has potential lock contention issues if
    multiple reclaimers are swapping pages from one mm. For a sequential
    workload, the per-mm layout is better for implementing swap readahead,
    because pages from the mm are adjacent on disk. But the per-cpu layout
    isn't very bad in this workload either: as page reclaim always
    isolates and swaps several pages at one time, such pages will still
    live on disk sequentially and readahead can utilize this. For a random
    workload, the per-mm layout doesn't benefit request merging, because
    it's quite possible that pages from different mms are swapped out at
    the same time and the IO can't be merged in the per-mm layout, while
    with the per-cpu layout we can merge requests from any mm. Considering
    that random workloads are more popular in workloads with swap (and the
    per-cpu approach isn't too bad for sequential workloads either), I'm
    choosing the per-cpu layout.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Swap can do cluster discard for SSD, which is good, but there are some
    problems here:

    1. swap does the discard just before page reclaim gets a swap entry
    and writes the disk sectors. This is useless for high end SSD, because
    an overwrite to a sector implies a discard of the original sector too.
    A discard + overwrite == overwrite.

    2. the purpose of doing discard is to improve SSD firmware garbage
    collection. Ideally we should send the discard as early as possible,
    so the firmware can do something smart. Sending the discard just after
    the swap entry is freed is considered early compared to sending it
    before the write. Of course, if the workload is already bound to gc
    speed, sending the discard earlier or later doesn't make much
    difference.

    3. block discard is a sync API, which will delay scan_swap_map()
    significantly.

    4. Write and discard commands can be executed in parallel in a PCIe
    SSD. Making swap discard async can make execution more efficient.

    This patch makes swap discard async and moves the discard to where the
    swap entry is freed. Discard and write have no dependence now, so the
    above issues can be avoided. Ideally we should do the discard for any
    freed sectors, but some SSD discard is very slow. This patch still
    does the discard for a whole cluster.

    My test does several rounds of 'mmap, write, unmap', which trigger a
    lot of swap discard. On a fusionio card, with this patch, the test
    runtime is reduced to 18% of the time without it, so around 5.5x
    faster.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to
    20~30% CPU time (when cluster is hard to find, the CPU time can be up to
    80%), which becomes a bottleneck. scan_swap_map() scans a byte array to
    search a 256 page cluster, which is very slow.

    Here I introduced a simple algorithm to search for a cluster. Since
    we only care about 256-page clusters, we can just use a counter to
    track whether a cluster is free. Every 256 pages use one int to store
    the counter. If the counter of a cluster is 0, the cluster is free.
    All free clusters are added to a list, so searching for a cluster is
    very efficient. With this, the scan_swap_map() overhead disappears.
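
    The per-cluster bookkeeping, roughly as introduced:

        struct swap_cluster_info {
                unsigned int data:24;   /* free count if the cluster is
                                           free, else next cluster index */
                unsigned int flags:8;   /* e.g. CLUSTER_FLAG_FREE */
        };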

    This might help low end SD card swap too. Because if the cluster is
    aligned, SD firmware can do flash erase more efficiently.

    We only enable the algorithm for SSD. Hard disk swap isn't fast enough
    and has downside with the algorithm which might introduce regression (see
    below).

    The patch slightly changes which cluster is chosen: it always adds a
    freed cluster to the list tail. This can help wear leveling for low
    end SSD too. If no free cluster is found, scan_swap_map() will search
    from the end of the last free cluster, which is effectively random.
    For SSD, this isn't a problem at all.

    Another downside is that the cluster must be aligned to 256 pages,
    which reduces the chance of finding a cluster. I would expect this
    isn't a big problem for SSD because there is no seek penalty. (And
    this is the reason I only enable the algorithm for SSD.)

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Kyungmin Park
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Jul, 2013

2 commits

  • Considering the use cases where the swap device supports discard:
    a) and can do it quickly;
    b) but it's slow to do in small granularities (or concurrent with other
    I/O);
    c) but the implementation is so horrendous that you don't even want to
    send one down;

    And assuming that the sysadmin considers it useful to send the discards down
    at all, we would (probably) want the following solutions:

    i. do the fine-grained discards for freed swap pages, if device is
    capable of doing so optimally;
    ii. do single-time (batched) swap area discards, either at swapon
    or via something like fstrim (not implemented yet);
    iii. allow doing both single-time and fine-grained discards; or
    iv. turn it off completely (default behavior)

    As implemented today, one can only enable/disable discards for swap,
    but one cannot select, for instance, solution (ii) on a swap device
    like (b) even though the single-time discard is regarded as
    interesting, or necessary to the workload, because it would imply (i),
    and the device is not capable of performing it optimally.

    This patch addresses the scenario depicted above by introducing a way
    to ensure the (probably) wanted solutions (i, ii, iii and iv) can be
    flexibly flagged through swapon(8), to allow a sysadmin to select the
    best suitable swap discard policy according to system constraints.

    This patch introduces the new flags SWAP_FLAG_DISCARD_PAGES and
    SWAP_FLAG_DISCARD_ONCE to allow more flexible swap discard policies to
    be flagged through swapon(8). The default behavior is to keep both
    single-time, or batched, area discards (SWAP_FLAG_DISCARD_ONCE) and
    fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES)
    enabled, in order to keep consistency with older kernel behavior, as
    well as to maintain compatibility with older swapon(8). However,
    through the newly introduced flags the best suitable discard policy
    can be selected according to any given swap device constraint.
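
    The new flags, with values as in the mainline header:

        #define SWAP_FLAG_DISCARD       0x10000 /* enable discard for swap */
        #define SWAP_FLAG_DISCARD_ONCE  0x20000 /* discard swap area at
                                                   swapon-time */
        #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters
                                                   after use */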

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Rafael Aquini
    Acked-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Karel Zak
    Cc: Jeff Moyer
    Cc: Rik van Riel
    Cc: Larry Woodman
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Similar to __pagevec_lru_add, this patch removes the LRU parameter
    from __lru_cache_add and lru_cache_add_lru, as the caller does not
    control the exact LRU the page gets added to. lru_cache_add_lru gets
    renamed to lru_cache_add, as the name is silly without the lru
    parameter. With the parameter removed, the caller must indicate
    whether they want the page added to the active or inactive list by
    setting or clearing PageActive respectively.
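
    Callers now express the target list through the page flag, roughly:

        /* start the page on the active list */
        SetPageActive(page);
        lru_cache_add(page);

        /* start the page on the inactive list */
        ClearPageActive(page);
        lru_cache_add(page);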

    [akpm@linux-foundation.org: Suggested the patch]
    [gang.chen@asianux.com: fix used-unintialized warning]
    Signed-off-by: Mel Gorman
    Signed-off-by: Chen Gang
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Apr, 2013

3 commits

  • In page reclaim, a huge page is split. split_huge_page() adds the tail
    pages to the LRU list. Since we are reclaiming a huge page, it's
    better to reclaim all subpages of the huge page instead of just the
    head page. This patch adds the split tail pages to the shrink page
    list so the tail pages can be reclaimed soon.

    Before this patch, run a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    fallback allocation is reduced a lot.

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • To prevent flooding the swap device with writebacks, frontswap backends
    need to count and limit the number of outstanding writebacks. The
    incrementing of the counter can be done before the call to
    __swap_writepage(). However, the caller must receive a notification
    when the writeback completes in order to decrement the counter.

    To achieve this functionality, this patch modifies __swap_writepage() to
    take the bio completion callback function as an argument.

    end_swap_bio_write(), the normal bio completion function, is also made
    non-static so that code doing the accounting can call it after the
    accounting is done.

    There should be no behavioural change to existing code.
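
    The resulting interface, roughly (bio completion signature as of this
    era):

        int __swap_writepage(struct page *page, struct writeback_control *wbc,
                        void (*end_write_func)(struct bio *, int));

        /* normal completion, now callable by the accounting code */
        void end_swap_bio_write(struct bio *bio, int err);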

    Signed-off-by: Seth Jennings
    Signed-off-by: Bob Liu
    Acked-by: Minchan Kim
    Reviewed-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • swap_writepage() is currently where frontswap hooks into the swap write
    path to capture pages with the frontswap_store() function. However, if
    a frontswap backend wants to "resume" the writeback of a page to the
    swap device, it can't call swap_writepage() as the page will simply
    reenter the backend.

    This patch separates swap_writepage() into a top and bottom half, the
    bottom half named __swap_writepage() to allow a frontswap backend, like
    zswap, to resume writeback beyond the frontswap_store() hook.

    __add_to_swap_cache() is also made non-static so that the page for which
    writeback is to be resumed can be added to the swap cache.
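
    The split, roughly as mm/page_io.c looks afterwards:

        int swap_writepage(struct page *page, struct writeback_control *wbc)
        {
                if (try_to_free_swap(page)) {
                        unlock_page(page);
                        return 0;
                }
                if (frontswap_store(page) == 0) {
                        /* the backend kept the page; no disk IO needed */
                        set_page_writeback(page);
                        unlock_page(page);
                        end_page_writeback(page);
                        return 0;
                }
                /* bottom half; a backend calls this to resume writeback */
                return __swap_writepage(page, wbc, end_swap_bio_write);
        }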

    Signed-off-by: Seth Jennings
    Signed-off-by: Bob Liu
    Acked-by: Minchan Kim
    Reviewed-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

24 Feb, 2013

5 commits

  • This variable is calculated from nr_free_pagecache_pages so
    change its type to unsigned long.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Currently, the amount of RAM that the nr_free_*_pages functions return
    is held in an unsigned int. But on machines with big memory (exceeding
    16TB), the amount may be incorrect because of overflow, so fix it.

    Signed-off-by: Zhang Yanfei
    Cc: Simon Horman
    Cc: Julian Anastasov
    Cc: David Miller
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • swap_lock is heavily contended when I test swap to 3 fast SSD (even
    slightly slower than swap to 2 such SSD). The main contention comes
    from swap_info_get(). This patch tries to fix the gap with adding a new
    per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without
    swap_lock. In theory, it's possible that get_swap_page() finds no swap
    pages even though there actually are free swap pages. But that doesn't
    sound like a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags needs to hold both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check the flags.
    Reading the flags is OK with either of the locks held.

    If both swap_lock and swap_info_struct.lock must be held, we always
    take the former first to avoid deadlock.
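
    The documented ordering, sketched:

        /* global lock first, then the per-partition lock */
        spin_lock(&swap_lock);
        spin_lock(&p->lock);
        p->flags |= SWP_WRITEOK;        /* flag changes need both locks */
        spin_unlock(&p->lock);
        spin_unlock(&swap_lock);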

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in
    practice swap_lock isn't heavily contended in my test with this patch
    (or I can say there are other much heavier bottlenecks, like TLB
    flush). And BTW, it looks like get_swap_page() doesn't really need the
    lock: we never free swap_info[] and we check the SWAP_WRITEOK flag.
    The only risk without the lock is that we could swap out to some low
    priority swap device, but we can quickly recover after several rounds
    of swap, so that doesn't sound like a big deal to me. But I'd prefer
    to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have one
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the
    array.
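
    The lookup, roughly as introduced:

        struct address_space swapper_spaces[MAX_SWAPFILES];

        #define swap_address_space(entry) \
                (&swapper_spaces[swp_type(entry)])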

    In my test with 3 SSD, this increases the swapout throughput 20%.

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
    amount of pages is scanned from the LRU lists on each iteration, to make
    progress.

    Do not make this minimum bigger than the respective LRU list size,
    however, and save some busy work trying to isolate and reclaim pages
    that are not there.

    Empty LRU lists are quite common with memory cgroups in NUMA
    environments because there exists a set of LRU lists for each zone for
    each memory cgroup, while the memory of a single cgroup is expected to
    stay on just one node. The number of expected empty LRU lists is thus

    memcgs * (nodes - 1) * lru types

    Each attempt to reclaim from an empty LRU list does expensive size
    comparisons between lists, acquires the zone's lru lock etc. Avoid
    that.
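
    The fix in get_scan_count(), roughly:

        size = get_lru_size(lruvec, lru);
        scan = size >> sc->priority;
        if (!scan && force_scan)
                /* cap the forced minimum at the actual list size */
                scan = min(size, SWAP_CLUSTER_MAX);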

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Oct, 2012

1 commit

  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
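
    The simplified interface:

        /* before */
        int page_evictable(struct page *page, struct vm_area_struct *vma);

        /* after: the few vma-aware callers use
         * mlocked_vma_newpage(vma, page) directly */
        int page_evictable(struct page *page);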

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Aug, 2012

4 commits

  • The version of swap_activate introduced is sufficient for swap-over-NFS
    but would not provide enough information to implement a generic handler.
    This patch shuffles things slightly to ensure the same information is
    available for aops->swap_activate() as is available to the core.

    No functionality change.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate the
    blocks, it is critical that information such as block groups, block
    bitmaps and the block descriptor table that map the swap file were
    resident in memory. This patch adds address_space_operations that the VM
    can call when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful swapfile ->swap_activate, the swapfile is marked
    SWP_FILE and swapper_space.a_ops will proxy to
    sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
    pages and ->readpage to read.

    It is perfectly possible that direct_IO be used to read the swap pages but
    it is an unnecessary complication. Similarly, it is possible that
    ->writepage be used instead of direct_io to write the pages but filesystem
    developers have stated that calling writepage from the VM is undesirable
    for a variety of reasons and using direct_IO opens up the possibility of
    writing back batches of swap pages in the future.
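
    How a filesystem opts in, illustrated with the swap-over-NFS
    implementation this series targets (the NFS method names are
    illustrative):

        static const struct address_space_operations nfs_file_aops = {
                /* ... regular methods ... */
                .swap_activate   = nfs_swap_activate,
                .swap_deactivate = nfs_swap_deactivate,
        };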

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this
    function also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.
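
    Roughly, the helpers resolve the swap cache case like so (a sketch;
    page_swap_info() is the series' helper for looking up the
    swap_info_struct of a swap cache page):

        static inline struct address_space *page_file_mapping(struct page *page)
        {
                if (unlikely(PageSwapCache(page)))
                        return page_swap_info(page)->swap_file->f_mapping;
                return page->mapping;
        }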

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Jun, 2012

1 commit

  • Pull frontswap feature from Konrad Rzeszutek Wilk:
    "Frontswap provides a "transcendent memory" interface for swap pages.
    In some environments, dramatic performance savings may be obtained
    because swapped pages are saved in RAM (or a RAM-like device) instead
    of a swap disk. This tag provides the basic infrastructure along with
    some changes to the existing backends."

    Fix up trivial conflict in mm/Makefile due to removal of swap token code
    changing a line next to the new frontswap entry.

    This pull request came in before the merge window even opened, it got
    delayed to after the merge window by me just wanting to make sure it had
    actual users. Apparently IBM is using this on their embedded side, and
    Jan Beulich says that it's already made available for SLES and OpenSUSE
    users.

    Also acked by Rik van Riel, and Konrad points to other people liking it
    too. So in it goes.

    By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
    via Konrad Rzeszutek Wilk
    * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    frontswap: s/put_page/store/g s/get_page/load
    MAINTAINER: Add myself for the frontswap API
    mm: frontswap: config and doc files
    mm: frontswap: core frontswap functionality
    mm: frontswap: core swap subsystem hooks and headers
    mm: frontswap: add frontswap header file

    Linus Torvalds
     

30 May, 2012

3 commits

  • Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
    del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down to
    its target functions.

    This cleanup eliminates a swathe of cruft in memcontrol.c, including
    mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
    mem_cgroup_lru_move_lists() - which never actually touched the lists.

    In their place: mem_cgroup_page_lruvec() decides the lruvec
    (previously a side-effect of add), and mem_cgroup_update_lru_size()
    maintains the lru_size stats.

    Whilst these are simplifications in their own right, the goal is to bring
    the evaluation of lruvec next to the spin_locking of the lrus, in
    preparation for a future patch.
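
    The resulting signatures:

        void add_page_to_lru_list(struct page *page, struct lruvec *lruvec,
                        enum lru_list lru);
        void del_page_from_lru_list(struct page *page, struct lruvec *lruvec,
                        enum lru_list lru);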

    Signed-off-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
    completely remove anon/file and active/inactive lru type filters from
    __isolate_lru_page(), because isolation for 0-order reclaim always
    isolates pages from right lru list. And pages-isolation for lumpy
    shrink_inactive_list() or memory-compaction anyway allowed to isolate
    pages from all evictable lru lists.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch changes memcg's behavior at task_move().

    At task_move(), the kernel scans a task's page table and moves the
    charges for mapped pages from the source cgroup to the target cgroup.
    There has been a bug in the handling of shared anonymous pages for a
    long time.

    Before the patch:
    - The spec says 'shared anonymous pages are not moved.'
    - The implementation was 'shared anonymous pages may be moved':
      if the page_mapcount was low enough, the charge was moved.
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Naoya Horiguchi
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki