09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

06 Jun, 2014

1 commit

  • While working on the address sanitizer for the kernel, I discovered a
    use-after-free bug in __put_anon_vma().

    For the last anon_vma, anon_vma->root is freed before the child
    anon_vma. Later, in anon_vma_free(anon_vma), we reference the
    already-freed anon_vma->root to check its rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.
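
    Roughly, the fixed ordering looks like this (simplified sketch, not
    the literal diff):

        void __put_anon_vma(struct anon_vma *anon_vma)
        {
                struct anon_vma *root = anon_vma->root;

                anon_vma_free(anon_vma);        /* free the child first ... */
                if (root != anon_vma && atomic_dec_and_test(&root->refcount))
                        anon_vma_free(root);    /* ... and the root last */
        }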

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Cc: # v3.0+
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

05 Jun, 2014

7 commits

  • Transform the action part of ttu_flags into individual bits. These
    flags aren't part of any user-space visible API or even trace events.
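
    A sketch of the resulting flag layout (values illustrative; see
    enum ttu_flags in include/linux/rmap.h for the real definition):

        enum ttu_flags {
                TTU_UNMAP = 1,                  /* unmap mode */
                TTU_MIGRATION = 2,              /* migration mode */
                TTU_MUNLOCK = 4,                /* munlock mode */

                TTU_IGNORE_MLOCK = (1 << 8),    /* ignore mlock */
                TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
                TTU_IGNORE_HWPOISON = (1 << 10),/* poisoned page is recoverable */
        };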

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In its munlock mode, try_to_unmap_one() only searches for other mlocked
    vmas; it never unmaps pages. There is no reason for invalidation because
    the ptes are left unchanged.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • In the previous commit ("mm: use the light version __mod_zone_page_state
    in mlocked_vma_newpage()"), an irq-unsafe __mod_zone_page_state() is
    used. As suggested by Andrew, to reduce the risk that new call sites
    use mlocked_vma_newpage() incorrectly without realizing they are adding
    a race, this patch folds mlocked_vma_newpage() into its only call site,
    page_add_new_anon_rmap(), making it open-coded so readers can see what
    is going on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jianyu Zhan
    Suggested-by: Andrew Morton
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mlocked_vma_newpage() is called with the pte lock held (a spinlock),
    which implies preemption is disabled, and the vm stat counter is not
    modified from interrupt context, so we need not use the irq-safe
    mod_zone_page_state() here; the lightweight __mod_zone_page_state()
    is sufficient.

    This patch also documents __mod_zone_page_state() and some of its
    call sites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and was acked by Christoph.

    Most of the credit goes to Hugh and Christoph for clarifying the
    usage of __mod_zone_page_state().
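
    In short, the change picks the lighter stat helper where the locking
    context already rules out the races the heavier one guards against
    (illustrative call, for the NR_MLOCK counter updated here):

        /* pte lock held: preemption is off and this counter is never
         * modified from IRQ context, so the non-irq-safe variant suffices. */
        __mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page));
        /* instead of the irq-safe mod_zone_page_state(...) */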

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • KSM was converted to use rmap_walk() and now nobody uses these functions
    outside mm/rmap.c.

    Let's convert them back to static.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • pte_file_mksoft_dirty() takes its argument by value and returns the
    modified result, so we need to assign the result to @ptfile here;
    otherwise the call is a no-op, which may lead to loss of the
    soft-dirty bit.
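
    The pattern being fixed, sketched (simplified, names as in the
    surrounding code):

        /* no-op: the returned value was discarded */
        pte_file_mksoft_dirty(ptfile);

        /* fixed: keep the returned, soft-dirty-marked pte value */
        ptfile = pte_file_mksoft_dirty(ptfile);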

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Trinity reports BUG:

    sleeping function called from invalid context at kernel/locking/rwsem.c:47
    in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

    __might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
    migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

    Right: since the conversion to a mutex and then an rwsem, we should not
    call put_anon_vma() from inside an rcu_read_lock()ed section; fix the
    two places that did so. And add might_sleep() to anon_vma_free(), as
    suggested by Peter Zijlstra.
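
    A sketch of the corrected pattern (simplified; the affected callers are
    along the page_get_anon_vma()/page_lock_anon_vma_read() paths): drop the
    extra reference only after leaving the RCU read-side section, since
    put_anon_vma() may end up in down_write() and sleep.

        if (!page_mapped(page)) {
                rcu_read_unlock();
                put_anon_vma(anon_vma); /* was called before rcu_read_unlock() */
                return NULL;
        }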

    Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Apr, 2014

1 commit

  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would eventually be encountered again, but
    NR_MLOCK accounting would nevertheless become distorted. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
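
    The fixed call site follows this shape (simplified sketch of the
    try_to_unmap_cluster() loop body):

        if (page == check_page) {
                /* we already hold check_page's lock */
                mlock_vma_page(page);
                ret = SWAP_MLOCK;
        } else if (trylock_page(page)) {
                /* avoid deadlock: only mlock if we can take the lock */
                mlock_vma_page(page);
                unlock_page(page);
        }
        /* if trylock_page() fails, the page simply stays without PG_mlocked */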

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

01 Apr, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "There are two memory management related changes, the CMMA support for
    KVM to avoid swap-in of freed pages and the split page table lock for
    the PMD level. These two come with common code changes in mm/.

    A fix for the long-standing theoretical TLB flush problem; this one
    comes with a common code change in kernel/sched/.

    Another set of changes is Heiko's uaccess work; included is the
    initial set of patches, with more to come.

    And fixes and cleanups as usual"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits)
    s390/con3270: optionally disable auto update
    s390/mm: remove unecessary parameter from pgste_ipte_notify
    s390/mm: remove unnecessary parameter from gmap_do_ipte_notify
    s390/mm: fixing comment so that parameter name match
    s390/smp: limit number of cpus in possible cpu mask
    hypfs: Add clarification for "weight_min" attribute
    s390: update defconfigs
    s390/ptrace: add support for PTRACE_SINGLEBLOCK
    s390/perf: make print_debug_cf() static
    s390/topology: Remove call to update_cpu_masks()
    s390/compat: remove compat exec domain
    s390: select CONFIG_TTY for use of tty in unconditional keyboard driver
    s390/appldata_os: fix cpu array size calculation
    s390/checksum: remove memset() within csum_partial_copy_from_user()
    s390/uaccess: remove copy_from_user_real()
    s390/sclp_early: Return correct HSA block count also for zero
    s390: add some drivers/subsystems to the MAINTAINERS file
    s390: improve debug feature usage
    s390/airq: add support for irq ranges
    s390/mm: enable split page table lock for PMD level
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit

  • Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
    little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
    indicating that remove_migration_ptes() failed to find one of the
    migration entries that was temporarily inserted.

    The problem comes from remap_file_pages()'s switch from vma_interval_tree
    (good for inserting the migration entry) to i_mmap_nonlinear list (no good
    for locating it again); but can only be a problem if the remap_file_pages()
    range does not cover the whole of the vma (zap_pte() clears the range).

    remove_migration_ptes() needs a file_nonlinear method to go down the
    i_mmap_nonlinear list, applying linear location to look for migration
    entries in those vmas too, just in case there was this race.

    The file_nonlinear method does need rmap_walk_control.arg to do this;
    but it never needed vma passed in - vma comes from its own iteration.

    Reported-and-tested-by: Dave Jones
    Reported-and-tested-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Feb, 2014

1 commit

  • In a virtualized environment, and given an appropriate interface, the
    guest can mark pages as unused while they are free (for the s390
    implementation see git commit 45e576b1c3d00206 "guest page hinting
    light"). For the host, the unused state is a property of the pte.

    This patch adds the primitive 'pte_unused' and code to the host swap out
    handler so that pages marked as unused by all mappers are not swapped out
    but discarded instead, thus saving one IO for swap out and potentially
    another one for swap in.
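
    The swap-out side of the change is roughly this (simplified sketch of
    the branch added to try_to_unmap_one()):

        if (pte_unused(pteval)) {
                /*
                 * The guest marked the page unused: don't write it to
                 * swap, just drop the mapping and let reclaim discard it.
                 */
                if (PageAnon(page))
                        dec_mm_counter(mm, MM_ANONPAGES);
                else
                        dec_mm_counter(mm, MM_FILEPAGES);
        }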

    [ Martin Schwidefsky: patch reordering and simplification ]

    Signed-off-by: Konstantin Weitz
    Signed-off-by: Martin Schwidefsky

    Konstantin Weitz
     

24 Jan, 2014

2 commits

  • mm/rmap.c:851:9-10: WARNING: return of 0/1 in function 'invalid_mkclean_vma' with return type bool

    Return statements in functions returning bool should use
    true/false instead of 1/0.

    Generated by: coccinelle/misc/boolreturn.cocci
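
    The resulting helper, for reference (bool returns spelled as
    true/false, as the warning asks):

        static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg)
        {
                if (vma->vm_flags & VM_SHARED)
                        return false;

                return true;
        }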

    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
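
    Roughly, the macro amounts to this (simplified sketch; see
    include/linux/mmdebug.h for the real definition):

        #define VM_BUG_ON_PAGE(cond, page)              \
                do {                                    \
                        if (unlikely(cond)) {           \
                                dump_page(page);        \
                                BUG();                  \
                        }                               \
                } while (0)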

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

9 commits

  • Now we have infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in page_mkclean().

    In this patch, I change the following things:

    1. remove some variants of the rmap traversal functions
       (cf. page_mkclean_file);
    2. mechanically change page_mkclean() to use rmap_walk().
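
    After the change, page_mkclean() is roughly (simplified sketch):

        int page_mkclean(struct page *page)
        {
                int cleaned = 0;
                struct address_space *mapping;
                struct rmap_walk_control rwc = {
                        .arg = (void *)&cleaned,
                        .rmap_one = page_mkclean_one,
                        .invalid_vma = invalid_mkclean_vma,
                };

                BUG_ON(!PageLocked(page));

                if (!page_mapped(page))
                        return 0;

                mapping = page_mapping(page);
                if (!mapping)
                        return 0;

                rmap_walk(page, &rwc);

                return cleaned;
        }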

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in page_referenced().

    In this patch, I change the following things:

    1. remove some variants of the rmap traversal functions
       (cf. page_referenced_ksm, page_referenced_anon,
       page_referenced_file);

    2. introduce a new struct page_referenced_arg and pass it to
       page_referenced_one(), the main function of the rmap walk, in order
       to count references, store vm_flags and check the finish condition;

    3. mechanically change page_referenced() to use rmap_walk().
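
    The per-walk argument looks roughly like this (fields as described
    above):

        struct page_referenced_arg {
                int mapcount;             /* remaining mappings to visit */
                int referenced;           /* accumulated reference count */
                unsigned long vm_flags;   /* vm_flags of referencing vmas */
                struct mem_cgroup *memcg; /* only count this memcg's vmas */
        };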

    [liwanp@linux.vnet.ibm.com: fix BUG at rmap_walk]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Wanpeng Li
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in try_to_munlock().

    In this patch, I change the following things:

    1. remove some variants of the rmap traversal functions
       (cf. try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file);
    2. mechanically change try_to_munlock() to use rmap_walk();
    3. copy and paste comments.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have infrastructure in rmap_walk() to handle the differences
    among the variants of the rmap traversal functions.

    So, just use it in try_to_unmap().

    In this patch, I change the following things:

    1. enable rmap_walk() if !CONFIG_MIGRATION;
    2. mechanically change try_to_unmap() to use rmap_walk().

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There are a lot of common parts in the traversal functions, but there
    are also a few uncommon parts. By assigning the proper function
    pointers in each rmap_walk_control, we can handle these differences
    correctly.

    The following are the differences we should handle:

    1. the difference in lock function in the anon mapping case;
    2. nonlinear handling in the file mapping case;
    3. prechecked conditions:
       checking the memcg in page_referenced(),
       checking VM_SHARED in page_mkclean(),
       checking the temporary vma in try_to_unmap();
    4. exit condition:
       checking page_mapped() in try_to_unmap().

    So, in this patch, I introduce 4 function pointers to handle the above
    differences.
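
    With those hooks, the control structure ends up roughly like this
    (field layout approximate; see include/linux/rmap.h of this series for
    the exact signatures):

        struct rmap_walk_control {
                void *arg;
                /* visit one vma; return SWAP_AGAIN to keep walking */
                int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                                unsigned long addr, void *arg);
                /* exit condition, e.g. page_mapped() in try_to_unmap() */
                int (*done)(struct page *page);
                /* nonlinear handling for the file mapping case */
                int (*file_nonlinear)(struct page *page,
                                      struct address_space *mapping,
                                      struct vm_area_struct *vma);
                /* lock function for the anon mapping case */
                struct anon_vma *(*anon_lock)(struct page *page);
                /* precheck: skip vmas for which this returns true */
                bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
        };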

    Signed-off-by: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • In each rmap traversal case there are some differences, so we need
    function pointers, and arguments to pass to them, in order to handle
    these differences.

    For this purpose, struct rmap_walk_control is introduced in this patch,
    and will be extended in a following patch. Introducing and extending
    are kept separate because it clarifies the changes.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When we traverse an anon_vma, we need to take a read-side anon lock.
    But there are subtle differences between the situations, so we can't
    use the same method to take the lock in every case. Therefore, we need
    to let rmap_walk_anon() take a different lock function.

    This patch is the first step, factoring the lock function for the anon
    lock out of rmap_walk_anon(). It will be used in the migration-entry
    removal case and as the default for rmap_walk_anon().

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To merge all kinds of rmap traversal functions, try_to_unmap(),
    try_to_munlock(), page_referenced() and page_mkclean(), we need to
    extract the common parts and separate out the non-common parts.

    Nonlinear handling is done only in try_to_unmap_file(); the other rmap
    traversal functions don't care about it. Therefore it is better to
    factor nonlinear handling out of try_to_unmap_file() in order to merge
    all kinds of rmap traversal functions easily.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Rmap traversal is used in five different cases: try_to_unmap(),
    try_to_munlock(), page_referenced(), page_mkclean() and
    remove_migration_ptes(). Each one implements its own traversal
    functions for the anon, file and ksm cases. This causes a lot of
    duplication and maintenance overhead, and also makes the code hard to
    understand and error-prone. One example is hugepage handling: there is
    code to compute the hugepage offset correctly in try_to_unmap_file(),
    but there is no such code in rmap_walk_file(). These are used pairwise
    in the migration context, but we missed modifying them pairwise.

    To overcome these drawbacks, we should unify them through one function.
    I chose rmap_walk() as the main function since it has nothing
    unnecessary. To control the behavior of rmap_walk(), I introduce
    struct rmap_walk_control, which holds some function pointers. This
    lets rmap_walk() work for each caller's specific needs.

    This patchset removes a lot of duplicated code, as you can see in the
    short-stat below, and the kernel text size also decreases slightly.

    text data bss dec hex filename
    10640 1 16 10657 29a1 mm/rmap.o
    10047 1 16 10064 2750 mm/rmap.o

    13823 705 8288 22816 5920 mm/ksm.o
    13199 705 8288 22192 56b0 mm/ksm.o

    This patch (of 9):

    We have to recompute pgoff if the given page is huge, since a result
    based on HPAGE_SIZE is not appropriate for scanning the vma interval
    tree, as shown by commit 36e4f20af833 ("hugetlb: do not use
    vma_hugecache_offset() for vma_prio_tree_foreach") and commit 369a713e
    ("rmap: recompute pgoff for unmapping huge page").

    To handle both cases, a normal page-cache page and a hugetlb page, in
    the same way, we can use compound_order(). It returns 0 for a
    non-compound page and the proper order for a compound page.
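
    The resulting computation is a one-liner (sketch, matching the
    description above):

        /* no extra shift for a normal page, huge-page order for a compound page */
        pgoff_t pgoff = page->index << compound_order(page);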

    Signed-off-by: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

19 Dec, 2013

1 commit


15 Nov, 2013

2 commits

  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock we can't know which lock we need to take
    before we find the relevant pmd.

    Let's move lock taking inside the function.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Sep, 2013

2 commits

  • We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
    There's not much sense in leaving transparent huge pages unaccounted
    there and only adding them in when printing to userspace.

    Let's account transparent huge pages in NR_ANON_PAGES in the first
    place.
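
    In the rmap accounting paths this roughly means updating the counter by
    the number of base pages backing the mapping (illustrative call):

        /* count all base pages of a (possibly huge) anonymous page */
        __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
                              hpage_nr_pages(page));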

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Dave Hansen
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Hillf Danton
    Cc: Ning Qu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • While accounting memcg page stats, it's not worth using
    MEMCG_NR_FILE_MAPPED as an extra layer of indirection, given the
    complexity and presumed performance overhead. We can use
    MEM_CGROUP_STAT_FILE_MAPPED directly.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Fengguang Wu
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

29 Aug, 2013

1 commit

  • The last remaining use for the storage key of the s390 architecture
    is reference counting. The alternative is to make page table entries
    invalid while they are old. On access, the fault handler marks the
    pte/pmd as young, which makes the pte/pmd valid if the access rights
    allow read access. The pte/pmd invalidations required for
    software-managed reference bits cost a bit of performance; on the
    other hand, the RRBE/RRBM instructions to read and reset the
    referenced bits are quite expensive as well.

    Reviewed-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

14 Aug, 2013

2 commits

  • Andy reported that if a file page gets reclaimed we lose the soft-dirty
    bit if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
    address gets encoded into the pte entry. Thus when a #PF happens on
    such a non-present pte we can restore it.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Minchan Kim
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
    set gets swapped out, the bit is lost and no longer available when the
    pte is read back.

    To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is
    saved in the pte entry for the page being swapped out. When such a
    page is to be read back from the swap cache we check for the bit's
    presence and, if it's there, we clear it and restore the former
    _PAGE_SOFT_DIRTY bit.

    One of the problems was finding a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. _PAGE_PSE
    was chosen for that; it doesn't intersect with the swap entry format
    stored in the pte.
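
    The save/restore pattern then looks roughly like this (sketch; helper
    names as in the x86 soft-dirty code, the two fragments belong to the
    swap-out and read-back paths respectively):

        /* on swap-out: carry the soft-dirty state into the swap pte */
        if (pte_soft_dirty(pteval))
                swp_pte = pte_swp_mksoft_dirty(swp_pte);

        /* on read-back from the swap cache: restore the original bit */
        if (pte_swp_soft_dirty(swp_pte))
                pte = pte_mksoft_dirty(pte);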

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

10 Jul, 2013

1 commit

  • These VM_ macros aren't used very often, and three of them
    aren't used at all.

    Expand the ones that are used in place, and remove all the now-unused
    #define VM_ macros.

    VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added just
    before 2.4 and appear to have never been used.

    Signed-off-by: Joe Perches
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

04 Jul, 2013

1 commit

  • Similar to __pagevec_lru_add, this patch removes the LRU parameter from
    __lru_cache_add and lru_cache_add_lru, as the caller does not control
    the exact LRU the page gets added to. lru_cache_add_lru gets renamed to
    lru_cache_add; the old name is silly without the lru parameter. With
    the parameter removed, the caller must indicate whether they want the
    page added to the active or inactive list by setting or clearing
    PageActive respectively.
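
    Callers then follow this pattern (illustrative):

        /* request the active list ... */
        SetPageActive(page);
        lru_cache_add(page);

        /* ... or the inactive list */
        ClearPageActive(page);
        lru_cache_add(page);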

    [akpm@linux-foundation.org: Suggested the patch]
    [gang.chen@asianux.com: fix used-unintialized warning]
    Signed-off-by: Mel Gorman
    Signed-off-by: Chen Gang
    Cc: Jan Kara
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Alexey Lyahkov
    Cc: Andrew Perepechko
    Cc: Robin Dong
    Cc: Theodore Tso
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Bernd Schubert
    Cc: David Howells
    Cc: Trond Myklebust
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Apr, 2013

1 commit

  • We have to recompute pgoff if the given page is huge, since a result
    based on HPAGE_SIZE is not appropriate for scanning the vma interval
    tree, as shown by commit 36e4f20af833 ("hugetlb: do not use
    vma_hugecache_offset() for vma_prio_tree_foreach").

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Feb, 2013

1 commit

  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renames only anon_vma_lock()

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

14 Feb, 2013

1 commit

  • The s390 architecture is unique in respect to dirty page detection,
    it uses the change bit in the per-page storage key to track page
    modifications. All other architectures track dirty bits by means
    of page table entries. This property of s390 has caused numerous
    problems in the past, e.g. see git commit ef5d437f71afdf4a
    "mm: fix XFS oops due to dirty pages without buffers on s390".

    To avoid future issues with per-page dirty bits, convert s390 to a
    fault-based software dirty bit detection mechanism. All user page
    table entries which are marked as clean will be hardware read-only,
    even if the pte is supposed to be writable. A write by the user
    process will trigger a protection fault, which will cause the user
    pte to be marked as dirty and the hardware read-only bit to be
    removed.

    With this change the dirty bit in the storage key is irrelevant
    for Linux as a host, but the storage key is still required for
    KVM guests. The effect is that page_test_and_clear_dirty and the
    related code can be removed. The referenced bit in the storage
    key is still used by the page_test_and_clear_young primitive to
    provide page age information.

    For page cache pages of mappings with mapping_cap_account_dirty
    there will not be any change in behavior as the dirty bit tracking
    already uses read-only ptes to control the amount of dirty pages.
    Only for swap cache pages and pages of mappings without
    mapping_cap_account_dirty there can be additional protection faults.
    To avoid an excessive number of additional faults the mk_pte
    primitive checks for PageDirty if the pgprot value allows for writes
    and pre-dirties the pte. That avoids all additional faults for
    tmpfs and shmem pages until these pages are added to the swap cache.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is
    bad but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore
    batch-handles PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • Memory error handling on hugepages can break an RSS counter, which
    emits a message like "Bad rss-counter state mm:ffff88040abecac0 idx:1
    val:-1". This is because PageAnon returns true for hugepages (this
    behavior is necessary for reverse mapping to work on hugetlbfs).

    [akpm@linux-foundation.org: clean up code layout]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

12 Dec, 2012

1 commit

  • Add comments noting that the dirty bit in the storage key gets set
    whenever page content is changed. Hopefully, if someone uses this
    function, they'll have a look at one of the two places where we
    comment on this.

    Signed-off-by: Jan Kara
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara