24 Feb, 2013

40 commits

  • An inconsistency emerged in reviewing the NUMA node changes to KSM: when
    meeting a page from the wrong NUMA node in a stable tree, we say that
    it's okay for comparisons, but not as a leaf for merging; whereas when
    meeting a page from the wrong NUMA node in an unstable tree, we bail out
    immediately.

    Now, it might be that a wrong NUMA node in an unstable tree is more
    likely to correlate with instability (different content, with the rbnode
    now misplaced) than with page migration; but even so, we are accustomed
    to instability in the unstable tree.

    Without strong evidence for which strategy is generally better, I'd
    rather be consistent with what's done in the stable tree: accept a page
    from the wrong NUMA node for comparison, but not as a leaf for merging.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Added slightly more detail to the Documentation of merge_across_nodes, a
    few comments in areas indicated by review, and renamed get_ksm_page()'s
    argument from "locked" to "lock_it". No functional change.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Fix several mempolicy leaks in the tmpfs mount logic. These leaks are
    slow - on the order of one object leaked per mount attempt.

    Leak 1 (umount doesn't free mpol allocated in mount):
    while true; do
      mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
      umount /mnt
    done

    Leak 2 (errors parsing remount options will leak mpol):
    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
      mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

    Leak 3 (multiple mpol per mount leak mpol):
    while true; do
      mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
      umount /mnt
    done

    This patch fixes all of the above. I could have broken the patch into
    three pieces but it seemed easier to review as one.

    [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
    option is not specified in the remount request. A new policy can be
    specified if mpol=M is given.

    Before this patch, remounting an mpol-bound tmpfs without specifying the
    mpol= mount option in the remount request would set the filesystem's
    mempolicy object to a freed mempolicy object.

    To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
    # mkdir /tmp/x

    # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0

    # mount -o remount,size=200M nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
    # note ? garbage in mpol=... output above

    # dd if=/dev/zero of=/tmp/x/f count=1
    # panic here

    Panic:
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [< (null)>] (null)
    [...]
    Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
    Call Trace:
    mpol_shared_policy_init+0xa5/0x160
    shmem_get_inode+0x209/0x270
    shmem_mknod+0x3e/0xf0
    shmem_create+0x18/0x20
    vfs_create+0xb5/0x130
    do_last+0x9a1/0xea0
    path_openat+0xb3/0x4d0
    do_filp_open+0x42/0xa0
    do_sys_open+0xfe/0x1e0
    compat_sys_open+0x1b/0x20
    cstar_dispatch+0x7/0x1f

    Non-debug kernels will not crash immediately because referencing the
    dangling mpol will not cause a fault. Instead the filesystem will
    reference a freed mempolicy object, which will cause unpredictable
    behavior.

    The problem boils down to a dropped mpol reference below if
    shmem_parse_options() does not allocate a new mpol:

    config = *sbinfo
    shmem_parse_options(data, &config, true)
    mpol_put(sbinfo->mpol)
    sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */

    This patch avoids the crash by not releasing the mempolicy if
    shmem_parse_options() doesn't create a new mpol.
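
    The pattern is easier to see in isolation. The toy model below is plain
    userspace C, not the kernel code: the struct and function names are made
    up for illustration. It mimics the fixed remount path, where the old
    policy is only dropped and replaced when the parse step actually
    produced a new one.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-ins for mempolicy refcounting -- not the kernel's types. */
    struct toy_mpol { int refcount; };

    static void toy_mpol_put(struct toy_mpol *p)
    {
        if (p && --p->refcount == 0)
            free(p);
    }

    struct toy_sbinfo { struct toy_mpol *mpol; };
    struct toy_config { struct toy_mpol *mpol; };

    /* Pretend parser: only allocates a policy when "mpol=" was supplied. */
    static void toy_parse_options(int mpol_given, struct toy_config *config)
    {
        config->mpol = NULL;
        if (mpol_given) {
            config->mpol = calloc(1, sizeof(*config->mpol));
            if (config->mpol)
                config->mpol->refcount = 1;
        }
    }

    /* Fixed remount logic: keep the old policy unless a new one was parsed. */
    static void toy_remount(struct toy_sbinfo *sbinfo, int mpol_given)
    {
        struct toy_config config = { .mpol = NULL };

        toy_parse_options(mpol_given, &config);
        if (config.mpol) {                 /* a new policy was supplied */
            toy_mpol_put(sbinfo->mpol);    /* drop the old reference */
            sbinfo->mpol = config.mpol;    /* and install the new one */
        }                                  /* otherwise leave sbinfo->mpol alone */
    }

    int main(void)
    {
        struct toy_sbinfo sb = { .mpol = calloc(1, sizeof(struct toy_mpol)) };
        sb.mpol->refcount = 1;

        toy_remount(&sb, 0);   /* remount without mpol=: old policy survives */
        toy_remount(&sb, 1);   /* remount with mpol=: old dropped, new installed */
        printf("final refcount: %d\n", sb.mpol->refcount);
        toy_mpol_put(sb.mpol);
        return 0;
    }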

    How far back does this issue go? I see it in both 2.6.36 and 3.3. I did
    not look back further.

    Signed-off-by: Greg Thelen
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Rob van der Heij reported the following (paraphrased) on private mail.

    The scenario is that I want to avoid backups filling up the page
    cache and purging stuff that is more likely to be used again (this is
    with s390x Linux on z/VM, so I don't give it so much memory that
    we no longer care). So I have something with LD_PRELOAD that
    intercepts the close() call (from tar, in this case) and issues
    a posix_fadvise() just before closing the file.

    This mostly works, except for small files (less than 14 pages)
    that remain in the page cache after the fact.

    Unfortunately Rob has not had a chance to test this exact patch but the
    test program below should be reproducing the problem he described.

    The issue is the per-cpu pagevecs for LRU additions. If the pages are
    added by one CPU but fadvise() is called on another then the pages
    remain resident as the invalidate_mapping_pages() only drains the local
    pagevecs via its call to pagevec_release(). The user-visible effect is
    that a program that uses fadvise() properly is not obeyed.

    A possible fix for this is to put the necessary smarts into
    invalidate_mapping_pages() to globally drain the LRU pagevecs if a
    pagevec page could not be discarded. The downside is that an inode
    cache shrink would then send a global IPI, and memory pressure
    potentially causing global IPI storms is very undesirable.

    Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
    check if invalidate_mapping_pages() discarded all the requested pages.
    If a subset of pages are discarded it drains the LRU pagevecs and tries
    again. If the second attempt fails, it assumes it is due to the pages
    being mapped, locked or dirty and does not care. With this patch, an
    application using fadvise() correctly will be obeyed but there is a
    downside that a malicious application can force the kernel to send
    global IPIs and increase overhead.

    If accepted, I would like this to be considered as a -stable candidate.
    It's not an urgent issue, but it's a system call that is not working as
    advertised, which is weak.

    The following test program demonstrates the problem. It should never
    report that pages are still resident but will without this patch. It
    assumes that CPU 0 and 1 exist.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sched.h>
    #include <sys/mman.h>

    /* Test file size in pages; 14 matches the "less than 14 pages" report */
    #ifndef FILESIZE_PAGES
    #define FILESIZE_PAGES 14
    #endif

    int main() {
        int fd;
        int pagesize = getpagesize();
        ssize_t written = 0, expected;
        char *buf;
        unsigned char *vec;
        int resident, i;
        cpu_set_t set;

        /* Prepare a buffer for writing */
        expected = FILESIZE_PAGES * pagesize;
        buf = malloc(expected + 1);
        if (buf == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }
        buf[expected] = 0;
        memset(buf, 'a', expected);

        /* Prepare the mincore vec */
        vec = malloc(FILESIZE_PAGES);
        if (vec == NULL) {
            printf("ENOMEM\n");
            exit(EXIT_FAILURE);
        }

        /* Bind ourselves to CPU 0 */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* open file, unlink and write buffer */
        fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600); /* mode needed with O_CREAT */
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }
        unlink("fadvise-test-file");
        while (written < expected) {
            ssize_t this_write;
            this_write = write(fd, buf + written, expected - written);

            if (this_write == -1) {
                perror("write");
                exit(EXIT_FAILURE);
            }

            written += this_write;
        }
        free(buf);

        /*
         * Force ourselves to another CPU. If fadvise only flushes the local
         * CPU's pagevecs then the fadvise will fail to discard all file pages.
         */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        /* sync and fadvise to discard the page cache */
        fsync(fd);
        if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
            perror("posix_fadvise");
            exit(EXIT_FAILURE);
        }

        /* map the file and use mincore to see which parts of it are resident */
        buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {    /* mmap reports failure as MAP_FAILED, not NULL */
            perror("mmap");
            exit(EXIT_FAILURE);
        }
        if (mincore(buf, expected, vec) == -1) {
            perror("mincore");
            exit(EXIT_FAILURE);
        }

        /* Check residency */
        for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
            if (vec[i])
                resident++;
        }
        if (resident != 0) {
            printf("Nr unexpected pages resident: %d\n", resident);
            exit(EXIT_FAILURE);
        }

        munmap(buf, expected);
        close(fd);
        free(vec);
        exit(EXIT_SUCCESS);
    }

    Signed-off-by: Mel Gorman
    Reported-by: Rob van der Heij
    Tested-by: Rob van der Heij
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • We at SGI have a need to address some very high physical address ranges
    with our GRU (global reference unit), sometimes across partitioned
    machine boundaries and sometimes with larger addresses than the cpu
    supports. We do this with the aid of our own 'extended vma' module
    which mimics the vma. When something (either unmap or exit) frees an
    'extended vma' we use the mmu notifiers to clean them up.

    We had been able to mimic the functions
    __mmu_notifier_invalidate_range_start() and
    __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
    walking the per-mm notifier list. But with the change to a global srcu
    lock (static in mmu_notifier.c) we can no longer do that. Our module has
    no access to that lock.

    So we request that these two functions be exported.

    Signed-off-by: Cliff Wickman
    Acked-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This change adds a follow_page_mask function which is equivalent to
    follow_page, but with an extra page_mask argument.

    follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
    a THP page, and to 0 in other cases.

    __get_user_pages() makes use of this in order to accelerate populating
    THP ranges - that is, when both the pages and vmas arrays are NULL, we
    don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
    we also avoid taking mm->page_table_lock that many times).
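
    As a rough illustration of the batching, the arithmetic below is a
    standalone userspace sketch: PAGE_SHIFT and HPAGE_PMD_NR are hard-coded
    to the common x86_64 values of 12 and 512, and the increment formula is
    assumed to mirror what the __get_user_pages() loop does with page_mask.
    It shows how many subpages can be stepped over at once when the loop
    lands inside a THP.

    #include <stdio.h>

    #define PAGE_SHIFT   12    /* assumed: 4K base pages */
    #define HPAGE_PMD_NR 512   /* assumed: 2M THP on x86_64 */

    /*
     * Assumed skip formula: when follow_page_mask() reports page_mask =
     * HPAGE_PMD_NR - 1 for a THP, the caller can consume every remaining
     * subpage of that THP in one step instead of looping HPAGE_PMD_NR times.
     */
    static unsigned long page_increment(unsigned long addr, unsigned long page_mask)
    {
        return 1 + (~(addr >> PAGE_SHIFT) & page_mask);
    }

    int main(void)
    {
        unsigned long thp_mask = HPAGE_PMD_NR - 1;

        /* At the start of a THP: all 512 subpages are covered at once. */
        printf("%lu\n", page_increment(0x200000, thp_mask));   /* 512 */
        /* Halfway into a THP: only the remaining 256 subpages are covered. */
        printf("%lu\n", page_increment(0x300000, thp_mask));   /* 256 */
        /* Ordinary page (page_mask == 0): advance one page as before. */
        printf("%lu\n", page_increment(0x300000, 0));          /* 1 */
        return 0;
    }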

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
    nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
    are horribly badly named, so accurately document them with code comments
    to guard against misuse.

    [akpm@linux-foundation.org: tweak comments]
    Reviewed-by: Randy Dunlap
    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
    error_states[] has two separate states, "unevictable LRU page" and
    "mlocked LRU page", and the former currently has the higher priority.
    Because of that, the latter is rarely chosen, since pages with
    PageMlocked very likely have PG_unevictable set as well. On the other
    hand, PG_unevictable without PageMlocked is common for ramfs or
    SHM_LOCKed shared memory, so reversing the priority of these two states
    helps us clearly distinguish them.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Gong
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    memory_failure() can't handle memory errors on mlocked pages correctly,
    because page_action() classifies such errors as ones on "unknown pages"
    instead of ones on an "unevictable LRU page" or "mlocked LRU page". To
    determine the page_state, page_action() checks the page flags at the
    time of the judgement, but those flags are not the same as the ones just
    after memory_failure() is called, because memory_failure() unmaps the
    error pages before calling page_action(). The unmapping changes the page
    state; in particular page_remove_rmap() (called from try_to_unmap_one())
    clears PG_mlocked, so page_action() can't catch mlocked pages after that.

    With this patch, we store the page flags of the error page before
    unmapping, and (only) if the first check with the page flags at that
    time decides the error page is unknown, we do a second check with the
    stored page flags. This implementation doesn't change error handling
    for the page types for which the first check can determine the page
    state correctly.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
    I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
    "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
    used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This member of struct virtio_chan is calculated from nr_free_buffer_pages
    so change its type to unsigned long in case of overflow.

    Signed-off-by: Zhang Yanfei
    Cc: David Miller
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This variable is calculated from nr_free_pagecache_pages so
    change its type to unsigned long.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The three variables are calculated from nr_free_buffer_pages so change
    their types to unsigned long in case of overflow.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • max_buffer_heads is calculated from nr_free_buffer_pages(), so change
    its type to unsigned long in case of overflow.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Now the function nr_free_buffer_pages returns unsigned long, so use %ld
    to print its return value.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
    Currently, the amount of RAM that the nr_free_*_pages functions return
    is held in an unsigned int. But on machines with big memory (exceeding
    16TB), the amount may be incorrect because of overflow, so fix it.
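
    The overflow is easy to demonstrate with plain arithmetic: with 4K
    pages, a page count for 16TB of RAM no longer fits in 32 bits. A quick
    standalone check (assuming a 4K page size):

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        unsigned long long bytes = 16ULL << 40;           /* 16TB of RAM */
        unsigned long long pages = bytes >> 12;           /* 4K pages */

        printf("pages    = %llu\n", pages);               /* 4294967296 */
        printf("UINT_MAX = %u\n", UINT_MAX);              /* 4294967295 */
        printf("as uint  = %u\n", (unsigned int)pages);   /* wraps to 0 */
        return 0;
    }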

    Signed-off-by: Zhang Yanfei
    Cc: Simon Horman
    Cc: Julian Anastasov
    Cc: David Miller
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
    We should encourage all memcg controller initialization that is
    independent of a specific mem_cgroup to be done here rather than
    exploiting the css_alloc callback and assuming that nothing happens
    before the root cgroup is created.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • memcg_stock are currently initialized during the root cgroup allocation
    which is OK but it pointlessly pollutes memcg allocation code with
    something that can be called when the memcg subsystem is initialized by
    mem_cgroup_init along with other controller specific parts.

    This patch wraps the current memcg_stock initialization code into a
    helper and calls it from the controller subsystem initialization code.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Per-node-zone soft limit tree is currently initialized when the root
    cgroup is created which is OK but it pointlessly pollutes memcg
    allocation code with something that can be called when the memcg
    subsystem is initialized by mem_cgroup_init along with other controller
    specific parts.

    While we are at it, let's make mem_cgroup_soft_limit_tree_init void: it
    doesn't make much sense to report a memory allocation failure, because
    if we fail to allocate memory that early during boot then we are
    screwed anyway (and this saves some code).

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Recently, Luigi reported that there is lots of free swap space when OOM
    happens. It's easily reproduced on zram-over-swap, where many instances
    of memory hogs are running and laptop_mode is enabled. He said there
    was no problem when he disabled laptop_mode. What I found when
    investigating the problem is the following.

    Assumption for easy explanation: there are no page cache pages in the
    system because they have all already been reclaimed.

    1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
    2. shrink_inactive_list isolates victim pages from the inactive anon lru list.
    3. shrink_page_list adds them to the swapcache via add_to_swap but doesn't
    page them out because sc->may_writepage is 0, so the pages are rotated back
    onto the inactive anon lru list. add_to_swap marked them dirty via SetPageDirty.
    4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
    priority and retries reclaim with higher priority.
    5. shrink_inactive_list tries to isolate victim pages from the inactive anon
    lru list but fails, because it isolates with ISOLATE_CLEAN mode while the
    inactive anon lru list is full of dirty pages from step 3, so it just returns
    without any reclaim progress.
    6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned,
    because sc->nr_scanned is only increased by shrink_page_list and we never
    reach shrink_page_list in step 5 for lack of isolated pages.

    Above loop is continued until OOM happens.

    The problem didn't happen before [1] was merged because the old logic's
    isolation in shrink_inactive_list succeeded and shrink_page_list was
    called to page the pages out, which still failed because of
    may_writepage. But the important point is that sc->nr_scanned was
    increased even though we couldn't swap them out, so do_try_to_free_pages
    could set may_writepage.

    Since commit f80c0673610e ("mm: zone_reclaim: make isolate_lru_page()
    filter-aware") was introduced, it's no longer a good idea to depend only
    on the number of scanned pages for setting may_writepage. So this patch
    adds a new trigger point for setting may_writepage: a priority below
    DEF_PRIORITY - 2, which indicates significant memory pressure in the VM
    and so fits our purpose, where it is better to lose some power saving
    (or put up with some disk clatter) than to OOM kill.

    Signed-off-by: Minchan Kim
    Reported-by: Luigi Semenzato
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Make a sweep through mm/ and convert code that uses -1 directly to using
    the more appropriate NUMA_NO_NODE.
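
    For illustration, the conversion is purely mechanical; a hypothetical
    before/after is sketched below (NUMA_NO_NODE is defined as -1 in the
    kernel headers, mirrored here so the snippet stands alone):

    #include <stdio.h>

    #define NUMA_NO_NODE (-1)   /* mirrors include/linux/numa.h */

    static const char *describe(int nid)
    {
        /* Before: if (nid == -1)  --  After: */
        if (nid == NUMA_NO_NODE)
            return "no node preference";
        return "bound to a node";
    }

    int main(void)
    {
        printf("%s\n", describe(NUMA_NO_NODE));
        printf("%s\n", describe(0));
        return 0;
    }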

    Signed-off-by: David Rientjes
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result of a
    filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

              A (unregister)                B (release)
    t1                                      srcu_read_lock()
    t2        if (!hlist_unhashed())
    t3                                      srcu_read_unlock()
    t4        srcu_read_lock()
    t5                                      hlist_del_init_rcu()
    t6                                      synchronize_srcu()
    t7        srcu_read_unlock()
    t8        hlist_del_rcu()       <--- NULL pointer deref.

    Additionally, the list traversal in __mmu_notifier_release() is not
    protected by the mmu_notifier_mm hlist lock, which can result in
    callouts to the ->release() notifier from both mmu_notifier_unregister()
    and __mmu_notifier_release().

    -stable suggestions:

    The stable trees prior to 3.7.y need commits 21a92735f660 and
    70400303ce0c cherry-picked in that order prior to cherry-picking this
    commit. The 3.7.y tree already has those two commits.

    Signed-off-by: Robin Holt
    Cc: Andrea Arcangeli
    Cc: Wanpeng Li
    Cc: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
    Replace open-coded pgdat_end_pfn() with the helper function.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
    Remove open coding of ensure_zone_is_initialized().

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
    ensure_zone_is_initialized() checks whether a zone is in an empty,
    uninitialized state (as typically occurs after it is created during
    memory hotplug) and, if so, calls init_currently_empty_zone() to
    initialize the zone.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Add a debug message which prints when a page is found outside of the
    boundaries of the zone it should belong to. Format is:
    "page $pfn outside zone [ $start_pfn - $end_pfn ]"

    [akpm@linux-foundation.org: s/pr_debug/pr_err/]
    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
    zone_*() functions.

    Change node_end_pfn() to be a wrapper of pgdat_end_pfn().
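
    A minimal sketch of what such helpers typically look like, using a
    cut-down stand-in for the node descriptor (only the two span fields,
    not the kernel's full pg_data_t):

    #include <stdio.h>
    #include <stdbool.h>

    /* Cut-down stand-in for the node descriptor: just the span fields. */
    struct toy_pgdat {
        unsigned long node_start_pfn;
        unsigned long node_spanned_pages;
    };

    /* One past the last pfn the node spans. */
    static unsigned long pgdat_end_pfn(struct toy_pgdat *pgdat)
    {
        return pgdat->node_start_pfn + pgdat->node_spanned_pages;
    }

    /* A node with no spanned pages is empty. */
    static bool pgdat_is_empty(struct toy_pgdat *pgdat)
    {
        return !pgdat->node_spanned_pages;
    }

    int main(void)
    {
        struct toy_pgdat node = { .node_start_pfn = 0x100000,
                                  .node_spanned_pages = 0x40000 };

        printf("end pfn: %lx, empty: %d\n",
               pgdat_end_pfn(&node), pgdat_is_empty(&node));
        return 0;
    }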

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Freeing pages to uninitialized zones is not handled by
    __free_one_page(), and should never happen when the code is correct.

    Ran into this while writing some code that dynamically onlines extra
    zones.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Factoring out these 2 checks makes it more clear what we are actually
    checking for.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn
    & spanned_pages will need fixing, either by actually using the seqlock
    or by clever memory barrier usage.
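
    A similar sketch of the two helpers, again over a toy zone struct
    carrying just the relevant fields (not the kernel definitions
    themselves):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy zone: just the start pfn and the number of pfns spanned. */
    struct toy_zone {
        unsigned long zone_start_pfn;
        unsigned long spanned_pages;
    };

    /* One past the last pfn the zone spans. */
    static unsigned long zone_end_pfn(const struct toy_zone *zone)
    {
        return zone->zone_start_pfn + zone->spanned_pages;
    }

    /* Does this pfn fall within the zone's span? */
    static bool zone_spans_pfn(const struct toy_zone *zone, unsigned long pfn)
    {
        return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
    }

    int main(void)
    {
        struct toy_zone dma32 = { .zone_start_pfn = 0x1000,
                                  .spanned_pages = 0xff000 };

        printf("%d %d\n", zone_spans_pfn(&dma32, 0x2000),
                          zone_spans_pfn(&dma32, 0x200000));
        return 0;
    }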

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Instead of directly utilizing a combination of config options to determine
    this, add a macro to specifically address it.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • The fact that mlock calls get_user_pages, and get_user_pages might call
    mlock when expanding a stack looks like a potential recursion.

    However, mlock makes sure the requested range is already contained
    within a vma, so no stack expansion will actually happen from mlock.

    Should this ever change: the stack expansion mlocks only the newly
    expanded range and so will not result in recursive expansion.

    Signed-off-by: Johannes Weiner
    Reported-by: Al Viro
    Cc: Hugh Dickins
    Acked-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • An inactive file list is considered low when its active counterpart is
    bigger, regardless of whether it is a global zone LRU list or a memcg
    zone LRU list. The only difference is in how the LRU size is assessed.

    get_lru_size() does the right thing for both global and memcg reclaim
    situations.

    Get rid of inactive_file_is_low_global() and
    mem_cgroup_inactive_file_is_low() by using get_lru_size() and comparing
    the numbers in common code.
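
    A rough sketch of the resulting common check, using toy types and a
    stubbed get_lru_size() (the real function operates on a struct lruvec
    and the kernel's LRU statistics):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy LRU lists and per-lruvec sizes standing in for the kernel's. */
    enum toy_lru { LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, NR_TOY_LRU };

    struct toy_lruvec { unsigned long size[NR_TOY_LRU]; };

    /* Stub: in the kernel this consults zone or memcg statistics. */
    static unsigned long get_lru_size(struct toy_lruvec *lruvec, enum toy_lru lru)
    {
        return lruvec->size[lru];
    }

    /* The inactive file list is low when the active list is bigger. */
    static bool inactive_file_is_low(struct toy_lruvec *lruvec)
    {
        unsigned long inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
        unsigned long active = get_lru_size(lruvec, LRU_ACTIVE_FILE);

        return active > inactive;
    }

    int main(void)
    {
        struct toy_lruvec v = { .size = { [LRU_INACTIVE_FILE] = 100,
                                          [LRU_ACTIVE_FILE]   = 400 } };

        printf("inactive file low: %d\n", inactive_file_is_low(&v));
        return 0;
    }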

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In shmem_find_get_pages_and_swap(), use the faster radix tree iterator
    construct from commit 78c1d78488a3 ("radix-tree: introduce bit-optimized
    iterator").

    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Complaints are rare, but lockdep still does not understand the way
    ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
    it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
    problem because notifier callbacks are made under down_read of
    blocking_notifier_head->rwsem (so first the mutex is taken while holding
    the rwsem, then later the rwsem is taken while still holding the mutex);
    but is not in fact a problem because mem_hotplug_mutex is held
    throughout the dance.

    There was an attempt to fix this with mutex_lock_nested(); but if that
    happened to fool lockdep two years ago, apparently it does so no longer.

    I had hoped to eradicate this issue in extending KSM page migration not
    to need the ksm_thread_mutex. But then realized that although the page
    migration itself is safe, we do still need to lock out ksmd and other
    users of get_ksm_page() while offlining memory - at some point between
    MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
    vanish, and get_ksm_page()'s accesses to them become a violation.

    So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
    MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
    checks, to achieve the same lockout without being caught by lockdep.
    This is less elegant for KSM, but it's more important to keep lockdep
    useful to other users - and I apologize for how long it took to fix.

    Signed-off-by: Hugh Dickins
    Reported-by: Gerald Schaefer
    Tested-by: Gerald Schaefer
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, so remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration of KSM pages is now safe: remove the PageKsm restrictions from
    mempolicy.c and migrate.c.

    But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
    are irrelevant to KSM: it looks as if that code was preventing hotremove
    migration of KSM pages, unless they happened to be in swapcache.

    There is some question as to whether enforcing a NUMA mempolicy migration
    ought to migrate KSM pages, mapped into entirely unrelated processes; but
    moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
    and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
    any area where this is a worry.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
    set to non-default 0: if a KSM page is migrated to a different NUMA node,
    how do we migrate its stable node to the right tree? And what if that
    collides with an existing stable node?

    ksm_migrate_page() can do no more than it's already doing, updating
    stable_node->kpfn: the stable tree itself cannot be manipulated without
    holding ksm_thread_mutex. So accept that a stable tree may temporarily
    indicate a page belonging to the wrong NUMA node, leave updating until the
    next pass of ksmd, just be careful not to merge other pages on to a
    misplaced page. Note nid of holding tree in stable_node, and recognize
    that it will not always match nid of kpfn.

    A misplaced KSM page is discovered, either when ksm_do_scan() next comes
    around to one of its rmap_items (we now have to go to cmp_and_merge_page
    even on pages in a stable tree), or when stable_tree_search() arrives at a
    matching node for another page, and this node page is found misplaced.

    In each case, move the misplaced stable_node to a list of migrate_nodes
    (and use the address of migrate_nodes as magic by which to identify them):
    we don't need them in a tree. If stable_tree_search() finds no match for
    a page, but it's currently exiled to this list, then slot its stable_node
    right there into the tree, bringing all of its mappings with it; otherwise
    they get migrated one by one to the original page of the colliding node.
    stable_tree_search() is now modelled more like stable_tree_insert(), in
    order to handle these insertions of migrated nodes.

    remove_node_from_stable_tree(), remove_all_stable_nodes() and
    ksm_check_stable_tree() have to handle the migrate_nodes list as well as
    the stable tree itself. Less obviously, we do need to prune the list of
    stale entries from time to time (scan_get_next_rmap_item() does it once
    each full scan): whereas stale nodes in the stable tree get naturally
    pruned as searches try to brush past them, these migrate_nodes may get
    forgotten and accumulate.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins