23 Oct, 2015

1 commit

  • commit 3aaa76e125c1dd58c9b599baa8c6021896874c12 upstream.

    Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    each hugetlb page maintains its active flag to avoid a race condition
    between multiple calls of isolate_huge_page(), but current kernel
    doesn't set the flag on a hugepage allocated by migration because the
    proper putback routine isn't called. This means that users could
    still encounter the race referred to by bcc54222309c in this special
    case, so this patch fixes it.
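
    A rough sketch of the idea, not the literal upstream diff: instead of
    dropping the destination hugepage with a bare put_page(), route it through
    the putback helper from bcc54222309c so the active flag gets set (new_hpage
    here stands for the freshly allocated destination page):

    /* end of unmap_and_move_huge_page(), simplified */
    if (rc == MIGRATEPAGE_SUCCESS)
            /* mark the new hugepage active and put it on the hstate list */
            putback_active_hugepage(new_hpage);
    else
            put_page(new_hpage);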

    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

16 Apr, 2015

1 commit

  • With the page flag sanitization patchset, an invalid usage of
    ClearPageSwapCache() is detected in migrate_page_copy().
    migrate_page_copy() is shared by both normal and hugepage (both thp and
    hugetlb) code path, so let's check PageSwapCache() and clear it if it's
    set to avoid misuse of the invalid clear operation.
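
    A minimal sketch of the check-before-clear described above, with the
    surrounding copy code elided:

    /* in migrate_page_copy(), simplified */
    if (PageSwapCache(page))
            ClearPageSwapCache(page);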

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

15 Apr, 2015

2 commits

  • This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
    scanning even if migrate rate limited") so remove it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) :

    mm/migrate.c: In function `migrate_pages':
    mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.
    Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
    make[1]: *** [mm/migrate.o] Error 1
    make: *** [mm/migrate.o] Error 2

    Mark unmap_and_move() (which is used in a single place only) "noinline"
    to work around this compiler bug.
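
    One way to express such a compiler-version-gated workaround, as a sketch
    (the exact version bounds and macro name used upstream may differ):

    /* only force noinline for the affected gcc releases on ARM */
    #if defined(CONFIG_ARM) && \
        GCC_VERSION >= 40700 && GCC_VERSION < 40900
    #define ICE_noinline noinline
    #else
    #define ICE_noinline
    #endif

    static ICE_noinline int unmap_and_move(/* ...arguments unchanged... */)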

    [akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
    [khilman@kernel.org: fine-tune compiler versions]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Geert Uytterhoeven
    Reported-by: Kevin Hilman
    Cc: Marc Zyngier
    Tested-by: Kevin Hilman
    Tested-by: Lina Iyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

13 Feb, 2015

2 commits

  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Automatic NUMA balancing depends on being able to protect PTEs to trap a
    fault and gather reference locality information. Very broadly speaking
    it would mark PTEs as not present and use another bit to distinguish
    between NUMA hinting faults and other types of faults. It was
    universally loved by everybody and caused no problems whatsoever. That
    last sentence might be a lie.

    This series is very heavily based on patches from Linus and Aneesh to
    replace the existing PTE/PMD NUMA helper functions with normal change
    protections. I did alter and add parts of it but I consider them
    relatively minor contributions. At their suggestion, acked-bys are in
    there but I've no problem converting them to Signed-off-by if requested.

    AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
    for that. I tested with trinity under kvm-tool, which passed, and ran a few
    other basic tests. At the time of writing, only the short-lived tests
    have completed but testing of V2 indicated that long-term testing had no
    surprises. In most cases I'm leaving out detail as it's not that
    interesting.

    specjbb single JVM: There was negligible performance difference in the
    benchmark itself for short runs. However, system activity is
    higher and interrupts are much higher over time -- possibly TLB
    flushes. Migrations are also higher. Overall, this is more overhead
    but considering the problems faced with the old approach I think
    we just have to suck it up and find another way of reducing the
    overhead.

    specjbb multi JVM: Negligible performance difference to the actual benchmark
    but like the single JVM case, the system overhead is noticeably
    higher. Again, interrupts are a major factor.

    autonumabench: This was all over the place and about all that can be
    reasonably concluded is that it's different but not necessarily
    better or worse.

    autonumabench
                                      3.18.0-rc5             3.18.0-rc5
                                  mmotm-20141119          protnone-v3r3
    User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
    User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
    User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
    User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
    System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
    System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
    System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
    System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
    Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
    Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
    Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
    Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
    CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
    CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
    CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
    CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)

    System CPU usage of NUMA01 is worse but it's an adverse workload on this
    machine so I'm reluctant to conclude that it's a problem that matters. On
    the other workloads that are sensible on this machine, system CPU usage is
    great. Overall time to complete the benchmark is comparable.

                                3.18.0-rc5       3.18.0-rc5
                            mmotm-20141119    protnone-v3r3
    User                      59612.50         48586.44
    System                      460.22          1537.45
    Elapsed                    1442.20          1304.29

    NUMA alloc hit             5075182          5743353
    NUMA alloc miss                  0                0
    NUMA interleave hit              0                0
    NUMA alloc local           5075174          5743339
    NUMA base PTE updates    637061448        443106883
    NUMA huge PMD updates      1243434           864747
    NUMA page range updates 1273699656        885857347
    NUMA hint faults           1658116          1214277
    NUMA hint local faults      959487           754113
    NUMA hint local percent         57               62
    NUMA pages migrated        5467056         61676398

    The NUMA pages migrated look terrible but when I looked at a graph of the
    activity over time I see that the massive spike in migration activity was
    during NUMA01. This correlates with high system CPU usage and could be
    simply down to bad luck but any modifications that affect that workload
    would be related to scan rates and migrations, not the protection
    mechanism. For all other workloads, migration activity was comparable.

    Overall, headline performance figures are comparable but the overhead is
    higher, mostly in interrupts. To some extent, higher overhead from this
    approach was anticipated but not to this degree. It's going to be
    necessary to reduce this again with a separate series in the future. It's
    still worth going ahead with this series though as it's likely to avoid
    constant headaches with Xen and is probably easier to maintain.

    This patch (of 10):

    A transhuge NUMA hinting fault may find the page is migrating and should
    wait until migration completes. The check is race-prone because the pmd
    is dereferenced outside of the page lock and, while the race is tiny, it'll
    be larger if the PMD is cleared while marking PMDs for hinting fault.
    This patch closes the race.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

1 commit

  • We have a race condition between move_pages() and freeing hugepages, where
    move_pages() calls follow_page(FOLL_GET) for hugepages internally and
    tries to get its refcount without preventing concurrent freeing. This
    race crashes the kernel, so this patch fixes it by moving the FOLL_GET code
    for hugepages into follow_huge_pmd() and taking the page table lock there.

    This patch intentionally removes page==NULL check after pte_page.
    This is justified because pte_page() never returns NULL for any
    architectures or configurations.

    This patch changes the behavior of follow_huge_pmd() for tail pages so
    that tail pages can be pinned/returned. The caller must therefore be
    changed to properly handle the returned tail pages.

    We could add similar locking to follow_huge_(addr|pud) for consistency,
    but it's not necessary because these functions currently don't support the
    FOLL_GET flag, so let's leave that for future development.
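
    A simplified sketch of follow_huge_pmd() taking the page table lock before
    grabbing the reference; migration-entry handling and the retry loop are
    elided and the exact upstream code differs:

    struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
                                 pmd_t *pmd, int flags)
    {
            spinlock_t *ptl = pmd_lockptr(mm, pmd);
            struct page *page = NULL;

            spin_lock(ptl);
            if (pmd_present(*pmd)) {
                    /* may be a tail page; callers must cope with that */
                    page = pmd_page(*pmd) +
                           ((address & ~PMD_MASK) >> PAGE_SHIFT);
                    if (flags & FOLL_GET)
                            get_page(page);
            }
            spin_unlock(ptl);
            return page;
    }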

    Here is the reproducer:

    $ cat movepages.c
    #include <stdlib.h>
    #include <err.h>
    #include <sys/types.h>
    #include <numa.h>
    #include <numaif.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000
    #define PS 0x1000

    int main(int argc, char *argv[]) {
            int i;
            int nr_hp = strtol(argv[1], NULL, 0);
            int nr_p  = nr_hp * HPS / PS;
            int ret;
            void **addrs;
            int *status;
            int *nodes;
            pid_t pid;

            pid = strtol(argv[2], NULL, 0);
            addrs  = malloc(sizeof(char *) * nr_p + 1);
            status = malloc(sizeof(char *) * nr_p + 1);
            nodes  = malloc(sizeof(char *) * nr_p + 1);

            while (1) {
                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 1;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err(1, "move_pages");

                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 0;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err(1, "move_pages");
            }
            return 0;
    }

    $ cat hugepage.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000

    int main(int argc, char *argv[]) {
            int nr_hp = strtol(argv[1], NULL, 0);
            char *p;

            while (1) {
                    p = mmap((void *)ADDR_INPUT, nr_hp * HPS,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                    if (p != (void *)ADDR_INPUT) {
                            perror("mmap");
                            break;
                    }
                    memset(p, 0, nr_hp * HPS);
                    munmap(p, nr_hp * HPS);
            }
    }

    $ sysctl vm.nr_hugepages=40
    $ ./hugepage 10 &
    $ ./movepages 10 $(pgrep -f hugepage)

    Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
    created for use on pages almost certainly mapped into userspace. But
    nowadays compaction often applies them to unmapped page cache pages: which
    may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
    try_to_unmap_file() makes no preliminary page_mapped() check.

    Now check page_mapped() in __unmap_and_move(); and avoid repeating the
    same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
    never inserted any.
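
    Roughly, the shape of the change (a sketch; page_was_mapped is an
    illustrative local flag and the real function carries more state):

    /* __unmap_and_move(), simplified */
    if (page_mapped(page)) {
            page_was_mapped = 1;
            try_to_unmap(page,
                         TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
    }
    rc = move_to_new_page(newpage, page, page_was_mapped, mode);

    if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped)
            /* only undo migration ptes that we actually inserted */
            remove_migration_ptes(page, page);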

    (The PageAnon(page) comment blocks now look even sillier than before, but
    clean that up on some other occasion. And note in passing that
    try_to_unmap_one() does not use a migration entry when PageSwapCache, so
    remove_migration_ptes() will then not update that swap entry to newpage
    pte: not a big deal, but something else to clean up later.)

    Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
    conversion to i_mmap_rwsem, that "The biggest winner of these changes is
    migration": a part of the reason might be all of that unnecessary taking
    of i_mmap_mutex in page migration; and it's rather a shame that I didn't
    get around to sending this patch in before his - this one is much less
    useful after Davidlohr's conversion to rwsem, but still good.

    Signed-off-by: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Nov, 2014

1 commit

  • Add calls to the new mmu_notifier_invalidate_range() function to all
    places in the VMM that need it.
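
    The pattern at a typical call site looks roughly like this sketch (the
    real call sites are spread across the VMM and vary in detail):

    mmu_notifier_invalidate_range_start(mm, start, end);
    /* ... modify the page tables and flush the CPU TLB for [start, end) ... */
    mmu_notifier_invalidate_range(mm, start, end);  /* flush secondary TLBs */
    mmu_notifier_invalidate_range_end(mm, start, end);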

    Signed-off-by: Joerg Roedel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Jérôme Glisse
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Jay Cornwall
    Cc: Oded Gabbay
    Cc: Suravee Suthikulpanit
    Cc: Jesse Barnes
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Oded Gabbay

    Joerg Roedel
     

10 Oct, 2014

1 commit

    Sasha Levin reported a KASAN splat inside isolate_migratepages_range().
    The problem is in the function __is_movable_balloon_page(), which tests
    AS_BALLOON_MAP in page->mapping->flags. This function has no protection
    against anonymous pages, so as a result it tried to check address space
    flags inside struct anon_vma.

    Further investigation shows more problems in current implementation:

    * The special branch in __unmap_and_move() never works:
    balloon_page_movable() checks page flags and page_count. In
    __unmap_and_move() the page is locked and its reference counter is
    elevated, thus balloon_page_movable() always fails. As a result execution
    goes to the normal migration path. virtballoon_migratepage() returns
    MIGRATEPAGE_BALLOON_SUCCESS instead of MIGRATEPAGE_SUCCESS,
    move_to_new_page() thinks this is an error code and assigns
    newpage->mapping to NULL. The newly migrated page loses its connection
    with the balloon and all ability for further migration.

    * lru_lock is erroneously required in isolate_migratepages_range() for
    isolating a ballooned page. This function releases lru_lock periodically,
    which makes migration mostly impossible for some pages.

    * balloon_page_dequeue has a tight race with balloon_page_isolate:
    balloon_page_isolate could be executed in parallel with dequeue between
    picking the page from the list and locking page_lock. The race is rare
    because they use trylock_page() for locking.

    This patch fixes all of them.

    Instead of a fake mapping with a special flag, this patch uses a special
    state of page->_mapcount: PAGE_BALLOON_MAPCOUNT_VALUE = -256. The buddy
    allocator uses PAGE_BUDDY_MAPCOUNT_VALUE = -128 for a similar purpose.
    Storing the mark directly in struct page makes everything safer and easier.
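
    A sketch of how such a _mapcount marker is used, mirroring the existing
    PageBuddy() helpers (names follow the description above; the upstream
    helpers may differ in detail):

    #define PAGE_BALLOON_MAPCOUNT_VALUE (-256)

    static inline int PageBalloon(struct page *page)
    {
            return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
    }

    static inline void __SetPageBalloon(struct page *page)
    {
            VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
            atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
    }

    static inline void __ClearPageBalloon(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageBalloon(page), page);
            atomic_set(&page->_mapcount, -1);
    }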

    PagePrivate is used to mark pages present in the page list (i.e. not
    isolated, like PageLRU for normal pages). It replaces the special rules
    for the reference counter and makes balloon migration similar to the
    migration of normal pages. This flag is protected by page_lock together
    with the link to the balloon device.

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Sasha Levin
    Link: http://lkml.kernel.org/p/53E6CEAA.9020105@oracle.com
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

03 Oct, 2014

1 commit

    A migration entry is marked as write if pte_write was true at the time the
    entry was created. The VMA protections are not double-checked when migration
    entries are being removed, as mprotect marks write-migration-entries as
    read. It means that potentially we take a spurious fault to mark PTEs
    writable again, but that is straightforward. However, there is a race between
    write migration entries being marked read and migrations finishing. This
    potentially allows a PTE to be writable that should have been read-only.
    Close this race by double-checking the VMA permissions using maybe_mkwrite
    when migration completes.
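
    A minimal sketch of the double-check on the removal side, simplified from
    remove_migration_pte():

    pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
    /* recheck the VMA: permissions can change while migration is in flight */
    if (is_write_migration_entry(entry))
            pte = maybe_mkwrite(pte, vma);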

    [torvalds@linux-foundation.org: use maybe_mkwrite]
    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Aug, 2014

1 commit

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Jul, 2014

1 commit

  • Shortly before 3.16-rc1, Dave Jones reported:

    WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
    xfs_vm_writepage+0x5ce/0x630 [xfs]()
    CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
    Call Trace:
    xfs_vm_writepage+0x5ce/0x630 [xfs]
    shrink_page_list+0x8f9/0xb90
    shrink_inactive_list+0x253/0x510
    shrink_lruvec+0x563/0x6c0
    shrink_zone+0x3b/0x100
    shrink_zones+0x1f1/0x3c0
    try_to_free_pages+0x164/0x380
    __alloc_pages_nodemask+0x822/0xc90
    alloc_pages_vma+0xaf/0x1c0
    handle_mm_fault+0xa31/0xc50
    etc.

    970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
    971 PF_MEMALLOC))

    I did not respond at the time, because a glance at the PageDirty block
    in shrink_page_list() quickly shows that this is impossible: we don't do
    writeback on file pages (other than tmpfs) from direct reclaim nowadays.
    Dave was hallucinating, but it would have been disrespectful to say so.

    However, my own /var/log/messages now shows similar complaints

    WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
    WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()

    from stressing some mmotm trees during July.

    Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
    so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
    to mapping->a_ops->writepage()?

    Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
    page freeing callback") has provided such a way to compaction: if
    migrating a SwapBacked page fails, its newpage may be put back on the
    list for later use with PageSwapBacked still set, and nothing will clear
    it.

    Whether that can do anything worse than issue WARN_ON_ONCEs, and get
    some statistics wrong, is unclear: easier to fix than to think through
    the consequences.

    Fixing it here, before the put_new_page(), addresses the bug directly,
    but is probably the worst place to fix it. Page migration is doing too
    many parts of the job on too many levels: fixing it in
    move_to_new_page() to complement its SetPageSwapBacked would be
    preferable, except why is it (and newpage->mapping and newpage->index)
    done there, rather than down in migrate_page_move_mapping(), once we are
    sure of success? Not a cleanup to get into right now, especially not
    with memcg cleanups coming in 3.17.

    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Jun, 2014

1 commit

  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush in
    there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's easily
    handled with an ACCESS_ONCE() before testing both conditions.

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.
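
    A sketch of the re-purposed helper along those lines (the pgd/pud walk and
    error handling are elided):

    pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
    {
            /* ... pgd/pud lookup elided ... */
            pmd_t *pmd = pmd_offset(pud, address);
            pmd_t pmde;

            /*
             * THP fault paths may do pmdp_clear_flush(); set_pmd_at() without
             * the anon_vma write lock, so read *pmd once and test both
             * conditions on that single snapshot.
             */
            pmde = ACCESS_ONCE(*pmd);
            if (!pmd_present(pmde) || pmd_trans_huge(pmde))
                    pmd = NULL;
            return pmd;
    }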

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2014

3 commits

    We already have a function named hugepages_supported(), and the similar
    name hugepage_migration_support() is a bit uncomfortable, so let's rename
    it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
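
    In interface terms the change is roughly the following sketch, where
    free_page_t is the new callback type alongside the existing new_page_t
    allocation callback (parameter names are illustrative):

    typedef void free_page_t(struct page *page, unsigned long private);

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);

    Callers with no special freeing needs pass NULL for put_new_page and keep
    the old behaviour.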

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Migration of misplaced transhuge pages uses page_add_new_anon_rmap() when
    putting the page back as it avoided atomic operations and added the new
    page to the correct LRU. A side-effect is that the page gets marked
    activated as part of the migration meaning that transhuge and base pages
    are treated differently from an aging perspective than base page
    migration.

    This patch uses page_add_anon_rmap() and putback_lru_page() on completion
    of a transhuge migration similar to base page migration. It would require
    fewer atomic operations to use lru_cache_add without taking an additional
    reference to the page. The downside would be that it's still different to
    base page migration and unevictable pages may be added to the wrong LRU
    for cleaning up later. Testing of the usual workloads did not show any
    adverse impact to the change.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Sasha Levin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 Mar, 2014

1 commit

  • Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
    little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
    indicating that remove_migration_ptes() failed to find one of the
    migration entries that was temporarily inserted.

    The problem comes from remap_file_pages()'s switch from vma_interval_tree
    (good for inserting the migration entry) to i_mmap_nonlinear list (no good
    for locating it again); but can only be a problem if the remap_file_pages()
    range does not cover the whole of the vma (zap_pte() clears the range).

    remove_migration_ptes() needs a file_nonlinear method to go down the
    i_mmap_nonlinear list, applying linear location to look for migration
    entries in those vmas too, just in case there was this race.

    The file_nonlinear method does need rmap_walk_control.arg to do this;
    but it never needed vma passed in - vma comes from its own iteration.

    Reported-and-tested-by: Dave Jones
    Reported-and-tested-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node too, but
    at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.
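
    Illustrated at a single call site (a simplified sketch; the real call
    sites combine these with additional __GFP_* modifiers):

    /* before: node-exclusive, never wakes kswapd or reclaims */
    newpage = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);

    /* after: still node-exclusive, but the slowpath and reclaim may run */
    newpage = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);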

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jan, 2014

2 commits

    Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
    over the cpupid at page migration time. It is unnecessary to set it
    again in migrate_misplaced_transhuge_page().

    Signed-off-by: Wanpeng Li
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the page
    at various VM_BUG_ON sites, I've noticed that the page dump is quite
    useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
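
    The macro is essentially the following sketch (the upstream definition may
    differ in detail, e.g. in what it passes to dump_page()):

    #ifdef CONFIG_DEBUG_VM
    #define VM_BUG_ON_PAGE(cond, page)                                      \
            do {                                                            \
                    if (unlikely(cond)) {                                   \
                            dump_page(page);        /* show page state */   \
                            BUG();                                          \
                    }                                                       \
            } while (0)
    #else
    #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
    #endif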

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

8 commits

  • fail_migrate_page() isn't used anywhere, so remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Some parts of putback_lru_pages() and putback_movable_pages() are
    duplicated, so it can be confusing which one we should use. We can remove
    putback_lru_pages() since it is not really needed now. This makes the code
    easier to understand and maintain.

    Also, the comment on putback_movable_pages() is stale now, so fix it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    We should remove the page from the list if we fail with ENOSYS, since
    migrate_pages() considers error cases other than -ENOMEM and -EAGAIN as
    permanent failures and assumes that the page has been removed from the
    list. Without this patch, we could overcount the number of failures.

    In addition, we should put back the new hugepage if
    !hugepage_migration_support(). If not, we would leak hugepage memory.
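
    For the hugepage part, the fix amounts to something like this sketch at
    the start of the hugepage migration path (hpage is the isolated source
    hugepage):

    /* unmap_and_move_huge_page(), simplified */
    if (!hugepage_migration_support(page_hstate(hpage))) {
            /* give the isolated hugepage back instead of leaking it */
            putback_active_hugepage(hpage);
            return -ENOSYS;
    }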

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Let's add a comment about where the failed page goes, which makes the
    code more readable.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Wanpeng Li
    Acked-by: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A low local/remote numa hinting fault ratio is potentially explained by
    failed migrations. This patch adds a tracepoint that fires when
    migration fails due to migration rate limitation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NUMA migrate rate limiting protects a migration counter and window using
    a lock but in some cases this can be a contended lock. It is not
    critical that the number of pages be perfect, lost updates are
    acceptable. Reduce the importance of this lock.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • numamigrate_update_ratelimit and numamigrate_isolate_page only have
    callers in mm/migrate.c. This patch makes them static.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Each rmap traversal case differs slightly, so we need function pointers
    and arguments to them in order to handle these differences.

    For this purpose, struct rmap_walk_control is introduced in this patch,
    and it will be extended in a following patch. The introduction and the
    extension are kept separate because that clarifies the changes.
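
    The control structure starts out minimal, roughly as in this sketch (the
    exact fields at this stage of the series may differ slightly):

    struct rmap_walk_control {
            void *arg;      /* private data handed through to rmap_one */
            int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                            unsigned long addr, void *arg);
    };

    int rmap_walk(struct page *page, struct rmap_walk_control *rwc);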

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

22 Dec, 2013

1 commit

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
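
    In interface terms the change looks roughly like this sketch of the
    prototype (not necessarily the exact upstream signature); ordinary callers
    pass 0 for extra_count, while aio passes the extra references it holds:

    int migrate_page_move_mapping(struct address_space *mapping,
                                  struct page *newpage, struct page *page,
                                  struct buffer_head *head,
                                  enum migrate_mode mode,
                                  int extra_count);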

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
     

19 Dec, 2013

5 commits

  • THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a PMD changes during a THP migration then migration aborts but the
    failure path is doing more work than is necessary.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MMU notifiers must be called on THP page migration or secondary MMUs
    will get very confused.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.
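
    On the get_user_pages_fast() side, the idea is simply to bail out of the
    lockless walk when a NUMA-hinting PMD is seen, e.g. this sketch of the
    x86 gup_pmd_range() check:

    /* arch/x86/mm/gup.c: gup_pmd_range(), simplified */
    if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
            return 0;       /* fall back to the slow, lock-taking path */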

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Nov, 2013

1 commit

  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);
            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
            ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.
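
    The resulting helper looks roughly like this sketch (gigantic-page copying
    is elided):

    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i;
            int nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page: the hstate knows the size */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
                    if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                            /* gigantic page: copy it chunk by chunk (elided) */
                            return;
                    }
            } else {
                    /* thp page: no hstate needed, use the compound order */
                    VM_BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }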

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen