08 Apr, 2020

2 commits

  • If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor
    CONFIG_MIGRATION, then non_swap_entry() will return 0, meaning that the
    condition (non_swap_entry(entry) && is_device_private_entry(entry)) in
    zap_pte_range() will never be true even if the entry is a device private
    one.

    Equally, any other code depending on non_swap_entry() will not function
    as expected; a sketch of the pre-fix logic follows this entry.

    I originally spotted this just by looking at the code; I haven't
    actually observed any problems.

    Looking a bit more closely, it appears that this situation (currently,
    at least) cannot actually occur:

    DEVICE_PRIVATE depends on ZONE_DEVICE
    ZONE_DEVICE depends on MEMORY_HOTREMOVE
    MEMORY_HOTREMOVE depends on MIGRATION

    Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory")
    Signed-off-by: Steven Price
    Signed-off-by: Andrew Morton
    Cc: Jérôme Glisse
    Cc: Arnd Bergmann
    Cc: Dan Williams
    Cc: John Hubbard
    Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com
    Signed-off-by: Linus Torvalds

    Steven Price
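
    For reference, a sketch of the pre-fix logic, reduced from
    include/linux/swapops.h:

    #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
    static inline int non_swap_entry(swp_entry_t entry)
    {
            return swp_type(entry) >= MAX_SWAPFILES;
    }
    #else
    /* With only CONFIG_DEVICE_PRIVATE set, device private entries still
     * use swap types >= MAX_SWAPFILES, but this stub can never report
     * them. */
    static inline int non_swap_entry(swp_entry_t entry)
    {
            return 0;
    }
    #endif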
     
  • For both swap and page migration, we use bit 2 of the entry to
    identify whether the entry is uffd write-protected. It plays a similar
    role to the existing soft dirty bit in swap entries, but only for
    keeping the uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry back to the PTE bit, we also need to take
    care of the _PAGE_RW bit and make sure it's cleared; otherwise, even
    with the _PAGE_UFFD_WP bit, we can't trap the write at all (see the
    sketch after this entry).

    In change_pte_range() we do nothing for uffd if the PTE is a swap
    entry. That can lead to data mismatch if the page we are going to
    write-protect is swapped out when sending the UFFDIO_WRITEPROTECT.
    This patch applies/removes the uffd-wp bit for swap entries too.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
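
    A sketch of the restore path described above, as a kernel-context
    fragment (not standalone); the helper names follow the
    pte_swp_uffd_wp()/pte_mkuffd_wp() API this series adds:

    /* When a swap/migration entry carrying the uffd-wp bit becomes a
     * present PTE again, transfer the bit and drop write permission,
     * otherwise the next write cannot be trapped. */
    if (pte_swp_uffd_wp(old_pte)) {
            pte = pte_mkuffd_wp(pte);       /* set _PAGE_UFFD_WP */
            pte = pte_wrprotect(pte);       /* clear _PAGE_RW */
    }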
     

17 Jul, 2019

1 commit

  • The whole header file deals with swap entries and PTEs, none of which
    can exist for nommu builds. The current nommu ports have lots of stubs
    to allow the inline functions in swapops.h to compile, but as none of
    this functionality is actually used, there is no point in even
    providing it. This way we don't have to provide the stubs for the
    upcoming RISC-V nommu port, and can eventually remove them from the
    existing ports.

    Link: http://lkml.kernel.org/r/20190703122359.18200-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Vladimir Murzin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose to
    store values or pointers in the xarray and they are both first-class
    citizens (a sketch of the encoding follows this entry).

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
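
    A minimal sketch of the value-entry API described above, using the
    helpers from <linux/xarray.h>:

    #include <linux/xarray.h>

    /* A value entry stores an unsigned long shifted up by one with the
     * low bit set, so it can never be confused with a kernel pointer. */
    static void xarray_value_example(struct xarray *xa)
    {
            void *entry;

            xa_store(xa, 7, xa_mk_value(42), GFP_KERNEL);
            entry = xa_load(xa, 7);
            if (xa_is_value(entry))
                    pr_info("stored value: %lu\n", xa_to_value(entry));
    }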
     

24 Aug, 2018

2 commits

  • Use the new return type vm_fault_t for fault handlers. For now, this
    just documents that the function returns a VM_FAULT value rather than
    an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t as well.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6 (see
    the sketch after this entry).

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
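
    A sketch of the conversion pattern; example_fault() and do_something()
    are hypothetical, vmf_error() is the helper named above:

    /* Fault handlers return vm_fault_t instead of a bare errno;
     * vmf_error() maps -ENOMEM to VM_FAULT_OOM and anything else to
     * VM_FAULT_SIGBUS. */
    static vm_fault_t example_fault(struct vm_fault *vmf)
    {
            int err = do_something(vmf);    /* hypothetical helper */

            if (err)
                    return vmf_error(err);
            return VM_FAULT_NOPAGE;
    }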
     
  • A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
    allocate a page that was just freed in the course of soft-offline.
    This is undesirable because soft-offline (which is about corrected
    errors) is less aggressive than hard-offline (which is about
    uncorrected errors), and we can make soft-offline fail and keep using
    the page for a good reason like "system is busy."

    The two main changes of this patch are:

    - setting the migrate type of the target page to MIGRATE_ISOLATE. As
      done in free_unref_page_commit(), this makes the kernel bypass the
      pcplist when freeing the page, so we can assume that the page is on
      the freelist just after put_page() returns;

    - setting PG_hwpoison on a free page under zone->lock, which protects
      the freelists. This allows us to avoid setting PG_hwpoison on a page
      that has already been chosen for allocation (see the sketch after
      this entry).

    [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
    Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
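
    A reduced sketch of the zone->lock pattern from the second change; the
    real set_hwpoison_free_buddy_page() also walks the buddy orders to
    find the page's buddy head, which is omitted here:

    /* Mark a page hwpoisoned only while zone->lock is held and the page
     * is verifiably still free, so the flag cannot race with allocation. */
    bool set_hwpoison_free_buddy_page(struct page *page)
    {
            struct zone *zone = page_zone(page);
            unsigned long flags;
            bool hwpoisoned = false;

            spin_lock_irqsave(&zone->lock, flags);
            if (PageBuddy(page) && !TestSetPageHWPoison(page))
                    hwpoisoned = true;      /* simplified: order-0 only */
            spin_unlock_irqrestore(&zone->lock, flags);

            return hwpoisoned;
    }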
     

22 Jan, 2018

1 commit

  • Tetsuo reported random crashes under memory pressure on a 32-bit x86
    system and tracked them down to the change that introduced
    page_vma_mapped_walk().

    The root cause of the issue is the faulty pointer math in check_pte().
    As ->pte may point to an arbitrary page, we have to check that it
    belongs to the section before doing the math. Otherwise it may lead to
    weird results.

    It wasn't noticed until now because mem_map[] is virtually contiguous
    on flatmem and vmemmap sparsemem, where pointer arithmetic just works
    against all 'struct page' pointers. But with classic sparsemem it
    doesn't, because each section's memmap is allocated separately, so
    consecutive pfns crossing two sections might have struct pages at
    completely unrelated addresses.

    Let's restructure the code a bit and replace the pointer arithmetic
    with operations on pfns (see the sketch after this entry).

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
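
    A sketch of the pfn-based containment check that replaces the pointer
    math, following the shape of the pfn_in_hpage() helper:

    /* Compare pfns instead of subtracting 'struct page' pointers, which
     * is only valid within a single sparsemem section's memmap. */
    static bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
    {
            unsigned long hpage_pfn = page_to_pfn(hpage);

            return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
    }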
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boilerplate
    text (an example follows this entry).

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
      >5 lines of source.
    - File already had some variant of a license header in it (even if
      <95% scan confidence).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
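
    For a header file, the added identifier is a single comment line at
    the top of the file, e.g.:

    /* SPDX-License-Identifier: GPL-2.0 */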
     

09 Sep, 2017

3 commits

  • HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for
    HMM and migration to device memory are explained in the HMM core
    patch.

    This patch deals with device memory that is unaddressable (i.e. the
    CPU cannot access it). Hence we do not want those struct pages to be
    managed like regular memory. That is why we extend ZONE_DEVICE to
    support different types of memory.

    A persistent memory type is defined for the existing users of
    ZONE_DEVICE, and a new device unaddressable type is added for the
    unaddressable memory. There is a clear separation between what is
    expected from each memory type: existing users of ZONE_DEVICE are
    unaffected by the new requirements and the new use of the
    unaddressable type. All type-specific code paths are protected by
    tests against the memory type.

    Because the memory is unaddressable, we use a new special swap type
    for when a page is migrated to device memory (this reduces the maximum
    number of swap files); a sketch of these entries follows this entry.

    The two main additions besides the memory type are two callbacks. The
    first, page_free(), is called whenever the page refcount reaches 1
    (which means the page is free, as a ZONE_DEVICE page never reaches a
    refcount of 0). This allows the device driver to manage its memory and
    the associated struct page.

    The second callback, page_fault(), runs when there is a CPU access to
    an address that is backed by a device page (which is unaddressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block this migration
    back; HMM makes sure that such pages cannot be pinned in device
    memory.

    If the device is in an error condition and cannot migrate the memory
    back, a CPU page fault on device memory should end with SIGBUS.

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
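
    A sketch of the new special swap entries, following the swapops.h
    helpers this patch introduces (names as of this series; later kernels
    renamed them):

    /* Device private pages are encoded as non-present swap entries using
     * dedicated swap types above MAX_SWAPFILES. */
    static inline swp_entry_t make_device_private_entry(struct page *page,
                                                        bool write)
    {
            return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
                             page_to_pfn(page));
    }

    static inline bool is_device_private_entry(swp_entry_t entry)
    {
            int type = swp_type(entry);

            return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
    }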
     
  • The soft dirty bit is designed to be tracked across page migration.
    This patch makes it work in the same manner for thp migration too.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Add thp migration's core code, including conversions between a PMD entry
    and a swap entry, setting PMD migration entry, removing PMD migration
    entry, and waiting on PMD migration entries.

    This patch makes it possible to support thp migration. If you fail to
    allocate a destination page as a thp, you just split the source thp as
    we do now, and then enter the normal page migration. If you succeed in
    allocating a destination thp, you enter thp migration. Subsequent
    patches actually enable thp migration for each caller of page
    migration by allowing its get_new_page() callback to allocate thps.
    (A sketch of the PMD/swap-entry conversions follows this entry.)

    [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
    Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
    [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
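
    A sketch of the PMD <-> swap entry conversion, reduced from the
    helpers this patch adds:

    static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
    {
            swp_entry_t arch_entry = __pmd_to_swp_entry(pmd);

            return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
    }

    /* A migration PMD is a non-present PMD whose swap type marks it as a
     * migration entry. */
    static inline int is_pmd_migration_entry(pmd_t pmd)
    {
            return !pmd_present(pmd) &&
                   is_migration_entry(pmd_to_swp_entry(pmd));
    }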
     

11 Jul, 2017

1 commit

  • We'd like to narrow down the error region of a memory error on hugetlb
    pages. However, currently we set PageHWPoison flags on all subpages of
    the error hugepage and add the number of subpages to
    num_hwpoison_pages, which doesn't fit our purpose.

    So this patch changes the behavior: we only set PageHWPoison on the
    head page and increase num_hwpoison_pages only by 1. This is a
    preparation for the narrow-down part, which comes in later patches.

    Link: http://lkml.kernel.org/r/1496305019-5493-4-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

09 Sep, 2015

2 commits

  • Wanpeng Li reported a race between soft_offline_page() and
    unpoison_memory(), which causes the following kernel panic:

    BUG: Bad page state in process bash pfn:97000
    page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
    flags: 0x1fffff80080048(uptodate|active|swapbacked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x40(active)
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
    CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    Call Trace:
    dump_stack+0x48/0x5c
    bad_page+0xe6/0x140
    free_pages_prepare+0x2f9/0x320
    ? uncharge_list+0xdd/0x100
    free_hot_cold_page+0x40/0x170
    __put_single_page+0x20/0x30
    put_page+0x25/0x40
    unmap_and_move+0x1a6/0x1f0
    migrate_pages+0x100/0x1d0
    ? kill_procs+0x100/0x100
    ? unlock_page+0x6f/0x90
    __soft_offline_page+0x127/0x2a0
    soft_offline_page+0xa6/0x200

    This race is explained like below:

    CPU0                                  CPU1

    soft_offline_page
      __soft_offline_page
        TestSetPageHWPoison
                                          unpoison_memory
                                            PageHWPoison check (true)
                                            TestClearPageHWPoison
                                            put_page -> release refcount held by
                                                        get_hwpoison_page in unpoison_memory
        put_page -> release refcount held by
                    isolate_lru_page in __soft_offline_page
        migrate_pages

    The second put_page() releases the refcount held by isolate_lru_page(),
    which leads to unmap_and_move() releasing the last refcount of the page
    with the mapcount still 1, since try_to_unmap() is not called if there
    is only one user mapping the page. Either way, the page refcount and
    mapcount will end up inconsistent if the page is mapped by multiple
    users.

    This race was introduced by commit 4491f71260 ("mm/memory-failure: set
    PageHWPoison before migrate_pages()"), which focuses on preventing the
    reuse of a successfully migrated page. Before this commit we prevented
    reuse by changing the migratetype to MIGRATE_ISOLATE during soft
    offlining, which has the following problems, so simply reverting the
    commit is not the best option:

    1) it doesn't eliminate the reuse completely, because
    set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
    target page if the pageblock of the page contains one or more
    unmovable pages (i.e. has_unmovable_pages() returns true).

    2) the original code changes migratetype to MIGRATE_ISOLATE
    forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
    regardless of the original migratetype state, which could impact
    other subsystems like memory hotplug or compaction.

    This patch moves PageSetHWPoison just after put_page() in
    unmap_and_move(), which closes up the reported race window and
    minimizes another race window between SetPageHWPoison and reallocation
    (which causes the reuse of a soft-offlined page). The latter race
    window still exists but it's acceptable, because it's rare and
    effectively the same as the ordinary "containment failure" case even
    if it happens, so keeping the window open is acceptable.

    Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
    Signed-off-by: Wanpeng Li
    Signed-off-by: Naoya Horiguchi
    Reported-by: Wanpeng Li
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The num_poisoned_pages counter will be changed outside
    mm/memory-failure.c by a subsequent patch, so this patch prepares
    wrappers to manipulate it (see the sketch after this entry).

    Signed-off-by: Naoya Horiguchi
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
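
    A sketch of the wrappers described above, reduced from the helpers
    this patch introduces:

    extern atomic_long_t num_poisoned_pages;

    static inline void num_poisoned_pages_inc(void)
    {
            atomic_long_inc(&num_poisoned_pages);
    }

    static inline void num_poisoned_pages_dec(void)
    {
            atomic_long_dec(&num_poisoned_pages);
    }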
     

13 Feb, 2015

1 commit

  • This patch removes the NUMA PTE bits and associated helpers; a sketch
    of the replacement check follows this entry. As a side effect it
    increases the maximum possible swap space on x86-64.

    One potential source of problems is races between the marking of PTEs
    PROT_NONE, NUMA hinting faults and migration. It must be guaranteed
    that a PTE being protected is not faulted in parallel, seen as a
    pte_none and corrupting memory. The base case is safe, but transhuge
    had problems in the past due to a different migration mechanism and a
    dependence on the page lock to serialise migrations, so it warrants a
    closer look.

    task_work hinting update                parallel fault
    ------------------------                --------------
    change_pmd_range
      change_huge_pmd
        __pmd_trans_huge_lock
          pmdp_get_and_clear
                                            __handle_mm_fault
                                              pmd_none
                                              do_huge_pmd_anonymous_page
                                              read? pmd_lock blocks until hinting
                                                    complete, fail !pmd_none test
                                              write? __do_huge_pmd_anonymous_page
                                                     acquires pmd_lock, checks pmd_none
          pmd_modify
          set_pmd_at

    task_work hinting update                parallel migration
    ------------------------                ------------------
    change_pmd_range
      change_huge_pmd
        __pmd_trans_huge_lock
          pmdp_get_and_clear
                                            __handle_mm_fault
                                              do_huge_pmd_numa_page
                                                migrate_misplaced_transhuge_page
                                                pmd_lock waits for updates to
                                                complete, recheck pmd_same
          pmd_modify
          set_pmd_at

    Both of those are safe and the case where a transhuge page is inserted
    during a protection update is unchanged. The case where two processes try
    migrating at the same time is unchanged by this series so should still be
    ok. I could not find a case where we are accidentally depending on the
    PTE not being cleared and flushed. If one is missed, it'll manifest as
    corruption problems that start triggering shortly after this series is
    merged and only happen when NUMA balancing is enabled.

    Signed-off-by: Mel Gorman
    Tested-by: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Mark Brown
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
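
    A sketch of the replacement check (x86 flavour): with the NUMA bits
    gone, a NUMA hinting fault is recognised as a PROT_NONE pte inside a
    VMA that is not actually PROT_NONE:

    static inline int pte_protnone(pte_t pte)
    {
            /* _PAGE_PROTNONE set with _PAGE_PRESENT clear: either a NUMA
             * hinting pte or a real PROT_NONE mapping, disambiguated by
             * the VMA flags. */
            return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                    == _PAGE_PROTNONE;
    }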
     

12 Feb, 2015

1 commit

  • We have a race condition between move_pages() and freeing hugepages,
    where move_pages() calls follow_page(FOLL_GET) for hugepages internally
    and tries to get their refcount without preventing concurrent freeing.
    This race crashes the kernel, so this patch fixes it by moving the
    FOLL_GET code for hugepages into follow_huge_pmd() while taking the
    page table lock.

    This patch intentionally removes the page==NULL check after pte_page.
    This is justified because pte_page() never returns NULL for any
    architecture or configuration.

    This patch changes the behavior of follow_huge_pmd() for tail pages so
    that tail pages can be pinned/returned. The caller must therefore be
    changed to properly handle the returned tail pages.

    We could add similar locking to follow_huge_(addr|pud) for consistency,
    but it's not necessary because those functions currently don't support
    the FOLL_GET flag, so let's leave that for future development.

    Here is the reproducer:

    $ cat movepages.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <numaif.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000
    #define PS 0x1000

    int main(int argc, char *argv[]) {
            int i;
            int nr_hp = strtol(argv[1], NULL, 0);
            int nr_p = nr_hp * HPS / PS;
            int ret;
            void **addrs;
            int *status;
            int *nodes;
            pid_t pid;

            pid = strtol(argv[2], NULL, 0);
            addrs = malloc(sizeof(char *) * nr_p + 1);
            status = malloc(sizeof(char *) * nr_p + 1);
            nodes = malloc(sizeof(char *) * nr_p + 1);

            while (1) {
                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 1;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err("move_pages");

                    for (i = 0; i < nr_p; i++) {
                            addrs[i] = (void *)ADDR_INPUT + i * PS;
                            nodes[i] = 0;
                            status[i] = 0;
                    }
                    ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                          MPOL_MF_MOVE_ALL);
                    if (ret == -1)
                            err("move_pages");
            }
            return 0;
    }

    $ cat hugepage.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000

    int main(int argc, char *argv[]) {
            int nr_hp = strtol(argv[1], NULL, 0);
            char *p;

            while (1) {
                    p = mmap((void *)ADDR_INPUT, nr_hp * HPS,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                    if (p != (void *)ADDR_INPUT) {
                            perror("mmap");
                            break;
                    }
                    memset(p, 0, nr_hp * HPS);
                    munmap(p, nr_hp * HPS);
            }
    }

    $ sysctl vm.nr_hugepages=40
    $ ./hugepage 10 &
    $ ./movepages 10 $(pgrep -f hugepage)

    Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: stable@vger.kernel.org [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

1 commit

  • _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
    faults on x86. Care is taken such that _PAGE_NUMA is used only in
    situations where the VMA flags distinguish between NUMA hinting faults
    and prot_none faults. This decision was x86-specific and conceptually
    difficult, requiring special casing to distinguish between PROTNONE and
    NUMA ptes based on context.

    Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
    between an entry that is really unmapped and a page that is protected
    for NUMA hinting faults; if the PTE is not present, then a fault will
    be trapped.

    Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
    This patch shrinks the maximum possible swap size and uses the bit to
    uniquely distinguish between NUMA hinting ptes and swap ptes.

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Nov, 2013

1 commit

  • Hugetlb supports multiple page sizes. We use the split lock only at
    the PMD level, but not at the PUD level (see the sketch after this
    entry).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
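
    A sketch of the PMD-vs-PUD distinction, reduced from the
    huge_pte_lockptr() helper:

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm,
                                               pte_t *pte)
    {
            /* Only PMD-sized hugepages use the split page-table lock;
             * larger sizes fall back to mm->page_table_lock. */
            if (huge_page_size(h) == PMD_SIZE)
                    return pmd_lockptr(mm, (pmd_t *) pte);
            VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
            return &mm->page_table_lock;
    }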
     

14 Aug, 2013

1 commit

  • Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit
    set gets swapped out, the bit is lost and no longer available when the
    pte is read back.

    To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is
    saved in the pte entry for the page being swapped out. When such a
    page is to be read back from the swap cache, we check for the bit's
    presence and, if it's there, we clear it and restore the former
    _PAGE_SOFT_DIRTY bit.

    One of the problems was to find a place in the pte entry where we can
    save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. _PAGE_PSE
    was chosen for that; it doesn't intersect with the swap entry format
    stored in the pte (see the sketch after this entry).

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Reviewed-by: Minchan Kim
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
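
    A sketch of the x86 helpers involved (in the tree the bit is spelled
    _PAGE_SWP_SOFT_DIRTY and reuses _PAGE_PSE):

    static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
    {
            return pte_set_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }

    static inline int pte_swp_soft_dirty(pte_t pte)
    {
            return pte_flags(pte) & _PAGE_SWP_SOFT_DIRTY;
    }

    static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
    {
            return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
    }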
     

13 Jun, 2013

1 commit

  • When we have a page fault for an address which is backed by a hugepage
    under migration, the kernel can't wait correctly and busy-loops on the
    hugepage fault until the migration finishes. As a result, users who
    try to kick off hugepage migration (via soft offlining, for example)
    occasionally experience long delays or soft lockups.

    This is because pte_offset_map_lock() can't get a correct migration
    entry or a correct page table lock for a hugepage. This patch
    introduces migration_entry_wait_huge() to solve this.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: KOSAKI Motohiro
    Cc: stable@vger.kernel.org [2.6.35+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Jun, 2012

1 commit

  • Minchan Kim reports that when a system has many swap areas, and tmpfs
    swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
    back the page cannot locate it, and the read fails with -ENOMEM.

    Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
    swp_entry_to_pte()) technique for determining maximum usable swap
    offset, without stopping to realize that that actually depends upon the
    pte swap encoding shifting swap offset to the higher bits and truncating
    it there. Whereas our radix_tree swap encoding leaves offset in the
    lower bits: it's swap "type" (that is, index of swap area) that was
    truncated.

    Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
    broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header()
    (a model of the entry layout follows this entry).

    This does not reduce the usable size of a swap area any further, it
    leaves it as claimed when making the original commit: no change from 3.0
    on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
    per swapfile on i386 with PAE. It's not a change I would have risked
    five years ago, but with x86_64 supported for ten years, I believe it's
    appropriate now.

    Hmm, and what if some architecture implements its swap pte with offset
    encoded below type? That would equally break the maximum usable swap
    offset check. Happily, they all follow the same tradition of encoding
    offset above type, but I'll prepare a check on that for next.

    Reported-and-Reviewed-and-Tested-by: Minchan Kim
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
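
    A minimal, userspace-runnable model of the generic swp_entry_t layout
    from swapops.h (constants illustrative): type lives in the top bits
    and offset below, so reserving radix-tree tag bits must come out of
    SWP_TYPE_SHIFT rather than the offset:

    #include <assert.h>

    #define MAX_SWAPFILES_SHIFT 5
    /* Reducing the shift reserves room at the top of the word, e.g. for
     * the radix-tree exceptional tag this fix accounts for. */
    #define SWP_TYPE_SHIFT \
            ((sizeof(unsigned long) * 8) - MAX_SWAPFILES_SHIFT - 1)

    typedef struct { unsigned long val; } swp_entry_t;

    static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
    {
            swp_entry_t e = { (type << SWP_TYPE_SHIFT) | offset };
            return e;
    }

    static unsigned long swp_type(swp_entry_t e)
    {
            return e.val >> SWP_TYPE_SHIFT;
    }

    static unsigned long swp_offset(swp_entry_t e)
    {
            return e.val & ((1UL << SWP_TYPE_SHIFT) - 1);
    }

    int main(void)
    {
            swp_entry_t e = swp_entry(8, 12345);    /* the ninth swap area */

            assert(swp_type(e) == 8 && swp_offset(e) == 12345);
            return 0;
    }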
     

05 Mar, 2012

1 commit

  • If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
    other BUG variant in a static inline (i.e. not in a #define) then
    that header really should be including <linux/bug.h> and not just
    expecting it to be implicitly present.

    We can make this change risk-free, since if the files using these
    headers didn't have exposure to linux/bug.h already, they would have
    been causing compile failures/warnings.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

04 Aug, 2011

1 commit

  • If swap entries are to be stored along with struct page pointers in a
    radix tree, they need to be distinguished as exceptional entries.

    Most of the handling of swap entries in radix tree will be contained in
    shmem.c, but a few functions in filemap.c's common code need to check
    for their appearance: find_get_page(), find_lock_page(),
    find_get_pages() and find_get_pages_contig().

    So as not to slow their fast paths, tuck those checks inside the
    existing checks for unlikely radix_tree_deref_slot(); except for
    find_lock_page(), where it is an added test. And make it a BUG in
    find_get_pages_tag(), which is not applied to tmpfs files.

    A part of the reason for eliminating shmem_readpage() earlier, was to
    minimize the places where common code would need to allow for swap
    entries.

    The swp_entry_t known to swapfile.c must be massaged into a slightly
    different form when stored in the radix tree, just as it gets massaged
    into a pte_t when stored in page tables.

    In an i386 kernel this limits its information (type and page offset) to
    30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
    swapfile size of 128GB. Which is less than the 512GB we previously
    allowed with X86_PAE (where the swap entry can occupy the entire upper
    32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
    there's not a new limitation on 64-bit (where swap filesize is already
    limited to 16TB by a 32-bit page offset). Thirty areas of 128GB is
    probably still enough swap for a 64GB 32-bit machine.

    Provide the swp_to_radix_entry() and radix_to_swp_entry() conversions
    (sketched after this entry), and enforce the filesize limit in
    read_swap_header(), just as for ptes.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
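
    A sketch of the conversions, close to what swapops.h gained here:

    /* Shift the swap entry up and tag the low bits so the result can
     * never be mistaken for a struct page pointer in the radix tree. */
    static inline void *swp_to_radix_entry(swp_entry_t entry)
    {
            unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

            return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
    }

    static inline swp_entry_t radix_to_swp_entry(void *arg)
    {
            swp_entry_t entry;

            entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
            return entry;
    }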
     

16 Sep, 2009

1 commit

  • Memory migration uses special swap entry types to trigger special
    actions on page faults. Extend this mechanism to also support poisoned
    swap entries, to trigger poison handling on page faults. This allows
    follow-on patches to prevent processes from faulting in poisoned pages
    again (see the sketch after this entry).

    v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
    v3: Better overflow fix (Hidehiro Kawai)

    Signed-off-by: Andi Kleen

    Andi Kleen
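
    A sketch of the poisoned swap entries, reduced from the helpers added
    to swapops.h:

    /* A poisoned page is represented by a swap entry of the dedicated
     * SWP_HWPOISON type, so a later fault can trigger poison handling. */
    static inline swp_entry_t make_hwpoison_entry(struct page *page)
    {
            BUG_ON(!PageLocked(page));
            return swp_entry(SWP_HWPOISON, page_to_pfn(page));
    }

    static inline int is_hwpoison_entry(swp_entry_t entry)
    {
            return swp_type(entry) == SWP_HWPOISON;
    }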
     

10 Feb, 2008

1 commit

  •   CC      mm/vmscan.o
    In file included from /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/vmscan.c:44:
    /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h: In function 'is_swap_pte':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h:48: error: implicit declaration of function 'pte_none'
    /home/bunk/linux/kernel-2.6/git/linux-2.6/include/linux/swapops.h:48: error: implicit declaration of function 'pte_present'

    Does it ever make sense to ask "is this pte a swap entry?" on a machine
    with no MMU? Presumably this also means it has no ptes, right? In which
    case, it's better to comment the whole function out (it is shown after
    this entry); then, when someone tries to ask the above meaningless
    question, they get a compile error rather than a meaningless answer.

    Signed-off-by: Matt Mackall
    Cc: Mike Frysinger
    Reported-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
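
    The function in question; guarding it means a nommu caller gets a
    compile error rather than a meaningless answer:

    #ifdef CONFIG_MMU
    /* A swap pte is one that is not empty but also not present. */
    static inline int is_swap_pte(pte_t pte)
    {
            return !pte_none(pte) && !pte_present(pte);
    }
    #endif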
     

21 Feb, 2007

1 commit

  • allnoconfig:

    mm/mincore.c: In function 'do_mincore':
    mm/mincore.c:122: warning: unused variable 'entry'

    Yet another entry in the why-macros-are-wrong encyclopedia.

    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

23 Jun, 2006

1 commit

  • Implement read/write migration ptes.

    We take the upper two swap types for the two kinds of migration ptes
    and define a series of macros in swapops.h (sketched after this
    entry).

    The VM is modified to handle the migration entries. Migration entries
    can only be encountered when the page they are pointing to is locked.
    This limits the number of places one has to fix. We also check in
    copy_pte_range and in mprotect_pte_range() for migration ptes.

    We check for migration ptes in do_swap_cache and call a function that
    will then wait on the page lock. This allows us to effectively stop
    all accesses to a page.

    Migration entries are created by try_to_unmap if called for migration
    and removed by local functions in migrate.c.

    From: Hugh Dickins

    Several times while testing swapless page migration (I've no NUMA, just
    hacking it up to migrate recklessly while running load), I've hit the
    BUG_ON(!PageLocked(p)) in migration_entry_to_page.

    This comes from an orphaned migration entry, unrelated to the current
    correctly locked migration, but hit by remove_anon_migration_ptes as it
    checks an address in each vma of the anon_vma list.

    Such an orphan may be left behind if an earlier migration raced with fork:
    copy_one_pte can duplicate a migration entry from parent to child, after
    remove_anon_migration_ptes has checked the child vma, but before it has
    removed it from the parent vma. (If the process were later to fault on this
    orphaned entry, it would hit the same BUG from migration_entry_wait.)

    This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
    not. There's no such problem with file pages, because vma_prio_tree_add
    adds child vma after parent vma, and the page table locking at each end is
    enough to serialize. Follow that example with anon_vma: add new vmas to the
    tail instead of the head.

    (There's no corresponding problem when inserting migration entries,
    because a missed pte will leave the page count and mapcount high, which is
    allowed for. And there's no corresponding problem when migrating via swap,
    because a leftover swap entry will be correctly faulted. But the swapless
    method has no refcounting of its entries.)

    From: Ingo Molnar

    pte_unmap_unlock() takes the pte pointer as an argument.

    From: Hugh Dickins

    Several times while testing swapless page migration, gcc has tried to exec
    a pointer instead of a string: smells like COW mappings are not being
    properly write-protected on fork.

    The protection in copy_one_pte looks very convincing, until at last you
    realize that the second arg to make_migration_entry is a boolean "write",
    and SWP_MIGRATION_READ is 30.

    Anyway, it's better done like in change_pte_range, using
    is_write_migration_entry and make_migration_entry_read.

    From: Hugh Dickins

    Remove unnecessary obfuscation from sys_swapon's range check on swap type,
    which blew up causing memory corruption once swapless migration made
    MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Hugh Dickins
    Signed-off-by: Christoph Lameter
    Signed-off-by: Ingo Molnar
    From: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
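
    A sketch of the migration entries described above, close to the
    swapops.h helpers this patch adds (note the int 'write' parameter
    that Hugh's fix above is about):

    /* The two swap types above MAX_SWAPFILES encode read and write
     * migration entries; the pfn identifies the locked page. */
    static inline swp_entry_t make_migration_entry(struct page *page,
                                                   int write)
    {
            BUG_ON(!PageLocked(page));
            return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
                             page_to_pfn(page));
    }

    static inline int is_write_migration_entry(swp_entry_t entry)
    {
            return swp_type(entry) == SWP_MIGRATION_WRITE;
    }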
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds