05 Feb, 2013

1 commit


19 Dec, 2012

1 commit

  • If the kernel is built with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
    CONFIG_CGROUP_HUGETLB=y, and the hugepagesz=xx boot option is specified,
    the system will fail to boot.

    This failure is caused by the following code path, in which kzalloc() is
    reached while the hugepagesz= boot option is parsed, too early in boot for
    the allocation to succeed:

    setup_hugepagesz
      hugetlb_add_hstate
        hugetlb_cgroup_file_init
          cgroup_add_cftypes
            kzalloc
    Signed-off-by: Jiang Liu
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

14 Dec, 2012

2 commits

  • Merge misc VM changes from Andrew Morton:
    "The rest of most-of-MM. The other MM bits await a slab merge.

    This patch includes the addition of a huge zero_page. Not a
    performance boost but it can save large amounts of physical memory in
    some situations.

    Also a bunch of Fujitsu engineers are working on memory hotplug.
    Which, as it turns out, was badly broken. About half of their patches
    are included here; the remainder are 3.8 material."

    However, this merge disables CONFIG_MOVABLE_NODE, which was totally
    broken. We don't add new features with "default y", nor do we add
    Kconfig questions that are incomprehensible to most people without any
    help text. Does the feature even make sense without compaction or
    memory hotplug?

    * akpm: (54 commits)
    mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
    mm/memory.c: remove unused code from do_wp_page()
    asm-generic, mm: pgtable: consolidate zero page helpers
    mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
    hwpoison, hugetlbfs: fix RSS-counter warning
    hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
    mm: protect against concurrent vma expansion
    memcg: do not check for mm in __mem_cgroup_count_vm_event
    tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
    mm: provide more accurate estimation of pages occupied by memmap
    fs/buffer.c: remove redundant initialization in alloc_page_buffers()
    fs/buffer.c: do not inline exported function
    writeback: fix a typo in comment
    mm: introduce new field "managed_pages" to struct zone
    mm, oom: remove statically defined arch functions of same name
    mm, oom: remove redundant sleep in pagefault oom handler
    mm, oom: cleanup pagefault oom handler
    memory_hotplug: allow online/offline memory to result movable node
    numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
    mm, memcg: avoid unnecessary function call when memcg is disabled
    ...

    Linus Torvalds
     
  • Pull trivial branch from Jiri Kosina:
    "Usual stuff -- comment/printk typo fixes, documentation updates, dead
    code elimination."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    HOWTO: fix double words typo
    x86 mtrr: fix comment typo in mtrr_bp_init
    propagate name change to comments in kernel source
    doc: Update the name of profiling based on sysfs
    treewide: Fix typos in various drivers
    treewide: Fix typos in various Kconfig
    wireless: mwifiex: Fix typo in wireless/mwifiex driver
    messages: i2o: Fix typo in messages/i2o
    scripts/kernel-doc: check that non-void fcts describe their return value
    Kernel-doc: Convention: Use a "Return" section to describe return values
    radeon: Fix typo and copy/paste error in comments
    doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
    various: Fix spelling of "asynchronous" in comments.
    Fix misspellings of "whether" in comments.
    eisa: Fix spelling of "asynchronous".
    various: Fix spelling of "registered" in comments.
    doc: fix quite a few typos within Documentation
    target: iscsi: fix comment typos in target/iscsi drivers
    treewide: fix typo of "suport" in various comments and Kconfig
    treewide: fix typo of "suppport" in various comments
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • Fix the warning from __list_del_entry() which is triggered when a process
    tries to do free_huge_page() for a hwpoisoned hugepage.

    free_huge_page() can be called for hwpoisoned hugepage from
    unpoison_memory(). This function gets refcount once and clears
    PageHWPoison, and then puts refcount twice to return the hugepage back to
    free pool. The second put_page() finally reaches free_huge_page().

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When a process which used a hwpoisoned hugepage tries to exit() or
    munmap(), the kernel can print out "bad pmd" message because page table
    walker in free_pgtables() encounters 'hwpoisoned entry' on pmd.

    This is because currently we fail to clear the hwpoisoned entry in
    __unmap_hugepage_range(), so this patch simply does it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

1 commit


11 Dec, 2012

1 commit

  • This will be used for three kinds of purposes:

    - to optimize mprotect()

    - to speed up working set scanning for working set areas that
    have not been touched

    - to more accurately scan per real working set

    No change in functionality from this patch.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Dec, 2012

1 commit


09 Oct, 2012

6 commits

  • Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards an on-demand paging design to
    be added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their processes' address spaces. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages need to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address space as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a mutex, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
    page from vma") fixed pgoff calculation but it has replaced it by
    vma_hugecache_offset() which is not appropriate for offsets used for
    vma_prio_tree_foreach() because that one expects index in page units
    rather than in huge_page_shift.

    Johannes said:

    : The resulting index may not be too big, but it can be too small: assume
    : hpage size of 2M and the address to unmap to be 0x200000. This is regular
    : page index 512 and hpage index 1. If you have a VMA that maps the file
    : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
    : you ask for offset 1 and miss it even though it does map the page of
    : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
    : cow until the allocation succeeds or the skipped vma(s) go away.

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Signed-off-by: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sachin Kamat
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement for the VMA prio tree is a relatively simple, mechanical job
    (a toy illustration of the preprocessor-template idea follows this entry).

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
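
    A toy, stand-alone sketch of the preprocessor-template idea described
    above. The names (DEFINE_OVERLAP_SEARCH, struct vma_like, vma_find_overlap)
    are invented for illustration, and a linear scan stands in for the
    augmented rbtree that the real code generates; the point is only how the
    node type and its endpoint accessors are supplied to a generic algorithm:

    /* Toy illustration, not the kernel's interval tree. */
    #include <stdio.h>

    /* The "template": the search is written once against a node type and two
     * accessor macros that yield that node's interval endpoints. */
    #define DEFINE_OVERLAP_SEARCH(name, type, START, LAST) \
    static type *name(type *nodes, int n, unsigned long start, unsigned long last) \
    { \
            for (int i = 0; i < n; i++) \
                    if (START(&nodes[i]) <= last && start <= LAST(&nodes[i])) \
                            return &nodes[i]; \
            return NULL; \
    }

    /* The "details": a VMA-like node that does not store its last endpoint
     * explicitly, so the accessor has to compute it. */
    struct vma_like {
            unsigned long vm_pgoff;         /* interval start, in page units */
            unsigned long vm_npages;        /* interval length, in pages */
    };

    #define VMA_START(v)    ((v)->vm_pgoff)
    #define VMA_LAST(v)     ((v)->vm_pgoff + (v)->vm_npages - 1)

    DEFINE_OVERLAP_SEARCH(vma_find_overlap, struct vma_like, VMA_START, VMA_LAST)

    int main(void)
    {
            struct vma_like vmas[] = { { 0, 16 }, { 64, 32 }, { 512, 8 } };
            struct vma_like *hit = vma_find_overlap(vmas, 3, 70, 80);

            if (hit)
                    printf("overlap found at pgoff %lu\n", hit->vm_pgoff);
            return 0;
    }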
     
  • The core page allocator ensures that page flags are zeroed when freeing
    pages via free_pages_check. A number of architectures (ARM, PPC, MIPS)
    rely on this property to treat new pages as dirty with respect to the data
    cache and perform the appropriate flushing before mapping the pages into
    userspace.

    This can lead to cache synchronisation problems when using hugepages,
    since the allocator keeps its own pool of pages above the usual page
    allocator and does not reset the page flags when freeing a page into the
    pool.

    This patch adds a new architecture hook, arch_clear_hugepage_flags, so
    that architectures which rely on the page flags being in a particular
    state for fresh allocations can adjust the flags accordingly when a page
    is freed into the pool.

    Signed-off-by: Will Deacon
    Cc: Michal Hocko
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

01 Aug, 2012

12 commits

  • If a process creates a large hugetlbfs mapping that is eligible for page
    table sharing and forks heavily with children some of whom fault and
    others which destroy the mapping then it is possible for page tables to
    get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
    output a message to the kernel log. The final teardown will trigger a
    BUG_ON in mm/filemap.c.

    This was reproduced in 3.4 but is known to have existed for a long time
    and goes back at least as far as 2.6.37. It was probably introduced in
    2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
    look like this:

    [ ..........] Lots of bad pmd messages followed by this
    [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
    [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
    [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
    [ 127.186778] ------------[ cut here ]------------
    [ 127.186781] kernel BUG at mm/filemap.c:134!
    [ 127.186782] invalid opcode: 0000 [#1] SMP
    [ 127.186783] CPU 7
    [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
    [ 127.186801]
    [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
    [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
    [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
    [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
    [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
    [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
    [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
    [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
    [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
    [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
    [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
    [ 127.186821] Stack:
    [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
    [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
    [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
    [ 127.186827] Call Trace:
    [ 127.186829] [] delete_from_page_cache+0x3b/0x80
    [ 127.186832] [] truncate_hugepages+0x115/0x220
    [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
    [ 127.186837] [] evict+0xa7/0x1b0
    [ 127.186839] [] iput_final+0xd3/0x1f0
    [ 127.186840] [] iput+0x39/0x50
    [ 127.186842] [] d_kill+0xf8/0x130
    [ 127.186843] [] dput+0xd2/0x1a0
    [ 127.186845] [] __fput+0x170/0x230
    [ 127.186848] [] ? rb_erase+0xce/0x150
    [ 127.186849] [] fput+0x1d/0x30
    [ 127.186851] [] remove_vma+0x37/0x80
    [ 127.186853] [] do_munmap+0x2d2/0x360
    [ 127.186855] [] sys_shmdt+0xc9/0x170
    [ 127.186857] [] system_call_fastpath+0x16/0x1b
    [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
    [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
    [ 127.186870] RSP
    [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

    The bug is a race and not always easy to reproduce. To reproduce it I was
    doing the following on a single socket I7-based machine with 16G of RAM.

    $ hugeadm --pool-pages-max DEFAULT:13G
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
    $ for i in `seq 1 9000`; do ./hugetlbfs-test; done

    On my particular machine, it usually triggers within 10 minutes but
    enabling debug options can change the timing such that it never hits.
    Once the bug is triggered, the machine is in trouble and needs to be
    rebooted. The machine will respond but processes accessing proc like "ps
    aux" will hang due to the BUG_ON. shutdown will also hang and needs a
    hard reset or a sysrq-b.

    The basic problem is a race between page table sharing and teardown. For
    the most part page table sharing depends on i_mmap_mutex. In some cases,
    it is also taking the mm->page_table_lock for the PTE updates but with
    shared page tables, it is the i_mmap_mutex that is more important.

    Unfortunately it appears to be also insufficient. Consider the following
    situation

    Process A                             Process B
    ---------                             ---------
    hugetlb_fault                         shmdt
                                          LockWrite(mmap_sem)
                                            do_munmap
                                              unmap_region
                                                unmap_vmas
                                                  unmap_single_vma
                                                    unmap_hugepage_range
                                                      Lock(i_mmap_mutex)
                                                      Lock(mm->page_table_lock)
                                                      huge_pmd_unshare/unmap tables
                                                      Unlock(mm->page_table_lock)
                                                      Unlock(i_mmap_mutex)
    huge_pte_alloc                        ...
      Lock(i_mmap_mutex)                  ...
      vma_prio_walk, find svma, spte      ...
      Lock(mm->page_table_lock)           ...
      share spte                          ...
      Unlock(mm->page_table_lock)         ...
      Unlock(i_mmap_mutex)                ...
    hugetlb_no_page

    The core of the hugetlbfs-test program used in the reproducer above:

    static void play(void *addr, size_t size)
    {
            char *a, *end = (char *)addr + size;

            for (a = addr; a < end; a += 4096)
                    *a = 0;
    }

    int main(int argc, char **argv)
    {
            key_t key = IPC_PRIVATE;
            size_t sizeA = nr_huge_page_A * huge_page_size;
            size_t sizeB = nr_huge_page_B * huge_page_size;
            int shmidA, shmidB;
            void *addrA = NULL, *addrB = NULL;
            int nr_children = 300, n = 0;

            if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                    perror("shmget:");
                    return 1;
            }

            if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
                    perror("shmat");
                    return 1;
            }
            if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                    perror("shmget:");
                    return 1;
            }

            if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
                    perror("shmat");
                    return 1;
            }

    fork_child:
            switch(fork()) {
            case 0:
                    switch (n%3) {
                    case 0:
                            play(addrA, sizeA);
                            break;
                    case 1:
                            play(addrB, sizeB);
                            break;
                    case 2:
                            break;
                    }
                    break;
            case -1:
                    perror("fork:");
                    break;
            default:
                    if (++n < nr_children)
                            goto fork_child;
                    play(addrA, sizeA);
                    break;
            }
            shmdt(addrA);
            shmdt(addrB);
            do {
                    wait(NULL);
            } while (--n > 0);
            shmctl(shmidA, IPC_RMID, NULL);
            shmctl(shmidB, IPC_RMID, NULL);
            return 0;
    }

    [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A page's hugetlb cgroup assignment and movement to the active list should
    occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup
    we will iterate the active list and find pages with NULL hugetlb cgroup
    values.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • When we fail to allocate pages from the reserve pool, hugetlb tries to
    allocate huge pages using alloc_buddy_huge_page. Add these to the active
    list. We also need to add the huge page we allocate when we soft offline
    the old page to the active list.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Add the control files for hugetlb controller

    [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
    [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Add the charge and uncharge routines for hugetlb cgroup. We do cgroup
    charging in page alloc and uncharge in compound page destructor.
    Assigning page's hugetlb cgroup is protected by hugetlb_lock.

    [liwp@linux.vnet.ibm.com: add huge_page_order check to avoid incorrect uncharge]
    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits the
    usage of hugetlb cgroup to hugepages made up of 3 or more normal pages. I
    guess that is an acceptable limitation.

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • We will use them later in hugetlb_cgroup.c

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • hugepage_activelist will be used to track currently used HugeTLB pages.
    We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
    On cgroup removal we update the page's HugeTLB cgroup to point to parent
    cgroup.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Use an mmu_gather instead of a temporary linked list for accumulating
    pages when we unmap a hugepage range.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Add an inline helper and use it in the code.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
    that VM_FAULT_* values do not exceed MAX_ERRNO. Decouple the VM_FAULT_*
    values from MAX_ERRNO (a toy sketch of the ERR_PTR encoding constraint
    follows this entry).

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
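
    A stand-alone userspace sketch of the ERR_PTR-style encoding that the
    constraint above is about. The helpers are re-created here for
    illustration only; the point is that an error value folded into a pointer
    is only recognizable as an error if it does not exceed MAX_ERRNO:

    #include <stdio.h>
    #include <stdint.h>
    #include <errno.h>

    #define MAX_ERRNO 4095

    static inline void *ERR_PTR(long error)   /* error is negative, e.g. -ENOMEM */
    {
            return (void *)(intptr_t)error;
    }

    static inline long PTR_ERR(const void *ptr)
    {
            return (long)(intptr_t)ptr;
    }

    static inline int IS_ERR(const void *ptr)
    {
            /* Only the top MAX_ERRNO values of the address space are treated
             * as encoded errors; a larger code would look like a valid pointer. */
            return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
    }

    int main(void)
    {
            void *p = ERR_PTR(-ENOMEM);

            if (IS_ERR(p))
                    printf("encoded error %ld survives the round trip\n", PTR_ERR(p));
            return 0;
    }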
     
  • This patchset implements a cgroup resource controller for HugeTLB pages.
    The controller allows limiting HugeTLB usage per control group and
    enforces the limit during page fault. Since HugeTLB doesn't support page
    reclaim, enforcing the limit at page fault time implies that the
    application will get a SIGBUS signal if it tries to access HugeTLB pages
    beyond its limit. This requires the application to know beforehand how
    many HugeTLB pages it would require for its use.

    The goal is to control how many HugeTLB pages a group of tasks can
    allocate. It can be looked at as an extension of the existing quota
    interface, which limits the number of HugeTLB pages per hugetlbfs
    superblock. HPC job schedulers require jobs to specify their resource
    requirements in the job file. Once their requirements can be met, job
    schedulers like SLURM will schedule the job. We need to make sure that
    the jobs won't consume more resources than requested; if they do, we
    should either error out or kill the application. (A minimal usage sketch
    of the controller follows this entry.)

    This patch:

    Rename max_hstate to hugetlb_max_hstate. We will be using this from other
    subsystems like hugetlb controller in later patches.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
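
    A minimal usage sketch of the controller described in this series,
    assuming a cgroup v1 hugetlb hierarchy mounted at /sys/fs/cgroup/hugetlb,
    an admin-created group named "jobA", and a 2MB huge page size; the
    hugetlb.<size>.limit_in_bytes file name follows the controller's
    documented interface, everything else is an assumption for illustration:

    #include <stdio.h>

    int main(void)
    {
            const char *limit_file =
                    "/sys/fs/cgroup/hugetlb/jobA/hugetlb.2MB.limit_in_bytes";
            unsigned long long limit = 64ULL * 2 * 1024 * 1024;  /* 64 x 2MB pages */
            FILE *f = fopen(limit_file, "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* Tasks in jobA that fault hugetlb pages beyond this limit get SIGBUS. */
            fprintf(f, "%llu\n", limit);
            fclose(f);
            return 0;
    }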
     

30 May, 2012

3 commits

  • hugetlb_reserve_pages() can be used for either normal file-backed
    hugetlbfs mappings, or MAP_HUGETLB. In the MAP_HUGETLB, semi-anonymous
    mode, there is no VMA around (a minimal MAP_HUGETLB sketch follows this
    entry). The new call to resv_map_put() assumed that there was one, and
    resulted in a NULL pointer dereference:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
    IP: vma_resv_map+0x9/0x30
    PGD 141453067 PUD 1421e1067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    ...
    Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
    RIP: vma_resv_map+0x9/0x30
    ...
    Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
    Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

    This was reported by Dave Jones, but was reproducible with the
    libhugetlbfs test cases, so shame on me for not running them in the
    first place.

    With this, the oops is gone, and the output of libhugetlbfs's
    run_tests.py is identical to plain 3.4 again.

    [ Marked for stable, since this was introduced by commit c50ac050811d
    ("hugetlb: fix resv_map leak in error path") which was also marked for
    stable ]

    Reported-by: Dave Jones
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: [2.6.32+]
    Signed-off-by: Linus Torvalds

    Dave Hansen
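
    A minimal userspace sketch of the MAP_HUGETLB ("semi-anonymous") mapping
    mode mentioned above, assuming a kernel and libc that expose MAP_HUGETLB,
    a 2MB default huge page size, and huge pages reserved in advance (e.g.
    via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LENGTH (2UL * 1024 * 1024)      /* one 2MB huge page (assumption) */

    int main(void)
    {
            /* No hugetlbfs file is opened by the caller; the kernel sets up
             * the huge page reservation for this mapping itself. */
            char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }
            memset(p, 0, LENGTH);           /* touch it to fault in the huge page */
            munmap(p, LENGTH);
            return 0;
    }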
     
  • When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
    does a resv_map_alloc(). It depends on code in hugetlbfs's
    vm_ops->close() to release that allocation.

    However, in the mmap() failure path, we do a plain unmap_region() without
    the remove_vma() which actually calls vm_ops->close().

    This is a decent fix. This leak could get reintroduced if new code (say,
    after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
    an error. But, I think it would have to unroll the reservation anyway.

    Christoph's test case:

    http://marc.info/?l=linux-mm&m=133728900729735

    This patch applies to 3.4 and later. A version for earlier kernels is at
    https://lkml.org/lkml/2012/5/22/418.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reported-by: Christoph Lameter
    Tested-by: Christoph Lameter
    Cc: Andrea Arcangeli
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The arguments f & t and fields from & to of struct file_region are
    defined as long. So use long instead of int to type the temp vars.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     

26 May, 2012

1 commit

  • The tile support for multiple-size huge pages requires tagging
    the hugetlb PTE with a "super" bit for PTEs that are multiples of
    the basic size of a pagetable span. To set that bit properly
    we need to tweak the PTE in make_huge_pte() based on the vma.

    This change provides the API for a subsequent tile-specific
    change to use.

    Reviewed-by: Hillf Danton
    Signed-off-by: Chris Metcalf

    Chris Metcalf
     

11 May, 2012

1 commit

  • Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()")
    added code to avoid a race condition by elevating the page refcount in
    hugetlb_fault() while calling hugetlb_cow().

    However, one code path in hugetlb_cow() includes an assertion that the
    page count is 1, whereas it may now also have the value 2 in this path.

    The consensus is that this BUG_ON has served its purpose, so rather than
    extending it to cover both cases, we just remove it.

    Signed-off-by: Chris Metcalf
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Hugh Dickins
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: [3.0.29+, 3.2.16+, 3.3.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

26 Apr, 2012

1 commit

  • Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
    large amounts of memory barrier related damage v3")

    Local variable "page" can be uninitialized if the nodemask from the vma
    policy does not intersect with the nodemask from the cpuset. Even if that
    doesn't happen, it is better to initialize this variable explicitly than
    to introduce a kernel oops in a weird corner case.

    mm/hugetlb.c: In function `alloc_huge_page':
    mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Apr, 2012

1 commit

  • The race is as follows:

    Suppose a multi-threaded task forks a new process (on cpu A), thus
    bumping up the ref count on all the pages. While the fork is occurring
    (and thus we have marked all the PTEs as read-only), another thread in
    the original process (on cpu B) tries to write to a huge page, taking an
    access violation from the write-protect and calling hugetlb_cow(). Now,
    suppose the fork() fails. It will undo the COW and decrement the ref
    count on the pages, so the ref count on the huge page drops back to 1.
    Meanwhile hugetlb_cow() also decrements the ref count by one on the
    original page, since the original address space doesn't need it any
    more, having copied a new page to replace the original page. This
    leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                          fault on CPU B
    =============                          ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
      if error
        break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);           ...
                                           down_read(&parent->mmap_sem);
                                           ...
                                           lock_page(page);
                                           handle COW
                                           page_mapcount(old_page) == 2
                                           alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
                                           ...
                                           fold new_page into pte
                                           page_remove_rmap(page);
                                           put_page(page);
                                           ...
                                 oops ==>  unlock_page(page);
                                           up_read(&parent->mmap_sem);

    The solution is to take an extra reference to the page while we are
    holding the lock on it.

    Signed-off-by: Chris Metcalf
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

24 Mar, 2012

1 commit


22 Mar, 2012

3 commits

  • hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
    general quota handling code, and they don't much resemble its behaviour.
    Rather than being about maintaining limits on on-disk block usage by
    particular users, they are instead about maintaining limits on in-memory
    page usage (including anonymous MAP_PRIVATE copied-on-write pages)
    associated with a particular hugetlbfs filesystem instance.

    Worse, they work by having callbacks to the hugetlbfs filesystem code from
    the low-level page handling code, in particular from free_huge_page().
    This is a layering violation of itself, but more importantly, if the
    kernel does a get_user_pages() on hugepages (which can happen from KVM
    amongst others), then the free_huge_page() can be delayed until after the
    associated inode has already been freed. If an unmount occurs at the
    wrong time, even the hugetlbfs superblock where the "quota" limits are
    stored may have been freed.

    Andrew Barry proposed a patch to fix this by having hugepages, instead of
    storing a pointer to their address_space and reaching the superblock from
    there, store pointers directly to the superblock, bumping the reference
    count as appropriate to avoid it being freed.
    Andrew Morton rejected that version, however, on the grounds that it made
    the existing layering violation worse.

    This is a reworked version of Andrew's patch, which removes the extra, and
    some of the existing, layering violation. It works by introducing the
    concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
    finite logical pool of hugepages to allocate from. hugetlbfs now creates
    a subpool for each filesystem instance with a page limit set, and a
    pointer to the subpool gets added to each allocated hugepage, instead of
    the address_space pointer used now. The subpool has its own lifetime and
    is only freed once all pages in it _and_ all other references to it (i.e.
    superblocks) are gone.

    subpools are optional - a NULL subpool pointer is taken by the code to
    mean that no subpool limits are in effect.

    Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
    quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
    http://marc.info/?l=linux-mm&m=126928970510627&w=1

    v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
    alloc_huge_page() - since it already takes the vma, it is not necessary.

    Signed-off-by: Andrew Barry
    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side (a minimal userspace sketch of this pattern follows this entry).
    This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
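
    A stand-alone userspace sketch (C11 atomics) of the read-retry pattern
    described above. It deliberately uses sequentially consistent atomics for
    simplicity rather than the lighter read-side barriers the patch is about,
    assumes a single writer, and the names are illustrative only:

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_uint seq;                          /* even: stable, odd: update in progress */
    static _Atomic unsigned long mems_allowed = 0x3; /* toy stand-in for the nodemask */

    static void update_mems_allowed(unsigned long newmask)
    {
            atomic_fetch_add(&seq, 1);               /* -> odd: update begins */
            atomic_store(&mems_allowed, newmask);
            atomic_fetch_add(&seq, 1);               /* -> even: update done */
    }

    static unsigned long get_mems_allowed(void)
    {
            unsigned int s;
            unsigned long snapshot;

            /* Retry if a writer was active or finished while we looked. */
            do {
                    s = atomic_load(&seq);
                    snapshot = atomic_load(&mems_allowed);
            } while ((s & 1) || atomic_load(&seq) != s);

            return snapshot;
    }

    int main(void)
    {
            update_mems_allowed(0x5);
            printf("mems_allowed snapshot: 0x%lx\n", get_mems_allowed());
            return 0;
    }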
     
  • When unmapping a given VM range, we could bail out if a reference page is
    supplied and is unmapped, which is a minor optimization.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton