16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a low-level interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

4 commits

  • This function is only called during initialization.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • No reason to duplicate the code of an existing macro.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
    similar data, one for file backed pages and the other for anon memory. To
    this end, this lock can also be a rwsem. In addition, there are some
    important opportunities to share the lock when there are no tree
    modifications.

    This conversion is straightforward. For now, all users take the write
    lock.
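
    As a rough illustration, the write-side helpers end up being thin
    wrappers around the rwsem (the helper names follow the
    i_mmap_[lock/unlock]_write() naming used by the next patch; the rwsem
    field name is an assumption from this series):

    static inline void i_mmap_lock_write(struct address_space *mapping)
    {
        /* tree modifications still take the lock exclusively */
        down_write(&mapping->i_mmap_rwsem);
    }

    static inline void i_mmap_unlock_write(struct address_space *mapping)
    {
        up_write(&mapping->i_mmap_rwsem);
    }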

    [sfr@canb.auug.org.au: update fremap.c]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

1 commit

    First, after flushing the TLB, there is no need to scan the PTEs from the
    start again. Second, before bailing out of the loop, the address is
    forwarded one step.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

27 Oct, 2014

1 commit

  • Current cpuset API for checking if a zone/node is allowed to allocate
    from looks rather awkward. We have hardwall and softwall versions of
    cpuset_node_allowed with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from mutex to spinlock by
    the previous patch, the softwall check function cannot sleep, and these
    precautions are no longer necessary.

    So let's simplify the API back to the single check.
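
    In other words, callers go from picking one of two variants to a single
    entry point whose behaviour is selected by the gfp mask (illustrative
    sketch; the suffixed names are assumed from the hardwall/softwall
    versions mentioned above):

    /* before: two variants of the same check */
    cpuset_node_allowed_hardwall(node, gfp_mask);
    cpuset_node_allowed_softwall(node, gfp_mask);

    /* after: one check; __GFP_HARDWALL in gfp_mask selects the strict,
     * hardwall-only behaviour */
    cpuset_node_allowed(node, gfp_mask);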

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

1 commit

  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

07 Aug, 2014

5 commits

  • It is possible for some platforms, such as powerpc to set HPAGE_SHIFT to
    0 to indicate huge pages not supported.

    When this is the case, hugetlbfs could be disabled during boot time:
    hugetlbfs: disabling because there are no supported hugepage sizes

    Then in dissolve_free_huge_pages(), order is kept at its maximum (64 on
    64-bit), and the for loop below won't end:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)

    As suggested by Naoya, below fix checks hugepages_supported() before
    calling dissolve_free_huge_pages().

    [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()]
    Signed-off-by: Li Zhong
    Acked-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Signed-off-by: David Rientjes
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     
  • They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
    passing extra2 == NULL is equivalent to infinity.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Three different interfaces alter the maximum number of hugepages for an
    hstate:

    - /proc/sys/vm/nr_hugepages for global number of hugepages of the default
    hstate,

    - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages for global number of
    hugepages for a specific hstate, and

    - /sys/kernel/mm/hugepages/hugepages-X/nr_hugepages/mempolicy for number of
    hugepages for a specific hstate over the set of allowed nodes.

    Generalize the code so that a single function handles all of these
    writes instead of duplicating the code in two different functions.

    This decreases the number of lines of code, but also reduces the size of
    .text by about half a percent since set_max_huge_pages() can be inlined.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    When returning from hugetlb_cow(), we always (1) put back the refcount
    for each referenced page -- always 'old', and 'new' if the allocation was
    successful -- and (2) retake the page table lock right before returning,
    as the caller expects. This logic can be simplified and encapsulated,
    as proposed in this patch. In addition to cleaner code, we also shave a
    few bytes off the instruction text:

    text     data    bss      dec      hex     filename
    28399    462     41328    70189    1122d   mm/hugetlb.o-baseline
    28367    462     41328    70157    1120d   mm/hugetlb.o-patched

    Passes libhugetlbfs testcases.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This function always returns 1, thus no need to check return value in
    hugetlb_cow(). By doing so, we can get rid of the unnecessary WARN_ON
    call. While this logic perhaps existed as a way of identifying future
    unmap_ref_private() mishandling, reality is it serves no apparent
    purpose.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

31 Jul, 2014

1 commit

    PG_head_mask was added into VMCOREINFO to filter huge pages in commit
    b3acc56bfe1 ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile
    still needs another symbol to filter *hugetlbfs* pages.

    If a user hopes to filter user pages, makedumpfile tries to exclude them
    by checking whether the page is anonymous, but hugetlbfs pages aren't
    anonymous even though they are also user pages.

    We know it's possible to detect them in the same way as PageHuge() does,
    so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
        if (!PageCompound(page))
            return 0;

        page = compound_head(page);
        return get_compound_page_dtor(page) == free_huge_page;
    }

    For that reason, this patch makes free_huge_page() public in order to
    export it to VMCOREINFO.

    Signed-off-by: Atsushi Kumagai
    Acked-by: Baoquan He
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     

24 Jul, 2014

1 commit

  • Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry") changed the order of
    huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
    in some workloads like hugepage-backed heap allocation via libhugetlbfs.
    This patch fixes it.

    The test program for the problem is shown below:

    $ cat heap.c
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define HPS 0x200000

    int main() {
        int i;
        char *p = malloc(HPS);
        memset(p, '1', HPS);
        for (i = 0; i < 5; i++) {
            if (!fork()) {
                memset(p, '2', HPS);
                p = malloc(HPS);
                memset(p, '3', HPS);
                free(p);
                return 0;
            }
        }
        sleep(1);
        free(p);
        return 0;
    }

    $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap

    Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry"), so is applicable to -stable kernels which
    include it.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Guillaume Morin
    Suggested-by: Guillaume Morin
    Acked-by: Hugh Dickins
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jun, 2014

1 commit

    There's a race between fork() and hugepage migration, and as a result we
    try to "dereference" a swap entry as a normal pte, causing a kernel panic.
    The cause of the problem is that copy_hugetlb_page_range() can't handle
    the "swap entry" family (migration entries and hwpoisoned entries), so
    let's fix it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

7 commits

    We already have a function named hugepages_supported(), and the similar
    name hugepage_migration_support() is a bit uncomfortable, so let's rename
    it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    alloc_huge_page() currently mixes the normal code path with error
    handling logic. This patch moves the error handling out, making the
    normal code path cleaner and reducing code duplication.

    Signed-off-by: Jianyu Zhan
    Acked-by: Davidlohr Bueso
    Reviewed-by: Michal Hocko
    Reviewed-by: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
    HugeTLB is limited to allocating hugepages whose size is less than
    MAX_ORDER order. This is so because HugeTLB allocates hugepages via the
    buddy allocator. Gigantic pages (that is, pages whose size is greater
    than MAX_ORDER order) have to be allocated at boottime.

    However, boottime allocation has at least two serious problems. First,
    it doesn't support NUMA and second, gigantic pages allocated at boottime
    can't be freed.

    This commit solves both issues by adding support for allocating gigantic
    pages during runtime. It works just like regular sized hugepages,
    meaning that the interface in sysfs is the same, it supports NUMA, and
    gigantic pages can be freed.

    For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
    gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And to free them all:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    The one problem with gigantic page allocation at runtime is that it
    can't be serviced by the buddy allocator. To overcome that problem,
    this commit scans all zones from a node looking for a large enough
    contiguous region. When one is found, it's allocated by using CMA, that
    is, we call alloc_contig_range() to do the actual allocation. For
    example, on x86_64 we scan all zones looking for a 1GB contiguous
    region. When one is found, it's allocated by alloc_contig_range().
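
    A heavily simplified sketch of that scan; alloc_contig_range() is the
    real CMA entry point, while the zone iterator and the suitability check
    are hypothetical placeholders standing in for the actual code:

    /* look for a gigantic-page-aligned pfn range that the contiguous
     * allocator can claim, scanning every zone of the target node */
    for_each_zone_in_node(zone, nid) {                        /* hypothetical */
        unsigned long pfn = ALIGN(zone->zone_start_pfn, nr_pages);

        for (; pfn + nr_pages <= zone_end_pfn(zone); pfn += nr_pages) {
            if (!pfn_range_suitable(pfn, pfn + nr_pages))     /* hypothetical */
                continue;
            /* 0 on success: the range now belongs to us */
            if (!alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE))
                return pfn_to_page(pfn);
        }
    }
    return NULL;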

    One expected issue with that approach is that such gigantic contiguous
    regions tend to vanish as runtime goes by. The best way to avoid this
    for now is to make gigantic page allocations very early during system
    boot, say from an init script. Other possible optimizations include using
    compaction, which is supported by CMA but is not explicitly used by this
    commit.

    It's also important to note the following:

    1. Gigantic pages allocated at boottime by the hugepages= command-line
    option can be freed at runtime just fine

    2. This commit adds support for gigantic pages only to x86_64. The
    reason is that I don't have access to nor experience with other archs.
    The code is arch-independent though, so it should be simple to add
    support to different archs

    3. I didn't add support for hugepage overcommit, that is allocating
    a gigantic page on demand when
    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
    think it's reasonable to do the hard and long work required for
    allocating a gigantic page at fault time. But it should be simple
    to add this if wanted

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
    The next commit will add new code which will want to call the
    for_each_node_mask_to_alloc() macro. Move it, its buddy
    for_each_node_mask_to_free() and their dependencies up in the file so the
    new code can use them. This is just code movement, no logic change.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
    Huge pages never get the PG_reserved bit set, so don't clear it.

    However, note that if the bit were mistakenly set, free_pages_check()
    would catch it.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The HugeTLB subsystem uses the buddy allocator to allocate hugepages
    during runtime. This means that hugepages allocation during runtime is
    limited to MAX_ORDER order. For archs supporting gigantic pages (that
    is, page sizes greater than MAX_ORDER), this in turn means that those
    pages can't be allocated at runtime.

    HugeTLB supports gigantic page allocation during boottime, via the boot
    allocator. To this end the kernel provides the command-line options
    hugepagesz= and hugepages=, which can be used to instruct the kernel to
    allocate N gigantic pages during boot.

    For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
    can be allocated and freed at runtime. If one wants to allocate 1G
    gigantic pages, this has to be done at boot via the hugepagesz= and
    hugepages= command-line options.

    Now, gigantic page allocation at boottime has two serious problems:

    1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
    evenly distributes boottime allocated hugepages among nodes.

    For example, suppose you have a four-node NUMA machine and want
    to allocate four 1G gigantic pages at boottime. The kernel will
    allocate one gigantic page per node.

    On the other hand, we do have users who want to be able to specify
    which NUMA node gigantic pages should be allocated from, so that they
    can place virtual machines on a specific NUMA node.

    2. Gigantic pages allocated at boottime can't be freed

    At this point it's important to observe that regular hugepages allocated
    at runtime don't have those problems. This is so because HugeTLB
    interface for runtime allocation in sysfs supports NUMA and runtime
    allocated pages can be freed just fine via the buddy allocator.

    This series adds support for allocating gigantic pages at runtime. It
    does so by allocating gigantic pages via CMA instead of the buddy
    allocator. Releasing gigantic pages is also supported via CMA. As this
    series builds on top of the existing HugeTLB interface, it makes gigantic
    page allocation and releasing just like regular sized hugepages. This
    also means that NUMA support just works.

    For example, to allocate two 1G gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And, to release all gigantic pages on the same node:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    Please, refer to patch 5/5 for full technical details.

    Finally, please note that this series is a follow up for a previous series
    that tried to extend the command-line options set to be NUMA aware:

    http://marc.info/?l=linux-mm&m=139593335312191&w=2

    During the discussion of that series it was agreed that having runtime
    allocation support for gigantic pages was a better solution.

    This patch (of 5):

    This function is going to be used by non-init code in a future
    commit.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Marcelo Tosatti
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     

07 May, 2014

1 commit

    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is not correctly setting itself up in
    this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
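
    A minimal sketch of such a helper (the in-tree definition may differ per
    architecture, e.g. when HPAGE_SHIFT is not defined at all):

    static inline bool hugepages_supported(void)
    {
        /* HPAGE_SHIFT == 0 means the platform reported no huge page sizes */
        return HPAGE_SHIFT != 0;
    }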

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

19 Apr, 2014

1 commit

    The soft lockup when freeing gigantic hugepages, fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed"), can also happen in return_unused_surplus_pages(),
    so let's fix it there as well.

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     

08 Apr, 2014

5 commits

    When I decrease the value of nr_hugepages in procfs a lot, a softlockup
    happens. It is because there is no chance of a context switch during
    this process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, hence a softlockup doesn't happen during
    that process. So it's necessary to add a context switch in the freeing
    process, just as in the allocation process, to avoid the softlockup.
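
    The shape of the fix is simply to give the scheduler a chance inside the
    freeing loop of set_max_huge_pages(); a sketch, not the exact hunk:

    while (min_count < persistent_huge_pages(h)) {
        if (!free_pool_huge_page(h, nodes_allowed, 0))
            break;
        cond_resched();   /* avoid hogging the CPU while freeing millions of pages */
    }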

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare a machine equipped with 12TB of memory right now.
    However, I confirmed that the time required was directly proportional to
    the number of hugepages being freed.

    I measured the required times on a smaller machine. It showed 130-145
    hugepages freed per millisecond.

    Number of hugepages freed     Required time (msec)     Rate (pages/msec)
    -----------------------------------------------------------------------
    10,000 pages == 20GB                 70 - 74                 135 - 142
    30,000 pages == 60GB                208 - 229                131 - 144

    At this rate, freeing 6TB of hugepages would trigger a softlockup with
    the default threshold of 20 seconds.

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
    To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.
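
    For example, a conversion of the kind made throughout mm/ (the function
    name here is purely illustrative):

    /* before */
    void __attribute__((weak)) arch_setup_foo(void);

    /* after, using the convenience macro */
    void __weak arch_setup_foo(void);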

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    huge_pte_offset() could return NULL, so we need a NULL check to avoid
    potential NULL pointer dereferences.
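
    The pattern added at the affected call sites is simply (sketch; what to
    do on NULL depends on the caller):

    pte_t *ptep = huge_pte_offset(mm, address);
    if (!ptep)
        return;    /* no page table entry here, nothing to do */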

    Signed-off-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Apr, 2014

8 commits

  • Both prep_compound_huge_page() and prep_compound_gigantic_page() are
    only called at bootstrap and can be marked as __init.

    The __SetPageTail(page) in prep_compound_gigantic_page() happening
    before page->first_page is initialized is not concerning since this is
    bootstrap.

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Joonsoo Kim
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The kernel can currently only handle a single hugetlb page fault at a
    time. This is due to a single mutex that serializes the entire path.
    This lock protects from spurious OOM errors under conditions of low
    availability of free hugepages. This problem is specific to hugepages,
    because it is normal to want to use every single hugepage in the system
    - with normal pages we simply assume there will always be a few spare
    pages which can be used temporarily until the race is resolved.

    Address this problem by using a table of mutexes, allowing a better
    chance of parallelization, where each hugepage is individually
    serialized. The hash key is selected depending on the mapping type.
    For shared ones it consists of the address space and file offset being
    faulted; while for private ones the mm and virtual address are used.
    The size of the table is selected based on a compromise of collisions
    and memory footprint of a series of database workloads.
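
    A condensed sketch of the idea; the real patch sizes the table from the
    memory/CPU configuration and lives in mm/hugetlb.c, so the names below
    are only meant to mirror the description above:

    static struct mutex *htlb_fault_mutex_table;   /* num_fault_mutexes entries */

    static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                struct address_space *mapping,
                                pgoff_t idx, unsigned long address)
    {
        unsigned long key[2];

        if (vma->vm_flags & VM_SHARED) {        /* shared: mapping + file offset */
            key[0] = (unsigned long)mapping;
            key[1] = idx;
        } else {                                /* private: mm + virtual address */
            key[0] = (unsigned long)mm;
            key[1] = address >> huge_page_shift(h);
        }

        return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) % num_fault_mutexes;
    }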

    Large database workloads that make heavy use of hugepages can be
    particularly exposed to this issue, causing start-up times to be
    painfully slow. This patch reduces the startup time of a 10 Gb Oracle
    DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads
    will naturally benefit even more.

    NOTE:
    The only downside to this patch, detected by Joonsoo Kim, is that a
    small race is possible in private mappings: A child process (with its
    own mm, after cow) can instantiate a page that is already being handled
    by the parent in a cow fault. When low on pages, this can trigger
    spurious OOMs. I have not been able to think of an efficient way of
    handling this... but do we really care about such a tiny window? We
    already maintain another theoretical race with normal pages. If not, one
    possible way is to maintain a single hash entry for private mappings --
    any workloads that *really* suffer from this scaling problem should
    already use shared mappings.

    [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
    Signed-off-by: Davidlohr Bueso
    Cc: Aneesh Kumar K.V
    Cc: David Gibson
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Until now, we get a resv_map in two different ways depending on the
    mapping type. This makes the code messy and hard to read. Unify it.

    [davidlohr@hp.com: code cleanups]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation patch to unify the use of vma_resv_map()
    regardless of the map type. This patch prepares it by removing
    resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
    for all resv_maps.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    There is a race condition if we map the same file in different processes.
    Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
    When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only
    mmap_sem (exclusively). This doesn't prevent other tasks from modifying
    the region structure, so it can be modified by two processes
    concurrently.

    To solve this, introduce a spinlock to resv_map and make the region
    manipulation functions grab it before they do the actual work.
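
    The shape of the change, roughly (a sketch; only the fields relevant here
    are shown, and the body of the region helper is elided):

    struct resv_map {
        struct kref refs;
        spinlock_t lock;            /* new: protects the regions list */
        struct list_head regions;
    };

    static long region_add(struct resv_map *resv, long f, long t)
    {
        long chg = 0;

        spin_lock(&resv->lock);
        /* ... walk, merge and account regions under the lock ... */
        spin_unlock(&resv->lock);

        return chg;
    }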

    [davidlohr@hp.com: updated changelog]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Joonsoo Kim
    Suggested-by: Joonsoo Kim
    Acked-by: David Gibson
    Cc: David Gibson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    To change the protection scheme for region tracking to a fine-grained
    one, we pass the resv_map, instead of a list_head, to the region
    manipulation functions.

    This doesn't introduce any functional change, and it is just for
    preparing a next step.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Currently, to track reserved and allocated regions, we use two different
    ways, depending on the mapping. For MAP_SHARED, we use the
    address_mapping's private_list, while for MAP_PRIVATE, we use a
    resv_map.

    Now, we are preparing to change the coarse grained lock which protects a
    region structure to a fine grained one, and this difference hinders it.
    So, before changing it, unify region structure handling, consistently
    using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively fast paths.

    Since the naming, get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
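
    The resulting usage pattern mirrors an ordinary seqcount read side; a
    sketch, with the allocation step itself being a hypothetical placeholder:

    unsigned int cpuset_mems_cookie;
    struct page *page;

    do {
        cpuset_mems_cookie = read_mems_allowed_begin();
        page = try_this_allocation();          /* hypothetical */
        /* only retry when we failed AND the allowed mems changed meanwhile */
    } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));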

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
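
    A sketch of what the macro amounts to (only active under CONFIG_DEBUG_VM;
    the exact dump helper and its arguments may differ):

    #define VM_BUG_ON_PAGE(cond, page)                              \
        do {                                                        \
            if (unlikely(cond)) {                                   \
                dump_page(page);  /* flags, mapping, refcounts */   \
                BUG();                                              \
            }                                                       \
        } while (0)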

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin