05 Jun, 2014

40 commits

  • kmem_cache_{create,destroy,shrink} need to get a stable value of the
    cpu/node online mask, because they init/destroy/access per-cpu/per-node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of a race, as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves this purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some undesirable limitations on locking order and
    can't be used just like get_online_cpus. That's why in patch 1 I
    substitute it with get/put_online_mems, which work exactly like
    get/put_online_cpus except that they block memory hotplug rather than
    cpu hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of the online
    nodes mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules, which narrows down the set of its usage
    scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus but for memory hotplug, i.e. executing code inside
    a get/put_online_mems section will guarantee a stable value of online
    nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
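
    A minimal usage sketch of the new interface described above (the header
    placement and the helper function are illustrative assumptions, not code
    from the patch):

      #include <linux/memory_hotplug.h>   /* assumed home of get/put_online_mems() */
      #include <linux/nodemask.h>

      static void walk_stable_online_nodes(void)
      {
              int nid;

              get_online_mems();      /* blocks memory hotplug, like get_online_cpus() */
              for_each_online_node(nid) {
                      /* the online nodes mask and per-node data are stable here */
              }
              put_online_mems();      /* re-enables memory hotplug */
      }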

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It is only used in slab and should not be used anywhere else, so there
    is no need to export it.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • pgdat->reclaim_nodes tracks whether a remote node is allowed to be
    reclaimed by zone_reclaim due to its distance. As zone_reclaim_mode is
    expected to be rarely enabled, it is unreasonable for all machines to
    take the penalty. Fortunately, the zone_reclaim_mode() path is already
    slow and it is the path that takes the hit.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When it was introduced, zone_reclaim_mode made sense because NUMA
    distances were punishing and workloads were generally partitioned to fit
    into a NUMA node. NUMA machines are now common, but few workloads are
    NUMA-aware, and it's routine to see major performance degradation due to
    zone_reclaim_mode being enabled, yet relatively few people can identify
    the problem.

    Those that require zone_reclaim_mode are likely to be able to detect
    when it needs to be enabled and tune appropriately, so let's have a
    sensible default for the bulk of users.

    This patch (of 2):

    zone_reclaim_mode causes processes to prefer reclaiming memory from the
    local node instead of spilling over to other nodes. This made sense
    initially when NUMA machines were almost exclusively HPC and the
    workload was partitioned into nodes. The NUMA penalties were
    sufficiently high to justify reclaiming the memory. On current machines
    and workloads it is often the case that zone_reclaim_mode destroys
    performance but not all users know how to detect this. Favour the
    common case and disable it by default. Users that are sophisticated
    enough to know they need zone_reclaim_mode will detect it.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Zhang Yanfei
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • HugeTLB is limited to allocating hugepages whose size is less than
    MAX_ORDER order. This is because HugeTLB allocates hugepages via the
    buddy allocator. Gigantic pages (that is, pages whose size is greater
    than MAX_ORDER order) have to be allocated at boottime.

    However, boottime allocation has at least two serious problems. First,
    it doesn't support NUMA and second, gigantic pages allocated at boottime
    can't be freed.

    This commit solves both issues by adding support for allocating gigantic
    pages during runtime. It works just like regular sized hugepages,
    meaning that the interface in sysfs is the same, it supports NUMA, and
    gigantic pages can be freed.

    For example, on x86_64 gigantic pages are 1GB in size. To allocate two
    1G gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And to free them all:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    The one problem with gigantic page allocation at runtime is that it
    can't be serviced by the buddy allocator. To overcome that problem,
    this commit scans all zones of a node looking for a large enough
    contiguous region. When one is found, it's allocated using CMA, that
    is, we call alloc_contig_range() to do the actual allocation. For
    example, on x86_64 we scan all zones looking for a 1GB contiguous
    region. When one is found, it's allocated by alloc_contig_range().
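
    A rough sketch of that allocation path (the function and variable names
    are illustrative, not the exact ones added by the commit):

      /* Illustrative only: walk a zone in gigantic-page-sized steps and try
       * to grab one step as a single contiguous block via CMA's allocator. */
      static struct page *try_alloc_gigantic(struct zone *z, unsigned int order)
      {
              unsigned long nr_pages = 1UL << order;
              unsigned long pfn;

              for (pfn = ALIGN(z->zone_start_pfn, nr_pages);
                   pfn + nr_pages <= zone_end_pfn(z); pfn += nr_pages) {
                      /* alloc_contig_range() migrates/reclaims pages in
                       * [pfn, pfn + nr_pages) and returns 0 on success. */
                      if (!alloc_contig_range(pfn, pfn + nr_pages,
                                              MIGRATE_MOVABLE))
                              return pfn_to_page(pfn);
              }
              return NULL;
      }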

    One expected issue with that approach is that such gigantic contiguous
    regions tend to vanish as the system runs. The best way to avoid this
    for now is to make gigantic page allocations very early during system
    boot, say from an init script. Other possible optimizations include
    using compaction, which is supported by CMA but is not explicitly used
    by this commit.

    It's also important to note the following:

    1. Gigantic pages allocated at boottime by the hugepages= command-line
    option can be freed at runtime just fine

    2. This commit adds support for gigantic pages only to x86_64. The
    reason is that I don't have access to nor experience with other archs.
    The code is arch-independent though, so it should be simple to add
    support for different archs

    3. I didn't add support for hugepage overcommit, that is, allocating
    a gigantic page on demand when
    /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
    think it's reasonable to do the hard and long work required for
    allocating a gigantic page at fault time. But it should be simple
    to add this if wanted.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Next commit will add new code which will want to call
    for_each_node_mask_to_alloc() macro. Move it, its buddy
    for_each_node_mask_to_free() and their dependencies up in the file so the
    new code can use them. This is just code movement, no logic change.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Hugepages never get the PG_reserved bit set, so don't clear it.

    However, note that if the bit gets mistakenly set, free_pages_check()
    will catch it.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The HugeTLB subsystem uses the buddy allocator to allocate hugepages
    during runtime. This means that hugepage allocation during runtime is
    limited to MAX_ORDER order. For archs supporting gigantic pages (that
    is, page sizes greater than MAX_ORDER), this in turn means that those
    pages can't be allocated at runtime.

    HugeTLB supports gigantic page allocation during boottime, via the boot
    allocator. To this end the kernel provides the command-line options
    hugepagesz= and hugepages=, which can be used to instruct the kernel to
    allocate N gigantic pages during boot.

    For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
    can be allocated and freed at runtime. If one wants to allocate 1G
    gigantic pages, this has to be done at boot via the hugepagesz= and
    hugepages= command-line options.

    Now, gigantic page allocation at boottime has two serious problems:

    1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
    evenly distributes boottime allocated hugepages among nodes.

    For example, suppose you have a four-node NUMA machine and want
    to allocate four 1G gigantic pages at boottime. The kernel will
    allocate one gigantic page per node.

    On the other hand, we do have users who want to be able to specify
    which NUMA node gigantic pages should be allocated from, so that they
    can place virtual machines on a specific NUMA node.

    2. Gigantic pages allocated at boottime can't be freed

    At this point it's important to observe that regular hugepages allocated
    at runtime don't have those problems. This is because the HugeTLB sysfs
    interface for runtime allocation supports NUMA, and runtime-allocated
    pages can be freed just fine via the buddy allocator.

    This series adds support for allocating gigantic pages at runtime. It
    does so by allocating gigantic pages via CMA instead of the buddy
    allocator. Releasing gigantic pages is also supported via CMA. As this
    series builds on top of the existing HugeTLB interface, it makes
    allocating and releasing gigantic pages work just like regular sized
    hugepages. This also means that NUMA support just works.

    For example, to allocate two 1G gigantic pages on node 1, one can do:

    # echo 2 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    And, to release all gigantic pages on the same node:

    # echo 0 > \
    /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

    Please, refer to patch 5/5 for full technical details.

    Finally, please note that this series is a follow up for a previous series
    that tried to extend the command-line options set to be NUMA aware:

    http://marc.info/?l=linux-mm&m=139593335312191&w=2

    During the discussion of that series it was agreed that having runtime
    allocation support for gigantic pages was a better solution.

    This patch (of 5):

    This function is going to be used by non-init code in a future
    commit.

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: Marcelo Tosatti
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
    PTR_ERR_OR_ZERO.
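
    The pattern in question, as a generic sketch (the pointer name is
    illustrative):

      /* Before: open-coded IS_ERR()/PTR_ERR() pair */
      if (IS_ERR(p))
              return PTR_ERR(p);
      return 0;

      /* After: the equivalent helper from <linux/err.h> */
      return PTR_ERR_OR_ZERO(p);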

    Signed-off-by: Duan Jiong
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duan Jiong
     
  • Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It seems we all agree that information about SECTIONs, e.g. section size
    and sections per memory block, should be kept as kernel internals and
    not exposed to userspace.

    This patch updates Documentation/memory-hotplug.txt to refer to memory
    blocks instead of memory sections where appropriate and adds a
    paragraph to explain that memory blocks are made of memory sections.
    The documentation update was mostly provided by Nathan.

    Also, end_phys_index in the code is actually not the end section id but
    the end memory block id, which should always be the same as phys_index,
    so it is removed here.

    Signed-off-by: Li Zhong
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhong
     
  • I recently added a patch to let folks pass a "reason" string to
    dump_page(), which gets dumped out along with the page's data. This
    essentially saves the bug reader a trip into the source to figure out
    why we BUG_ON()'d.

    The new VM_BUG_ON_PAGE() passes in NULL for "reason". It seems like we
    might as well pass the BUG_ON() condition if we have it. This will
    bloat kernels a bit with ~160 new strings, but this is all under a
    debugging option anyway.

    page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0
    page flags: 0xbfffc0000000001(locked)
    page dumped because: VM_BUG_ON_PAGE(PageLocked(page))
    ------------[ cut here ]------------
    kernel BUG at /home/davehans/linux.git/mm/filemap.c:464!
    invalid opcode: 0000 [#1] SMP
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ...
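
    An approximate sketch of the resulting macro (the real definition lives
    in include/linux/mmdebug.h; treat the exact shape as an assumption):

      #include <linux/stringify.h>

      /* The stringified condition becomes the "reason" handed to dump_page(). */
      #define VM_BUG_ON_PAGE(cond, page)                                      \
              do {                                                            \
                      if (unlikely(cond)) {                                   \
                              dump_page(page,                                 \
                                        "VM_BUG_ON_PAGE(" __stringify(cond) ")"); \
                              BUG();                                          \
                      }                                                       \
              } while (0)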

    [akpm@linux-foundation.org: include stringify.h]
    Signed-off-by: Dave Hansen
    Acked-by: Kirill A. Shutemov
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Per-memcg swappiness and oom killing can currently not be tweaked on a
    memcg that is part of a hierarchy, but not the root of that hierarchy.
    Users have complained that they can't configure this when they turned on
    hierarchy mode. In fact, with hierarchy mode becoming the default, this
    restriction disables the tunables entirely.

    But there is no good reason for this restriction. The settings for
    swappiness and OOM killing are taken from whichever memcg's limit
    triggered reclaim and OOM invocation, regardless of its position in the
    hierarchy tree.

    Allow setting swappiness on any group. The knob on the root memcg
    already reads the global VM swappiness; make it writable as well.

    Allow disabling the OOM killer on any non-root memcg.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory obtained via mempool_alloc is not always zeroed even when
    called with __GFP_ZERO. Add a note and VM_BUG_ON statement to make
    that clear.
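
    A sketch of the resulting check at the top of mempool_alloc() (the note
    below switches it from VM_BUG_ON to VM_WARN_ON_ONCE):

      /* Elements recycled from the pool's reserve bypass the underlying
       * allocator, so __GFP_ZERO cannot be honoured; warn callers who ask. */
      VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);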

    [akpm@linux-foundation.org: use VM_WARN_ON_ONCE]
    Signed-off-by: Sebastian Ott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Ott
     
  • Add VM_WARN_ON() and VM_WARN_ON_ONCE(): variants of WARN_ON() and
    WARN_ON_ONCE() that are dependent on CONFIG_DEBUG_VM.
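
    A sketch of the additions to include/linux/mmdebug.h (the
    !CONFIG_DEBUG_VM fallback is assumed to compile the condition away, as
    the existing VM_BUG_ON variants do):

      #ifdef CONFIG_DEBUG_VM
      #define VM_WARN_ON(cond)        WARN_ON(cond)
      #define VM_WARN_ON_ONCE(cond)   WARN_ON_ONCE(cond)
      #else
      #define VM_WARN_ON(cond)        BUILD_BUG_ON_INVALID(cond)
      #define VM_WARN_ON_ONCE(cond)   BUILD_BUG_ON_INVALID(cond)
      #endif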

    Cc: Sebastian Ott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It was using a mix of pr_foo() and printk(KERN_ERR ...).

    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It doesn't make sense to have two assert checks for each invariant: one
    for printing and one for BUG().

    Let's trigger BUG() if we print an error message.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • dma_generic_alloc_coherent() first attempts to allocate with
    dma_alloc_from_contiguous() if CONFIG_DMA_CMA is enabled. But the
    memory region allocated that way may not fit within the device's DMA
    mask. This change makes it fall back to the usual alloc_pages_node()
    allocation in such cases.
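
    A sketch of the added check in dma_generic_alloc_coherent() (shape
    approximate; page, count, size and dma_mask are assumed to be the
    function's existing locals):

      page = dma_alloc_from_contiguous(dev, count, get_order(size));
      if (page && page_to_phys(page) + size > dma_mask) {
              /* CMA gave us memory the device cannot address: release it
               * and fall through to the normal alloc_pages_node() path. */
              dma_release_from_contiguous(dev, page, count);
              page = NULL;
      }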

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Currently, "cma=" kernel parameter is used to specify the size of CMA,
    but we can't specify where it is located. We want to locate CMA below
    4GB for devices only supporting 32-bit addressing on 64-bit systems
    without iommu.

    This enables to specify the placement of CMA by extending "cma=" kernel
    parameter.

    Examples:
    1. locate 64MB of CMA below 4GB with "cma=64M@0-4G"
    2. locate 64MB of CMA exactly at 512MB with "cma=64M@512M"

    Note that the DMA contiguous memory allocator on x86 assumes that
    page_address() works for the pages to allocate. So this change limits
    the end address of the contiguous memory area to max_pfn_mapped, via
    the argument of dma_contiguous_reserve(), to prevent it from being
    placed in the highmem area.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This introduces memblock_alloc_range(), which allocates memory from
    memblock within the specified range of physical addresses. I would like
    to use this function to specify the location of CMA.
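
    A usage sketch (the signature follows the description above; the size,
    alignment and range values are illustrative):

      /* Ask memblock for 64MB, aligned to 1MB, anywhere below 4GB;
       * a return value of 0 means no suitable range was found. */
      phys_addr_t base = memblock_alloc_range(64ULL << 20, 1UL << 20,
                                              0, 4ULL << 30);
      if (!base)
              pr_warn("no free range below 4GB\n");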

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This adds support for the DMA Contiguous Memory Allocator to
    intel-iommu. This change enables dma_alloc_coherent() to allocate large
    contiguous memory.

    It is achieved in the same way nommu_dma_ops currently does it, i.e.
    memory is first allocated with dma_alloc_from_contiguous(), with
    alloc_pages() used as a fallback.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • DMA Contiguous Memory Allocator support on x86 is disabled when the
    swiotlb config option is enabled, so DMA CMA is always disabled on
    x86_64 because swiotlb is always enabled there. This attempts to
    support DMA CMA with the swiotlb config option enabled.

    The contiguous memory allocator on x86 is integrated into
    dma_generic_alloc_coherent(), which is the .alloc callback in
    nommu_dma_ops for dma_alloc_coherent().

    x86_swiotlb_alloc_coherent(), the .alloc callback in swiotlb_dma_ops,
    first tries to allocate with dma_generic_alloc_coherent() and then
    calls swiotlb_alloc_coherent() as a fallback.

    The main part of supporting DMA CMA with swiotlb is changing
    x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops for
    dma_free_coherent(), so that it can distinguish memory allocated by
    dma_generic_alloc_coherent() from memory allocated by
    swiotlb_alloc_coherent() and release it with
    dma_generic_free_coherent(), which can handle contiguous memory. This
    change requires making is_swiotlb_buffer() a global function.

    This also needs to change the .free callback in the dma_map_ops for
    amd_gart and sta2x11, because these dma_ops also use
    dma_generic_alloc_coherent().
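
    A sketch of the free-path dispatch described above (shape approximate;
    the 2014-era .free callback signature with struct dma_attrs is assumed):

      static void x86_swiotlb_free_coherent(struct device *dev, size_t size,
                                            void *vaddr, dma_addr_t dma_addr,
                                            struct dma_attrs *attrs)
      {
              /* swiotlb bounce-buffer memory goes back to swiotlb; anything
               * else (including CMA memory) takes the generic path. */
              if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
                      swiotlb_free_coherent(dev, size, vaddr, dma_addr);
              else
                      dma_generic_free_coherent(dev, size, vaddr, dma_addr,
                                                attrs);
      }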

    Signed-off-by: Akinobu Mita
    Acked-by: Marek Szyprowski
    Acked-by: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patchset enhances the DMA Contiguous Memory Allocator on x86.

    Currently DMA CMA is only supported with the pci-nommu dma_map_ops, and
    furthermore it can't be enabled on x86_64. But I would like to allocate
    large contiguous memory with dma_alloc_coherent() and hand it to a
    device that requires it, regardless of which dma mapping implementation
    is actually used in the system.

    So this makes it work with the swiotlb and intel-iommu dma_map_ops,
    too. It also extends the "cma=" kernel parameter to specify a placement
    constraint by the physical address range of memory allocations. For
    example, "cma=64M@0-4G" makes CMA allocate memory below 4GB, which is
    required for devices that only support 32-bit addressing on 64-bit
    systems without an iommu.

    This patch (of 5):

    Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.

    But when the contiguous memory allocator (CMA) is enabled on x86 and
    the memory region is allocated by dma_alloc_from_contiguous(), it
    doesn't return zeroed memory, because dma_generic_alloc_coherent()
    forgets to fill the memory region with zeros when it was allocated by
    dma_alloc_from_contiguous().

    Most implementations of dma_alloc_coherent() return zeroed memory
    regardless of whether __GFP_ZERO is specified. So this fixes it by
    unconditionally zeroing the allocated memory region.
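
    A sketch of the fix (placement and local names approximate):

      /* At the end of dma_generic_alloc_coherent(): zero the region
       * unconditionally, so CMA-backed allocations are cleared as well. */
      memset(page_address(page), 0, size);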

    Alternatively, we could fix dma_alloc_from_contiguous() to return
    zeroed-out memory and remove memset() from all callers of it. But we
    can't simply remove the memset on arm, because __dma_clear_buffer() is
    used there to ensure cache flushing and it is used in many places. Of
    course we could do a redundant memset in dma_alloc_from_contiguous(),
    but I think this patch has less impact as a fix for this problem.

    Signed-off-by: Akinobu Mita
    Cc: Marek Szyprowski
    Cc: Konrad Rzeszutek Wilk
    Cc: David Woodhouse
    Cc: Don Dutile
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • For single-threaded workloads, we can avoid flushing and iterating
    through the entire list of tasks, making the whole function a lot
    faster and requiring only a single atomic read of mm_users.

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
    hit rate -- exported in /proc/vmstat.

    Any update to the caching scheme needs this kind of data, so this saves
    some work re-implementing the counting each time.

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Prior to this change, we would decide whether to force scan an LRU
    during reclaim based on whether that LRU itself was too small for the
    current priority. However, this can lead to the file LRU getting force
    scanned even if there are a lot of anonymous pages we can reclaim,
    leading to hot file pages getting needlessly reclaimed.

    To address this, we instead only force scan when none of the reclaimable
    LRUs are big enough.

    This gives huge improvements with zswap. For example, when doing a -j20
    kernel build in a 500MB container with zswap enabled, runtime (in
    seconds) is greatly reduced:

    x without this change
    + with this change
    N Min Max Median Avg Stddev
    x 5 700.997 790.076 763.928 754.05 39.59493
    + 5 141.634 197.899 155.706 161.9 21.270224
    Difference at 95.0% confidence
    -592.15 +/- 46.3521
    -78.5293% +/- 6.14709%
    (Student's t, pooled s = 31.7819)

    Should also give some improvements in regular (non-zswap) swap cases.

    Yes, hughd found significant speedup using regular swap, with several
    memcgs under pressure; and it should also be effective in the non-memcg
    case, whenever one or another zone LRU is forced too small.

    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Yuanhan Liu
    Cc: Seth Jennings
    Cc: Bob Liu
    Cc: Minchan Kim
    Cc: Luigi Semenzato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suleiman Souhlal
     
  • clear_refs_write() is called earlier than clear_soft_dirty(), and it is
    more natural to clear VM_SOFTDIRTY (which belongs to the VMA entry, not
    the PTEs) that early instead of clearing it deeper inside the call
    chain.
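
    A sketch of the move (inside clear_refs_write()'s VMA walk; the
    surrounding control flow is elided and the variable names are
    approximate):

      /* Drop the per-VMA flag up front, rather than deep inside
       * clear_soft_dirty() during the page-table walk. */
      if (type == CLEAR_REFS_SOFT_DIRTY)
              vma->vm_flags &= ~VM_SOFTDIRTY;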

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • pte_file_mksoft_dirty takes its argument by value and returns the
    modified result, thus we need to assign @ptfile here; otherwise it is a
    no-op, which may lead to loss of the soft-dirty bit.
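
    The fix in sketch form:

      /* Before (a no-op): the modified pte returned by value is discarded */
      pte_file_mksoft_dirty(ptfile);

      /* After: assign the result back so the soft-dirty bit actually sticks */
      ptfile = pte_file_mksoft_dirty(ptfile);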

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Hugh reported:

    | I noticed your soft_dirty work in install_file_pte(): which looked
    | good at first, until I realized that it's propagating the soft_dirty
    | of a pte it's about to zap completely, to the unrelated entry it's
    | about to insert in its place. Which seems very odd to me.

    Indeed this code ends up being a nop -- pte_file_mksoft_dirty()
    operates on a pte_t argument and returns a new pte_t which is never
    used afterwards. After looking further, I think what we need is to
    soft-dirtify all newly remapped file pages, because they should look
    like a new mapping to the memory tracker.

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Currently, to allocate a page that should be charged to kmemcg (e.g.
    threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
    page allocated is then to be freed by free_memcg_kmem_pages. Apart from
    looking asymmetrical, this also requires intrusion into the general
    allocation path. So let's introduce separate functions that will
    alloc/free pages charged to kmemcg.

    The new functions are called alloc_kmem_pages and free_kmem_pages. They
    should be used when the caller actually would like to use kmalloc, but
    has to fall back to the page allocator because the allocation is large.
    They only differ from alloc_pages and free_pages in that besides
    allocating or freeing pages they also charge them to the kmem resource
    counter of the current memory cgroup.
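
    A usage sketch (the signatures follow the description above and should
    be treated as approximate):

      /* Allocate a 4-page (order-2) buffer charged to the current memcg's
       * kmem counter, then free and uncharge it. */
      struct page *page = alloc_kmem_pages(GFP_KERNEL, 2);

      if (page) {
              void *buf = page_address(page);
              /* ... use buf ... */
              free_kmem_pages((unsigned long)buf, 2);
      }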

    [sfr@canb.auug.org.au: export kmalloc_order() to modules]
    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have only a few places where we actually want to charge kmem, so
    instead of intruding into the general page allocation path with
    __GFP_KMEMCG it's better to explicitly charge kmem there. All kmem
    charges will be easier to follow that way.

    This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
    from memcg caches' allocflags. Instead, it makes the slab allocation
    path call memcg_charge_kmem directly, getting the memcg to charge from
    the cache's memcg params.

    This also eliminates any possibility of misaccounting an allocation
    going from one memcg's cache to another memcg, because now we always
    charge slabs against the memcg the cache belongs to. That's why this
    patch removes the big comment above memcg_kmem_get_cache.

    Signed-off-by: Vladimir Davydov
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
    got bumped in that exit path. Now there are two, and a bunch of gotos.
    ALLOC_SLOWPATH can now get bumped more than once during a single call
    to __slab_alloc(), which is pretty bogus. Here's the sequence:

    1. Enter __slab_alloc(), fall through all the way to the
    stat(s, ALLOC_SLOWPATH);
    2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
    new_slab (goto #1)
    3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
    (goto #2)
    4. Fall through in the same path we did before all the way to
    stat(s, ALLOC_SLOWPATH)
    5. bump ALLOC_REFILL stat, then return

    Doing this is obviously bogus. It keeps us from being able to
    accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
    that the total number of allocs always exceeds the total number of
    frees.

    This patch moves the stat(s, ALLOC_SLOWPATH) call so that it is made
    from the same place that __slab_alloc() is called. This makes it much
    less likely that ALLOC_SLOWPATH will get botched again in the spaghetti
    code inside __slab_alloc().

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • When the slab or slub allocators cannot allocate additional slab pages,
    they emit diagnostic information to the kernel log such as current
    number of slabs, number of objects, active objects, etc. This is always
    coupled with a page allocation failure warning since it is controlled by
    !__GFP_NOWARN.

    Suppress this out-of-memory warning if the allocator is configured
    without debug support. The page allocation failure warning will already
    indicate that it is a failed slab allocation, the order, and the gfp
    mask, so the extra output is only useful for diagnosing allocator
    issues.

    Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
    allocator, there is no functional change with this patch. If debug is
    disabled, however, the warnings are now suppressed.

    Signed-off-by: David Rientjes
    Cc: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Inspired by Joe Perches' suggestion in the ntfs logging clean-up.

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • All printk(KERN_foo ...) calls converted to pr_foo().

    Default printk() calls converted to pr_warn().

    Format fragments coalesced.
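
    An illustrative before/after (the messages and variable are made up):

      /* Before: explicit level, and one stray message without a level */
      printk(KERN_ERR "slab: cache %s: something went wrong\n", name);
      printk("stray message without a level\n");

      /* After */
      pr_err("slab: cache %s: something went wrong\n", name);
      pr_warn("stray message without a level\n");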

    Signed-off-by: Fabian Frederick
    Acked-by: Christoph Lameter
    Cc: Joe Perches
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • On a system with 2TiB of RAM, current x86_64 has a 128M section size,
    and one memory_block includes only one section, so there will be 16400
    entries under /sys/devices/system/memory/.

    The current code tries to use the block id to find the block pointer in
    /sys for any section, and reuse that block pointer. That lookup takes
    some time even after commit 7c243c7168dc ("mm: speedup in
    __early_pfn_to_nid"), which skips the search in that case during boot.

    So the solution could be to increase the block size, just like the SGI
    UV system did (hard coded to 2g).

    This patch probes the block size to make it match the mmio remap size.
    For example, Intel Nehalem and later systems will have the memory
    ranges [0, TOML), [4g, TOMH]. If the memory hole is 2g and the total is
    128g, TOM will be 2g and TOM2 will be 130g.

    We could use 2g as the block size instead of the default 128M. That
    will reduce the number of entries in /sys/devices/system/memory/.

    On a 6TiB system this will reduce boot time by 35 seconds.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
    faults on x86. Care is taken such that _PAGE_NUMA is used only in
    situations where the VMA flags distinguish between NUMA hinting faults
    and prot_none faults. This decision was x86-specific, and conceptually
    it is difficult, requiring special casing to distinguish between
    PROTNONE and NUMA ptes based on context.

    Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
    between an entry that is really unmapped and a page that is protected
    for NUMA hinting faults, since if the PTE is not present a fault will
    be trapped either way.

    Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
    This patch shrinks the maximum possible swap size and uses the bit to
    uniquely distinguish between NUMA hinting ptes and swap ptes.

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • 32-bit support for NUMA is an oddity on its own, but with automatic NUMA
    balancing on top there is a reasonable risk that the CPUPID information
    cannot be stored in the page flags. This patch removes support for
    automatic NUMA balancing on 32-bit x86.

    Signed-off-by: Mel Gorman
    Cc: David Vrabel
    Cc: Ingo Molnar
    Cc: Peter Anvin
    Cc: Fengguang Wu
    Cc: Linus Torvalds
    Cc: Steven Noonan
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Srikar Dronamraju
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Description by Jan Kara:
    "A lot of older filesystems don't properly flush volatile disk caches
    on fsync(2) which can lead to loss of fsynced data after power failure.

    This patch makes generic_file_fsync() issue a proper cache flush to fix
    the problem. A sysadmin can use /sys/devices/.../cache_type to tell the
    system it should not send the cache flush."
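
    A sketch of the added step at the end of generic_file_fsync() (error
    handling elided; the call shape reflects the blkdev_issue_flush()
    signature of that era and should be treated as approximate):

      /* After writing back data and the inode, flush the device's volatile
       * write cache so the fsynced data survives a power failure. */
      err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);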

    [akpm@linux-foundation.org: nuke ifdef]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Fabian Frederick
    Suggested-by: Jan Kara
    Suggested-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick