11 Nov, 2015

1 commit

  • In commit a1c34a3bf00a ("mm: Don't offset memmap for flatmem") Laura
    fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM
    IFC6410 board.

    One small wrinkle on ia64 is that it allocates the node_mem_map earlier
    in arch code, so it skips the block of code where "offset" is
    initialized.

    Move initialization of start and offset before the check for the
    node_mem_map so that they will always be available in the latter part of
    the function.

    Tested-by: Laura Abbott
    Fixes: a1c34a3bf00a ("mm: Don't offset memmap for flatmem")
    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Tony Luck
     

07 Nov, 2015

11 commits

  • Let's try to be consistent about the data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that a compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how PageTail() and compound_head() are
    encoded within struct page, so that both can be updated in one shot.

    The patch introduces page->compound_head in the third double-word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail()
    and, if it is set, the remaining bits are a pointer to the head page.
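
    A minimal sketch of the resulting helpers, assuming this encoding (the
    exact mainline names and annotations may differ slightly; READ_ONCE() /
    WRITE_ONCE() are used so the tail bit and head pointer are read and
    written in one shot):

        static inline struct page *compound_head(struct page *page)
        {
                unsigned long head = READ_ONCE(page->compound_head);

                /* Bit 0 set: this is a tail page, the rest is the head. */
                if (unlikely(head & 1))
                        return (struct page *)(head - 1);
                return page;
        }

        static inline int PageTail(struct page *page)
        {
                return READ_ONCE(page->compound_head) & 1;
        }

        static inline void set_compound_head(struct page *page, struct page *head)
        {
                WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
        }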

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's out of
    the scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of that bit and we
    could get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and compound_order
    in struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for destructor
    lookup and store its index in struct page instead of a direct pointer to
    the destructor. It shouldn't be much trouble to maintain the table: we
    currently have only two destructors plus NULL.

    This patch frees up one word in tail pages for reuse. This is preparation
    for the next patch.
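
    A sketch of the table scheme, with the enum and table names assumed to
    match the destructor table described above:

        typedef void compound_page_dtor(struct page *);

        /* Keep this enum in sync with compound_page_dtors[]. */
        enum compound_dtor_id {
                NULL_COMPOUND_DTOR,
                COMPOUND_PAGE_DTOR,
                HUGETLB_PAGE_DTOR,
                NR_COMPOUND_DTORS,
        };
        extern compound_page_dtor * const compound_page_dtors[];

        static inline void set_compound_page_dtor(struct page *page,
                        enum compound_dtor_id compound_dtor)
        {
                VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
                page[1].compound_dtor = compound_dtor;  /* an index, not a pointer */
        }

        static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
        {
                VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
                return compound_page_dtors[page[1].compound_dtor];
        }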

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The primary purpose of watermarks is to ensure that reclaim can always
    make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
    These assume that order-0 allocations are all that is necessary for
    forward progress.

    High-order watermarks serve a different purpose. Kswapd had no high-order
    awareness before they were introduced
    (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
    particularly important when there were high-order atomic requests. The
    watermarks both gave kswapd awareness and made a reserve for those atomic
    requests.

    There are two important side-effects of this. The most important is that
    a non-atomic high-order request can fail even though free pages are
    available and the order-0 watermarks are ok. The second is that
    high-order watermark checks are expensive as the free list counts up to
    the requested order must be examined.

    With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
    have high-order watermarks. Kswapd and compaction still need high-order
    awareness which is handled by checking that at least one suitable
    high-order page is free.
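
    A sketch of what that awareness reduces to once per-order watermark
    accounting is gone (an illustrative helper, not the exact mainline hunk;
    field names follow the zone free-area layout of that era):

        /* Is at least one page of the requested order or larger free? */
        static bool zone_has_free_high_order_page(struct zone *z, unsigned int order)
        {
                unsigned int o;

                for (o = order; o < MAX_ORDER; o++) {
                        struct free_area *area = &z->free_area[o];
                        int mt;

                        /* Only the unreserved migratetype free lists count. */
                        for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                                if (!list_empty(&area->free_list[mt]))
                                        return true;
                        }
                }
                return false;
        }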

    With the patch applied, there was little difference in the allocation
    failure rates as the atomic reserves are small relative to the number of
    allocation attempts. The expected impact is that there will never be an
    allocation failure report that shows suitable pages on the free lists.

    The one potential side-effect of this is that in a vanilla kernel, the
    watermark checks may have kept a free page for an atomic allocation. Now,
    we are 100% relying on the HighAtomic reserves and an early allocation to
    have allocated them. If the first high-order atomic allocation is after
    the system is already heavily fragmented then it'll fail.

    [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • High-order watermark checking exists for two reasons -- kswapd high-order
    awareness and protection for high-order atomic requests. Historically the
    kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
    high-order free pages for as long as possible. This patch introduces
    MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
    allocations on demand and avoids using those blocks for order-0
    allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

    A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
    allocation request steals a pageblock, but the total number of reserved
    pageblocks is limited to 1% of the zone. Callers that speculatively abuse
    atomic allocations for long-lived high-order allocations to access the
    reserve will quickly fail. Note that SLUB is currently not such an abuser
    as it reclaims at least once. It is possible that the stolen pageblock
    has few suitable high-order pages and another pageblock will need to be
    stolen in the near future, but there would need to be strong
    justification to search all pageblocks for an ideal candidate.
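
    A sketch of the reservation step under the description above (field and
    helper names are assumptions; the real function also has to handle races
    under zone->lock):

        static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
        {
                unsigned long max_managed, flags;

                /* Cap the reserve at roughly 1% of the zone plus one pageblock. */
                max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
                if (zone->nr_reserved_highatomic >= max_managed)
                        return;

                spin_lock_irqsave(&zone->lock, flags);
                if (zone->nr_reserved_highatomic < max_managed &&
                    get_pageblock_migratetype(page) != MIGRATE_HIGHATOMIC) {
                        zone->nr_reserved_highatomic += pageblock_nr_pages;
                        set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
                        move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
                }
                spin_unlock_irqrestore(&zone->lock, flags);
        }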

    The pageblocks are unreserved if an allocation fails after a direct
    reclaim attempt.

    The watermark checks account for the reserved pageblocks when the
    allocation request is not a high-order atomic allocation.

    The reserved pageblocks can not be used for order-0 allocations. This may
    allow temporary wastage until a failed reclaim reassigns the pageblock.
    This is deliberate as the intent of the reservation is to satisfy a
    limited number of atomic high-order short-lived requests if the system
    requires them.

    The stutter benchmark was used to evaluate this; while it was running, a
    systemtap script randomly allocated between one high-order page and 12.5%
    of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger
    than the potential reserve and does not attempt to be realistic. It is
    intended to stress random high-order allocations from an unknown source
    and to show that failures are reduced without introducing an anomaly
    where atomic allocations become more reliable than regular allocations.
    The amount of memory reserved varied throughout the workload as reserves
    were created and reclaimed under memory pressure. The allocation failures
    once the workload warmed up were as follows:

    4.2-rc5-vanilla 70%
    4.2-rc5-atomic-reserve 56%

    The failure rate was also measured while building multiple kernels. The
    failure rate was 14% but is 6% with this patch applied.

    Overall, this is a small reduction but the reserves are small relative to
    the number of allocation requests. In early versions of the patch, the
    failure rate reduced by a much larger amount but that required much larger
    reserves and perversely made atomic allocations seem more reliable than
    regular allocations.

    [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MIGRATE_RESERVE preserves an old property of the buddy allocator that
    existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
    tended to remain contiguous until the only alternative was to fail the
    allocation. At the time it was discovered that high-order atomic
    allocations relied on this property, so MIGRATE_RESERVE was introduced. A
    later patch will introduce an alternative, MIGRATE_HIGHATOMIC, so this
    patch deletes MIGRATE_RESERVE and its supporting code first to make the
    later patch easier to review. Note that this patch in isolation may look
    like a false regression if someone is bisecting high-order atomic
    allocation failures.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zonelist cache (zlc) was introduced to skip over zones that were
    recently known to be full. This avoided expensive operations such as the
    cpuset checks, watermark calculations and zone_reclaim. The situation
    today is different and the complexity of zlc is harder to justify.

    1) The cpuset checks are no-ops unless a cpuset is active and in general
    are a lot cheaper.

    2) zone_reclaim is now disabled by default and I suspect that was a large
    source of the cost that zlc wanted to avoid. When it is enabled, it's
    known to be a major source of stalling when nodes fill up and it's
    unwise to hit every other user with the overhead.

    3) Watermark checks are expensive to calculate for high-order
    allocation requests. Later patches in this series will reduce the cost
    of the watermark checking.

    4) The most important issue is that in the current implementation it
    is possible for a failed THP allocation to mark a zone full for order-0
    allocations and cause a fallback to remote nodes.

    The last issue could be addressed with additional complexity but as the
    benefit of zlc is questionable, it is better to remove it. If stalls due
    to zone_reclaim are ever reported then an alternative would be to
    introduce deferring logic based on a timeout inside zone_reclaim itself
    and leave the page allocator fast paths alone.

    The impact on page-allocator microbenchmarks is negligible as they don't
    hit the paths where the zlc comes into play. Most page-reclaim related
    workloads showed no noticeable difference as a result of the removal.

    The impact was noticeable in a workload called "stutter". One part uses a
    lot of anonymous memory, a second measures mmap latency and a third copies
    a large file. In an ideal world the latency application would not notice
    the mmap latency. On a 2-node machine the results of this patch are

    stutter
                                 4.3.0-rc1             4.3.0-rc1
                                  baseline              nozlc-v4
    Min         mmap        20.9243 (  0.00%)     20.7716 (  0.73%)
    1st-qrtle   mmap        22.0612 (  0.00%)     22.0680 ( -0.03%)
    2nd-qrtle   mmap        22.3291 (  0.00%)     22.3809 ( -0.23%)
    3rd-qrtle   mmap        25.2244 (  0.00%)     25.2396 ( -0.06%)
    Max-90%     mmap        48.0995 (  0.00%)     28.3713 ( 41.02%)
    Max-93%     mmap        52.5557 (  0.00%)     36.0170 ( 31.47%)
    Max-95%     mmap        55.8173 (  0.00%)     47.3163 ( 15.23%)
    Max-99%     mmap        67.3781 (  0.00%)     70.1140 ( -4.06%)
    Max         mmap     24447.6375 (  0.00%)  12915.1356 ( 47.17%)
    Mean        mmap        33.7883 (  0.00%)     27.7944 ( 17.74%)
    Best99%Mean mmap        27.7825 (  0.00%)     25.2767 (  9.02%)
    Best95%Mean mmap        26.3912 (  0.00%)     23.7994 (  9.82%)
    Best90%Mean mmap        24.9886 (  0.00%)     23.2251 (  7.06%)
    Best50%Mean mmap        22.0157 (  0.00%)     22.0261 ( -0.05%)
    Best10%Mean mmap        21.6705 (  0.00%)     21.6083 (  0.29%)
    Best5%Mean  mmap        21.5581 (  0.00%)     21.4611 (  0.45%)
    Best1%Mean  mmap        21.3079 (  0.00%)     21.1631 (  0.68%)

    Note that the maximum stall latency went from 24 seconds to 12, which is
    still bad but an improvement. The mileage varies considerably: a 2-node
    machine in an earlier test went from 494 seconds to 47 seconds, and a
    4-node machine that tested an earlier version of this patch went from a
    worst-case stall time of 6 seconds to 67ms. The nature of the benchmark
    is inherently unpredictable as it is hammering the system and the mileage
    will vary between machines.

    There is a secondary impact with potentially more direct reclaim because
    zones are now being considered instead of being skipped by zlc. In this
    particular test run it did not occur so will not be described. However,
    in at least one test the following was observed

    1. Direct reclaim rates were higher. This was likely due to direct reclaim
    being entered instead of the zlc disabling a zone and busy looping.
    Busy looping may have the effect of allowing kswapd to make more
    progress and in some cases may be better overall. If this is found then
    the correct action is to put direct reclaimers to sleep on a waitqueue
    and allow kswapd to make forward progress. Busy looping on the zlc is even
    worse than when the allocator used to blindly call congestion_wait().

    2. There was higher swap activity as direct reclaim was active.

    3. Direct reclaim efficiency was lower. This is related to 1, as more
    scanning activity also encountered more pages that could not be
    immediately reclaimed.

    In that case, the direct page scan and reclaim rates are noticeable but
    it is not considered a problem for a few reasons

    1. The test is primarily concerned with latency. The mmap attempts are also
    faulted which means there are THP allocation requests. The ZLC could
    cause zones to be disabled causing the process to busy loop instead
    of reclaiming. This looks like elevated direct reclaim activity but
    it's the correct action to take based on what processes requested.

    2. The test hammers reclaim and compaction heavily. The number of successful
    THP faults is highly variable but affects the reclaim stats. It's not a
    realistic or reasonable measure of page reclaim activity.

    3. No other page-reclaim intensive workload that was tested showed a problem.

    4. If a workload is identified that benefitted from the busy looping then it
    should be fixed by having direct reclaimers sleep on a wait queue until
    woken by kswapd instead of busy looping. We had this class of problem before
    when congestion_wait() with a fixed timeout was a brain-damaged decision
    but happened to benefit some workloads.

    If a workload is identified that relied on the zlc to busy loop then it
    should be fixed correctly and have a direct reclaimer sleep on a waitqueue
    until woken by kswapd.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity, clearing
    it prevents any.
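
    In other words, the new flag is simply the union of the two reclaim bits,
    so callers can mask off exactly the part they do not want. A sketch
    (assuming the usual gfp.h spelling of the internal ___GFP_* constants):

        #define __GFP_RECLAIM \
                ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))

        /* Example: allow kswapd to be woken but never enter direct reclaim. */
        gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;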

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (a minimal sketch follows
    this list). This is because checking for __GFP_WAIT as was done
    historically can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
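
    The helper mentioned above reduces to testing the direct-reclaim bit; a
    minimal sketch:

        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }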

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • File-backed pages that will be immediately written are balanced between
    zones. This heuristic tries to avoid having a single zone filled with
    recently dirtied pages but the checks are unnecessarily expensive. Move
    consider_zone_balanced into the alloc_context instead of checking bitmaps
    multiple times. The patch also gives the parameter a more meaningful
    name.
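
    A sketch of the idea, assuming the cached flag lives in struct
    alloc_context (the field name here is illustrative): decide once at
    context-setup time instead of rechecking the gfp mask and node bitmaps
    for every candidate zone.

        /* When building the allocation context: */
        ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);

        /* Later, in the per-zone loop of get_page_from_freelist(): */
        if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
                continue;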

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Overall, the intent of this series is to remove the zonelist cache which
    was introduced to avoid high overhead in the page allocator. Once this is
    done, it is necessary to reduce the cost of watermark checks.

    The series starts with minor micro-optimisations.

    Next it notes that GFP flags that affect watermark checks are abused.
    __GFP_WAIT historically identified callers that could not sleep and could
    access reserves. This was later abused to identify callers that simply
    prefer to avoid sleeping and have other options. A patch distinguishes
    between atomic callers, high-priority callers and those that simply wish
    to avoid sleep.

    The zonelist cache has been around for a long time but it is of dubious
    merit with a lot of complexity and some issues that are explained. The
    most important issue is that a failed THP allocation can cause a zone to
    be treated as "full". This potentially causes unnecessary stalls, reclaim
    activity or remote fallbacks. The issues could be fixed but it's not
    worth it. The series places a small number of other micro-optimisations
    on top before examining GFP flags and watermarks.

    High-order watermarks enforcement can cause high-order allocations to fail
    even though pages are free. The watermark checks both protect high-order
    atomic allocations and make kswapd aware of high-order pages but there is
    a much better way that can be handled using migrate types. This series
    uses page grouping by mobility to reserve pageblocks for high-order
    allocations with the size of the reservation depending on demand. kswapd
    awareness is maintained by examining the free lists. By patch 12 in this
    series, there are no high-order watermark checks while preserving the
    properties that motivated the introduction of the watermark checks.

    This patch (of 10):

    No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
    removes the unnecessary parameter.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

5 commits

  • Charging kmem pages proceeds in two steps. First, we try to charge the
    allocation size to the memcg the current task belongs to, then we allocate
    a page and "commit" the charge storing the pointer to the memcg in the
    page struct.

    Such a design looks overcomplicated, because there is not much sense in
    trying to charge the allocation before actually allocating a page: we
    won't be able to consume much memory over the limit even if we charge
    after doing the actual allocation. Besides, we already charge user pages
    post factum, so being pedantic with kmem pages just looks pointless.

    So this patch simplifies the design by merging the "charge" and the
    "commit" steps into the same function, which takes the allocated page.

    Also, rename the charge and uncharge methods to memcg_kmem_charge and
    memcg_kmem_uncharge and make the charge method return an error code
    instead of a bool, to conform to mem_cgroup_try_charge.
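
    A sketch of how the merged API ends up being used on the kmem page
    allocation path (the wrapper shown here is illustrative; the point is
    that the charge happens after, and against, the already-allocated page):

        struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
        {
                struct page *page;

                page = alloc_pages(gfp_mask, order);
                if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
                        /* Charge failed post factum: give the page back. */
                        __free_pages(page, order);
                        page = NULL;
                }
                return page;
        }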

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If kernelcore was not specified, or the kernelcore size is zero
    (required_movablecore >= totalpages), or the kernelcore size is larger
    than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
    both kernel memory and movable memory.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Srinivas Kandagatla reported bad page messages when trying to remove the
    bottom 2MB on an ARM based IFC6410 board

    BUG: Bad page state in process swapper pfn:fffa8
    page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x200041(locked|active|mlocked)
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
    Hardware name: Qualcomm (Flattened Device Tree)
    unwind_backtrace
    show_stack
    dump_stack
    bad_page
    free_pages_prepare
    free_hot_cold_page
    __free_pages
    free_highmem_page
    mem_init
    start_kernel
    Disabling lock debugging due to kernel taint

    Removing the lower 2MB caused the start of the lowmem zone to no longer
    be pageblock aligned. IFC6410 uses CONFIG_FLATMEM, where
    alloc_node_mem_map allocates memory for the mem_map. alloc_node_mem_map
    will offset for unaligned nodes with the assumption that the pfn/page
    translation functions will account for the offset. The functions for
    CONFIG_FLATMEM do not offset, however, resulting in overrunning the
    memmap array. Just use the allocated memmap without any offset when
    running with CONFIG_FLATMEM to avoid the overrun.

    Signed-off-by: Laura Abbott
    Reported-by: Srinivas Kandagatla
    Tested-by: Srinivas Kandagatla
    Acked-by: Vlastimil Babka
    Tested-by: Bjorn Andersson
    Cc: Santosh Shilimkar
    Cc: Russell King
    Cc: Kevin Hilman
    Cc: Arnd Bergman
    Cc: Stephen Boyd
    Cc: Andy Gross
    Cc: Mel Gorman
    Cc: Steven Rostedt
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • If the user set "movablecore=xx" to a large number, corepages will
    overflow. Fix the problem.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Tang Chen
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
    experienced on the Cell architecture when init-time functions,
    early_*(), are called at runtime by introducing an 'enum memmap_context'
    parameter to memmap_init_zone() and init_currently_empty_zone(). This
    parameter is intended to be used to tell whether the call of these two
    functions is being made on behalf of a hotplug event, or happening at
    boot-time. However, init_currently_empty_zone() does not use this
    parameter at all, so remove it.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

04 Oct, 2015

1 commit

  • It's a bit odd that debugfs_create_bool() takes a 'u32 *' as an argument
    when all it needs is a boolean pointer.

    It would be better to update this API to make it accept 'bool *'
    instead, as that will make it more consistent and often more convenient.
    On top of that, a bool takes just one byte.
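
    Under that change the prototype becomes (the rest of the signature is
    unchanged; previously the last parameter was 'u32 *value'):

        struct dentry *debugfs_create_bool(const char *name, umode_t mode,
                                           struct dentry *parent, bool *value);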

    That required updating all user sites as well, in the same commit as the
    API change. The regmap core was also using
    debugfs_{read|write}_file_bool() directly, and its variable types were
    updated to bool as well.

    Signed-off-by: Viresh Kumar
    Acked-by: Mark Brown
    Acked-by: Charles Keepax
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

09 Sep, 2015

13 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton: (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • For a memoryless node, the outputs of get_pfn_range_for_nid() are all
    zero, so it will display mem from 0 to -1.

    Signed-off-by: Zhen Lei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei
     
  • When hot adding a node from add_memory(), we will add memblock first, so
    the node is not empty. But when called from cpu_up(), the node should
    be empty.

    Signed-off-by: Xishi Qiu
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Taku Izumi
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • We use sysctl_lowmem_reserve_ratio rather than
    sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel
    is in defending lowmem from the possibility of being captured into
    pinned user memory. To avoid confusion, correct this in the comments.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • The comment says that the per-cpu batchsize and zone watermarks are
    determined by present_pages, which is definitely wrong; they are both
    calculated from managed_pages. Fix it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.
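
    A sketch of the resulting pair (the sanity checks are simplified here
    and, as noted below, are adjusted again by the next patch):

        /* Optimized variant: nid must be a valid node, no NUMA_NO_NODE fallback. */
        static inline struct page *
        __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }

        /* General-purpose variant: falls back to the current node. */
        static inline struct page *
        alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
        {
                if (nid < 0)            /* note: this includes NUMA_NO_NODE */
                        nid = numa_node_id();

                return __alloc_pages_node(nid, gfp_mask, order);
        }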

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The pair of get/set_freepage_migratetype() functions are used to cache
    pageblock migratetype for a page put on a pcplist, so that it does not
    have to be retrieved again when the page is put on a free list (e.g.
    when pcplists become full). Historically it was also assumed that the
    value is accurate for pages on freelists (as the functions' names
    unfortunately suggest), but that cannot be guaranteed without affecting
    various allocator fast paths. It is in fact not needed and all such
    uses have been removed.

    The last remaining (but pointless) usage related to pages on freelists
    is in move_freepages(), which this patch removes.

    To prevent further confusion, rename the functions to
    get/set_pcppage_migratetype() and expand their description. Since all
    the users are now in mm/page_alloc.c, move the functions there from the
    shared header.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Cc: Minchan Kim
    Acked-by: Michal Nazarewicz
    Cc: Laura Abbott
    Reviewed-by: Naoya Horiguchi
    Cc: Seungho Park
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • __test_page_isolated_in_pageblock() is used to verify whether all
    pages in a pageblock were either successfully isolated, or are hwpoisoned.
    Two of the possible page states that are tested are, however, bogus and
    misleading.

    Both tests rely on get_freepage_migratetype(page), which however has no
    guarantees about pages on freelists. Specifically, it doesn't guarantee
    that the migratetype returned by the function actually matches the
    migratetype of the freelist that the page is on. Such guarantee is not
    its purpose and would have negative impact on allocator performance.

    The first test checks whether the freepage_migratetype equals
    MIGRATE_ISOLATE, supposedly to catch races between page isolation and
    allocator activity. These races should be fixed nowadays with
    51bb1a4093 ("mm/page_alloc: add freepage on isolate pageblock to correct
    buddy list") and related patches. As explained above, the check
    wouldn't be able to catch them reliably anyway. For the same reason
    false positives can happen, although they are harmless, as the
    move_freepages() call would just move the page to the same freelist it's
    already on. So removing the test is not a bug fix, just cleanup. After
    this patch, we assume that all PageBuddy pages are on the correct
    freelist and that the races were really fixed. A truly reliable
    verification in the form of e.g. VM_BUG_ON() would be complicated and
    is arguably not needed.

    The second test (page_count(page) == 0 && get_freepage_migratetype(page)
    == MIGRATE_ISOLATE) is probably supposed (the code comes from a big
    memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE
    pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages
    nowadays, those are freed directly to free lists, so the check is
    obsolete. Remove it as well.

    Signed-off-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: Minchan Kim
    Acked-by: Michal Nazarewicz
    Cc: Laura Abbott
    Reviewed-by: Naoya Horiguchi
    Cc: Seungho Park
    Cc: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Acked-by: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The force_kill member of struct oom_control isn't needed if an order of -1
    is used instead. This is the same as order == -1 in struct
    compact_control which requires full memory compaction.

    This patch introduces no functional change.

    Signed-off-by: David Rientjes
    Cc: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are essential elements to an oom context that are passed around to
    multiple functions.

    Organize these elements into a new struct, struct oom_control, that
    specifies the context for an oom condition.

    This patch introduces no functional change.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit febd5949e134 ("mm/memory hotplug: init the zone's size when
    calculating node totalpages") refines the function
    free_area_init_core().

    After doing so, these two parameters are not used anymore.

    This patch removes these two parameters.

    Signed-off-by: Wei Yang
    Cc: Gu Zheng
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • nr_node_ids records the highest possible node id, which is calculated by
    scanning the bitmap node_states[N_POSSIBLE]. The current implementation
    scans the bitmap from the beginning, which walks the whole bitmap.

    This patch reverses the order by scanning from the end with
    find_last_bit().
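
    A sketch of the change (assuming the usual setup_nr_node_ids() helper in
    mm/page_alloc.c):

        void __init setup_nr_node_ids(void)
        {
                unsigned int highest;

                /* The last set bit in N_POSSIBLE is the highest possible node id. */
                highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES);
                nr_node_ids = highest + 1;
        }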

    Signed-off-by: Wei Yang
    Cc: Tejun Heo
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

28 Aug, 2015

1 commit

  • While pmem is usable as a block device or via DAX mappings to userspace
    there are several usage scenarios that can not target pmem due to its
    lack of struct page coverage. In preparation for "hot plugging" pmem
    into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
    separately from the ones that are subject to standard page allocations.
    Importantly "device memory" can be removed at will by userspace
    unbinding the driver of the device.

    Having a separate zone prevents allocation and otherwise marks these
    pages as distinct from typical uniform memory. Device memory has
    different lifetime and performance characteristics than RAM. However,
    since we have run out of ZONES_SHIFT bits this functionality currently
    depends on sacrificing ZONE_DMA.

    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Jerome Glisse
    [hch: various simplifications in the arch interface]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Aug, 2015

1 commit

  • Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
            skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping
    to NULL and leave page->index alone. Due to being in a union, a non-zero
    page->index will then be interpreted as page->pfmemalloc being true.

    So the assumption is invalid if the networking code can see such a page.
    And it seems it can. We have encountered this with an NFS over loopback
    setup when such a page is attached to a new skbuf. There is no copying
    going on in this case, so the page confuses __skb_fill_page_desc, which
    interprets the index as the pfmemalloc flag. The network stack then drops
    packets that have been allocated using the reserves unless they are to be
    queued on sockets handling the swapping, which is the case here, and that
    leads to hangs when the NFS client waits for a response from the server
    which has been dropped and thus never arrives.

    The struct page is already heavily packed so rather than finding another
    hole to put it in, let's do a trick instead. We can reuse the index
    again but define it to an impossible value (-1UL). This is the page
    index so it should never see the value that large. Replace all direct
    users of page->pfmemalloc by page_is_pfmemalloc which will hide this
    nastiness from unspoiled eyes.
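
    A sketch of the resulting helpers (a minimal rendering of the trick
    described above):

        static inline void set_page_pfmemalloc(struct page *page)
        {
                page->index = -1UL;
        }

        static inline void clear_page_pfmemalloc(struct page *page)
        {
                page->index = 0;
        }

        static inline bool page_is_pfmemalloc(struct page *page)
        {
                /*
                 * The page index cannot plausibly be this large, so if it
                 * is, the page was allocated from the pfmemalloc reserves.
                 */
                return page->index == -1UL;
        }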

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Cc: [3.6+]
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2015

1 commit

  • When we add a new node, the edge of memory may be wrong.

    e.g. the system has 4 nodes, and node3 is movable with mem:[24G-32G]:

    1. hotremove node3,
    2. then hotadd node3 with only part of its memory, mem:[26G-30G],
    3. call hotadd_new_pgdat()
         free_area_init_node()
           get_pfn_range_for_nid()
    4. it will return a wrong start_pfn and end_pfn, because we have not
       updated the memblock.

    This patch also fixes a BUG_ON during hot-addition, please see
    http://marc.info/?l=linux-kernel&m=142961156129456&w=2

    Signed-off-by: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Cc: Kamezawa Hiroyuki
    Cc: Taku Izumi
    Cc: Tang Chen
    Cc: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

07 Aug, 2015

4 commits

  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because that can happen not only for soft-offline, but also
    for hard-offline. Consider that a slab page is about to be freed into
    buddy pool, and then an uncorrected memory error hits the page just
    after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
    necessary because the data on the affected page is not consumed.

    To solve it, this patch drops __PG_HWPOISON from page flag checks at
    allocation/free time. I think it's justified because the __PG_HWPOISON
    flag is defined to prevent the page from being reused, and setting it
    outside the page's alloc-free cycle is a designed behavior (not a bug.)

    In recent months, I was annoyed by the BUG_ON when a soft-offlined page
    remains on the LRU cache list for a while, which is avoided by calling
    put_page() instead of putback_lru_page() in page migration's success
    path. This means that this patch reverts a major change from commit
    add05cecef80 about the new refcounting rule of soft-offlined pages, so
    the "reuse window" revives. This will be closed by a subsequent patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Dave Hansen reported the following;

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 0e1cc95b4cc7 ("mm: meminit: finish initialisation of struct pages
    before basic setup") introduced a rwsem to signal completion of the
    initialization workers.

    Lockdep complains about possible recursive locking:
    =============================================
    [ INFO: possible recursive locking detected ]
    4.1.0-12802-g1dc51b8 #3 Not tainted
    ---------------------------------------------
    swapper/0/1 is trying to acquire lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0xc7/0xe6

    but task is already holding lock:
    (pgdat_init_rwsem){++++.+},
    at: [] page_alloc_init_late+0x3e/0xe6

    Replace the rwsem by a completion together with an atomic
    "outstanding work counter".

    [peterz@infradead.org: Barrier removal on the grounds of being pointless]
    [mgorman@suse.de: Applied review feedback]
    Signed-off-by: Nicolai Stange
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolai Stange
     
  • early_pfn_to_nid() historically was inherently not SMP safe but only
    used during boot which is inherently single threaded or during hotplug
    which is protected by a giant mutex.

    With deferred memory initialisation, a thread-safe version was introduced
    and early_pfn_to_nid would trigger a BUG_ON if used unsafely. Memory
    hotplug hit that check. This patch introduces a lock in early_pfn_to_nid
    to make it safe to use during hotplug.
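
    A sketch of the locked version (the cache variable name and the node-0
    fallback are assumptions about the surrounding code):

        static DEFINE_SPINLOCK(early_pfn_lock);

        int __meminit early_pfn_to_nid(unsigned long pfn)
        {
                int nid;

                /* Serialise access to the shared pfn->nid cache. */
                spin_lock(&early_pfn_lock);
                nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
                if (nid < 0)
                        nid = 0;        /* fall back to node 0 */
                spin_unlock(&early_pfn_lock);

                return nid;
        }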

    Signed-off-by: Mel Gorman
    Reported-by: Alex Ng
    Tested-by: Alex Ng
    Acked-by: Peter Zijlstra (Intel)
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Jul, 2015

2 commits

  • Currently, we set the wrong gfp_mask in the page_owner info for freepages
    isolated by compaction and for split pages. It causes an incorrect mixed
    pageblock report that we can get from '/proc/pagetypeinfo'. This metric
    is really useful to measure the fragmentation effect, so it should be
    accurate. This patch fixes it by setting the correct information.

    Without this patch, after a kernel build workload is finished, the number
    of mixed pageblocks is 112 among roughly 210 movable pageblocks.

    But, with this fix, the output shows that just 57 pageblocks are mixed.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When I tested my new patches, I found that the page pointer which is used
    for setting page_owner information had changed. This is because the page
    pointer is used to set the new migratetype in a loop; after that work,
    the page pointer could be out of bounds. If this wrong pointer is used
    for page_owner, an access violation happens. Below is the error message
    that I got.

    BUG: unable to handle kernel paging request at 0000000000b00018
    IP: [] save_stack_address+0x30/0x40
    PGD 1af2d067 PUD 166e0067 PMD 0
    Oops: 0002 [#1] SMP
    ...snip...
    Call Trace:
    print_context_stack+0xcf/0x100
    dump_trace+0x15f/0x320
    save_stack_trace+0x2f/0x50
    __set_page_owner+0x46/0x70
    __isolate_free_page+0x1f7/0x210
    split_free_page+0x21/0xb0
    isolate_freepages_block+0x1e2/0x410
    compaction_alloc+0x22d/0x2d0
    migrate_pages+0x289/0x8b0
    compact_zone+0x409/0x880
    compact_zone_order+0x6d/0x90
    try_to_compact_pages+0x110/0x210
    __alloc_pages_direct_compact+0x3d/0xe6
    __alloc_pages_nodemask+0x6cd/0x9a0
    alloc_pages_current+0x91/0x100
    runtest_store+0x296/0xa50
    simple_attr_write+0xbd/0xe0
    __vfs_write+0x28/0xf0
    vfs_write+0xa9/0x1b0
    SyS_write+0x46/0xb0
    system_call_fastpath+0x16/0x75

    This patch fixes this error by moving the set_page_owner() call up.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim