07 Sep, 2017

40 commits

  • Patch series "Ranged pagevec lookup", v2.

    In this series I make pagevec_lookup() update the index (to be
    consistent with pagevec_lookup_tag() and also as a preparation for
    ranged lookups), provide a ranged variant of pagevec_lookup(), and use
    it in places where it makes sense. This not only removes some common
    code but is also a measurable performance win for some use cases (see
    patch 4/10) where the radix tree is sparse and searching & grabbing a
    page after the end of the range has measurable overhead.
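
    A minimal usage sketch of the ranged lookup described above (the
    helper name and exact signature are assumptions for illustration;
    do_something() stands in for caller-specific work):

    	struct pagevec pvec;
    	pgoff_t index = start;	/* advanced past the last returned page */
    	unsigned i;

    	pagevec_init(&pvec, 0);
    	while (pagevec_lookup_range(&pvec, mapping, &index, end)) {
    		for (i = 0; i < pagevec_count(&pvec); i++)
    			do_something(pvec.pages[i]);
    		pagevec_release(&pvec);
    		cond_resched();
    	}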

    This patch (of 10):

    The callback doesn't ever get called. Remove it.

    Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    Tetsuo Handa has reported[1][2][3] that direct reclaimers might get
    stuck in the too_many_isolated loop basically forever because the last
    few pages on the LRU lists are isolated by kswapd, which is stuck on fs
    locks when doing the pageout or slab reclaim. This in turn means that
    there is nobody to actually trigger the oom killer and the system is
    basically unusable.

    too_many_isolated has been introduced by commit 35cd78156c49 ("vmscan:
    throttle direct reclaim when too many pages are isolated already") to
    prevent premature oom killer invocations because back then no reclaim
    progress could indeed trigger the OOM killer too early.

    But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom:
    rework oom detection") the allocation/reclaim retry loop considers all
    the reclaimable pages and throttles the allocation at that layer so we
    can loosen the direct reclaim throttling.

    Make the shrink_inactive_list loop over too_many_isolated bounded and
    return immediately when the situation hasn't resolved after the first
    sleep.

    Replace congestion_wait with a simple schedule_timeout_interruptible
    because we are not really waiting on IO congestion in this path.
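
    A minimal sketch of the bounded wait described above (the shape
    follows the generic reclaim code; the helper names and the timeout
    value are illustrative, and per the note below the sleep ended up
    uninterruptible):

    	/* in shrink_inactive_list(), sketch only */
    	bool stalled = false;

    	while (unlikely(too_many_isolated(pgdat, file, sc))) {
    		if (stalled)
    			return 0;	/* give up after a single sleep */

    		/* wait a bit for the reclaimer; we are not waiting on IO */
    		schedule_timeout_uninterruptible(HZ / 10);
    		stalled = true;

    		/* we are about to die and free our memory, return now */
    		if (fatal_signal_pending(current))
    			return SWAP_CLUSTER_MAX;
    	}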

    Please note that this patch can theoretically cause the OOM killer to
    trigger earlier while there are many pages isolated for the reclaim,
    which makes progress only very slowly. This would be obvious from the
    oom report, as the number of isolated pages is printed there. If we
    ever hit this, should_reclaim_retry should consider those numbers in
    its evaluation in one way or another.

    [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
    [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
    [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

    [mhocko@suse.com: switch to uninterruptible sleep]
    Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Getting -EBUSY from zs_page_migrate will make migration slow (retry) or
    fail (zs_page_putback will schedule_work free_work, but it cannot
    guarantee success).

    I noticed this issue because my kernel is patched with
    https://lkml.org/lkml/2014/5/28/113, which removes the retry in
    __alloc_contig_migrate_range.

    This retry handles the -EBUSY because it re-isolates the page and
    re-calls migrate_pages. Without it, cma_alloc fails at once with
    -EBUSY.

    According to the review from Minchan Kim in
    https://lkml.org/lkml/2014/5/28/113, I updated the patch to skip
    unnecessary loops but not return -EBUSY if the zspage is not in use.

    The following is what I got with highalloc-performance in a vbox with
    2 CPUs, 1G memory and 512 zram as swap, with swappiness set to 100.

    orig new
    Minor Faults 50805113 50830235
    Major Faults 43918 56530
    Swap Ins 42087 55680
    Swap Outs 89718 104700
    Allocation stalls 0 0
    DMA allocs 57787 52364
    DMA32 allocs 47964599 48043563
    Normal allocs 0 0
    Movable allocs 0 0
    Direct pages scanned 45493 23167
    Kswapd pages scanned 1565222 1725078
    Kswapd pages reclaimed 1342222 1503037
    Direct pages reclaimed 45615 25186
    Kswapd efficiency 85% 87%
    Kswapd velocity 1897.101 1949.042
    Direct efficiency 100% 108%
    Direct velocity 55.139 26.175
    Percentage direct scans 2% 1%
    Zone normal velocity 1952.240 1975.217
    Zone dma32 velocity 0.000 0.000
    Zone dma velocity 0.000 0.000
    Page writes by reclaim 89764.000 105233.000
    Page writes file 46 533
    Page writes anon 89718 104700
    Page reclaim immediate 21457 3699
    Sector Reads 3259688 3441368
    Sector Writes 3667252 3754836
    Page rescued immediate 0 0
    Slabs scanned 1042872 1160855
    Direct inode steals 8042 10089
    Kswapd inode steals 54295 29170
    Kswapd skipped wait 0 0
    THP fault alloc 175 154
    THP collapse alloc 226 289
    THP splits 0 0
    THP fault fallback 11 14
    THP collapse fail 3 2
    Compaction stalls 536 646
    Compaction success 322 358
    Compaction failures 214 288
    Page migrate success 119608 111063
    Page migrate failure 2723 2593
    Compaction pages isolated 250179 232652
    Compaction migrate scanned 9131832 9942306
    Compaction free scanned 2093272 2613998
    Compaction cost 192 189
    NUMA alloc hit 47124555 47193990
    NUMA alloc miss 0 0
    NUMA interleave hit 0 0
    NUMA alloc local 47124555 47193990
    NUMA base PTE updates 0 0
    NUMA huge PMD updates 0 0
    NUMA page range updates 0 0
    NUMA hint faults 0 0
    NUMA hint local faults 0 0
    NUMA hint local percent 100 100
    NUMA pages migrated 0 0
    AutoNUMA cost 0% 0%

    [akpm@linux-foundation.org: remove newline, per Minchan]
    Link: http://lkml.kernel.org/r/1500889535-19648-1-git-send-email-zhuhui@xiaomi.com
    Signed-off-by: Hui Zhu
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
    Nadav Amit reported that zap_page_range only specifies that the caller
    protect the VMA list but does not specify whether it is held for read
    or write, with callers using either. madvise holds mmap_sem for read,
    meaning that a parallel zap operation can unmap PTEs which are then
    potentially skipped by madvise, which may then return with stale TLB
    entries present. While the API could be extended, it would be a
    difficult API to use. This patch causes zap_page_range() to always
    consider flushing the full affected range. For small ranges or
    sparsely populated mappings, this may result in one additional spurious
    TLB flush. For larger ranges, it is possible that the TLB has already
    been flushed and the overhead is negligible. Either way, this approach
    is safer overall and avoids stale entries being present when madvise
    returns.

    This can be illustrated with the following program provided by Nadav
    Amit and slightly modified. With the patch applied, it has an exit code
    of 0 indicating a stale TLB entry did not leak to userspace.

    ---8<---
    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>
    #include <sys/mman.h>

    #define PAGE_SIZE	(4096)
    #define N_PAGES		(65536)

    volatile int sync_step = 0;
    volatile char *p;

    static inline unsigned long rdtsc(void)
    {
    	unsigned long hi, lo;

    	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    	return lo | (hi << 32);
    }

    static inline void wait_rdtsc(unsigned long cycles)
    {
    	unsigned long tsc = rdtsc();

    	while (rdtsc() - tsc < cycles);
    }

    void *big_madvise_thread(void *ign)
    {
    	sync_step = 1;
    	while (sync_step != 2);
    	madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_DONTNEED);
    	return NULL;
    }

    int main(void)
    {
    	pthread_t aux_thread;

    	p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
    		 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

    	memset((void*)p, 8, PAGE_SIZE * N_PAGES);

    	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
    	while (sync_step != 1);

    	*p = 8;		// Cache in TLB
    	sync_step = 2;
    	wait_rdtsc(100000);
    	madvise((void*)p, PAGE_SIZE, MADV_DONTNEED);
    	printf("data: %d (%s)\n", *p, (*p == 8 ? "stale, broken" : "cleared, fine"));
    	return *p == 8 ? -1 : 0;
    }
    ---8<---
    Reported-by: Nadav Amit
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When walking the page tables to resolve an address that points to a
    !p*d_present() entry, huge_pte_offset() returns inconsistent values
    depending on the level of the page table (PUD or PMD).

    It returns NULL in the case of a PUD entry, while in the case of a PMD
    entry it returns a pointer to the page table entry.

    A similar inconsistency exists when handling swap entries - NULL is
    returned for a PUD entry while a pointer to the pte_t is returned for
    a PMD entry.

    Update huge_pte_offset() to make the behaviour consistent - return a
    pointer to the pte_t for hugepage or swap entries. Only return NULL in
    instances where we have a p*d_none() entry and the size parameter
    doesn't match the hugepage size at this level of the page table.

    Document the behaviour to clarify the expected behaviour of this
    function. This is to set clear semantics for architecture specific
    implementations of huge_pte_offset().

    Discussions on the arm64 implementation of huge_pte_offset()
    (http://www.spinics.net/lists/linux-mm/msg133699.html) showed that there
    is benefit from returning a pte_t* in the case of p*d_none().

    The fault handling code in hugetlb_fault() can handle p*d_none() entries
    and saves an extra round trip to huge_pte_alloc(). Other callers of
    huge_pte_offset() should be ok as well.
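
    A sketch of the generic shape implied by these semantics (simplified;
    the actual generic and arm64 implementations may differ in detail):

    pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
    		       unsigned long sz)
    {
    	pgd_t *pgd = pgd_offset(mm, addr);
    	p4d_t *p4d;
    	pud_t *pud;
    	pmd_t *pmd;

    	if (!pgd_present(*pgd))
    		return NULL;
    	p4d = p4d_offset(pgd, addr);
    	if (!p4d_present(*p4d))
    		return NULL;

    	pud = pud_offset(p4d, addr);
    	/* NULL only when none and the size doesn't match this level */
    	if (sz != PUD_SIZE && pud_none(*pud))
    		return NULL;
    	/* hugepage or swap entry at PUD level */
    	if (pud_huge(*pud) || !pud_present(*pud))
    		return (pte_t *)pud;

    	pmd = pmd_offset(pud, addr);
    	if (sz != PMD_SIZE && pmd_none(*pmd))
    		return NULL;
    	/* hugepage or swap entry at PMD level */
    	if (pmd_huge(*pmd) || !pmd_present(*pmd))
    		return (pte_t *)pmd;

    	return NULL;
    }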

    [punit.agrawal@arm.com: v2]
    Link: http://lkml.kernel.org/r/20170725154114.24131-2-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Reviewed-by: Catalin Marinas
    Reviewed-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Will Deacon
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • These functions are the only bits of generic code that use
    {pud,pmd}_pfn() without checking for CONFIG_TRANSPARENT_HUGEPAGE. This
    works fine on x86, the only arch with devmap support, since the *_pfn()
    functions are always defined there, but this isn't true for every
    architecture.

    Link: http://lkml.kernel.org/r/20170626063833.11094-1-oohall@gmail.com
    Signed-off-by: Oliver O'Halloran
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • mremap will attempt to create a 'duplicate' mapping if old_size == 0 is
    specified. In the case of private mappings, mremap will actually create
    a fresh separate private mapping unrelated to the original. This does
    not fit with the design semantics of mremap as the intention is to
    create a new mapping based on the original.

    Therefore, return EINVAL in the case where an attempt is made to
    duplicate a private mapping. Also, print a warning message (once) if
    such an attempt is made.
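
    A small userspace illustration of the rejected case (a hypothetical
    test, not part of the patch; with this change the mremap() call is
    expected to fail with EINVAL for the private mapping):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
    	size_t len = 16 * 4096;
    	void *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
    			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    	/* old_size == 0 asks mremap to create a "duplicate" mapping */
    	void *dup = mremap(old, 0, len, MREMAP_MAYMOVE);

    	if (dup == MAP_FAILED)
    		printf("mremap: %s (EINVAL expected for private mappings)\n",
    		       strerror(errno));
    	else
    		printf("duplicate mapping created at %p\n", dup);
    	return 0;
    }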

    Link: http://lkml.kernel.org/r/cb9d9f6a-7095-582f-15a5-62643d65c736@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Aaron Lu
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • init_pages_in_zone() is run under zone->lock, which means a long lock
    time and disabled interrupts on large machines. This is currently not
    an issue since it runs early in boot, but a later patch will change
    that.

    However, like other pfn scanners, we don't actually need zone->lock even
    when other cpus are running. The only potentially dangerous operation
    here is reading bogus buddy page owner due to race, and we already know
    how to handle that. The worst that can happen is that we skip some
    early allocated pages, which should not affect the debugging power of
    page_owner noticeably.

    Link: http://lkml.kernel.org/r/20170720134029.25268-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Yang Shi
    Cc: Laura Abbott
    Cc: Vinayak Menon
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • page_ext_init() can take long on large machines, so add a cond_resched()
    point after each section is processed. This will allow moving the init
    to a later point at boot without triggering lockup reports.

    Link: http://lkml.kernel.org/r/20170720134029.25268-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Yang Shi
    Cc: Laura Abbott
    Cc: Vinayak Menon
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In init_pages_in_zone() we currently use the generic set_page_owner()
    function to initialize page_owner info for early allocated pages. This
    means we needlessly do lookup_page_ext() twice for each page, and more
    importantly save_stack(), which has to unwind the stack and find the
    corresponding stack depot handle. Because the stack is always the same
    for the initialization, unwind it once in init_pages_in_zone() and reuse
    the handle. Also avoid the repeated lookup_page_ext().

    This can significantly reduce boot times with page_owner=on on large
    machines, especially for kernels built without frame pointer, where the
    stack unwinding is noticeably slower.

    [vbabka@suse.cz: don't duplicate code of __set_page_owner(), per Michal Hocko]
    [akpm@linux-foundation.org: coding-style fixes]
    [vbabka@suse.cz: create statically allocated fake stack trace for early allocated pages, per Michal]
    Link: http://lkml.kernel.org/r/45813564-2342-fc8d-d31a-f4b68a724325@suse.cz
    Link: http://lkml.kernel.org/r/20170720134029.25268-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Yang Shi
    Cc: Laura Abbott
    Cc: Vinayak Menon
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Commit f52407ce2dea ("memory hotplug: alloc page from other node in
    memory online") has introduced N_HIGH_MEMORY checks to only use NUMA
    aware allocations when there is some memory present because the
    respective node might not have any memory yet at the time and so it
    could fail or even OOM.

    Things have changed since then though. Zonelists are now always
    initialized before we do any allocations even for hotplug (see
    959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
    zonelist")).

    Therefore these checks are not really needed. In fact caller of the
    allocator should never care about whether the node is populated because
    that might change at any time.

    Link: http://lkml.kernel.org/r/20170721143915.14161-10-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Shaohua Li
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
    potential race while building zonelist for new populated zone") to
    protect zonelist building from races. This is no longer needed though
    because both memory online and offline are fully serialized. New users
    have grown since then.

    Notably setup_per_zone_wmarks wants to prevent races between memory
    hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
    (see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
    add a private lock for that purpose. This will not prevent us from
    seeing a halfway-through memory hotplug operation but that shouldn't be
    a big deal because memory hotplug will update watermarks explicitly so
    we will eventually get a full picture. The lock just makes sure we
    won't race when updating watermarks leading to weird results.

    Also __build_all_zonelists manipulates global data so add a private lock
    for it as well. This doesn't seem to be necessary today but it is more
    robust to have a lock there.

    While we are at it make sure we document that memory online/offline
    depends on a full serialization either via mem_hotplug_begin() or
    device_lock.

    Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Haicheng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    build_all_zonelists has been (ab)using stop_machine to make sure that
    zonelists do not change while somebody is looking at them. This is
    just a gross hack because a) it complicates the context from which we
    can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
    switch locking to a percpu rwsem")), b) it is not really necessary
    especially after "mm, page_alloc: simplify zonelist initialization",
    and c) it doesn't really provide the protection it claims (see below).

    Updates of the zonelists happen very seldom, basically only when a zone
    becomes populated during memory online or when it loses all the memory
    during offline. A racing iteration over zonelists could either miss a
    zone or try to work on one zone twice. Both of these are something we
    can live with occasionally because there will always be at least one
    zone visible so we are not likely to fail allocation too easily for
    example.

    Please note that the original stop_machine approach doesn't really
    provide a better exclusion because the iteration might be interrupted
    half way (unless the whole iteration is preempt disabled, which is not
    the case in most cases) so some zones could still be seen twice or a
    zone could be missed.

    I have run the pathological online/offline of the single memblock in the
    movable zone while stressing the same small node with some memory
    pressure.

    Node 1, zone      DMA
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 943, 943, 943)
    Node 1, zone    DMA32
      pages free     227310
            min      8294
            low      10367
            high     12440
            spanned  262112
            present  262112
            managed  241436
            protection: (0, 0, 0, 0)
    Node 1, zone   Normal
      pages free     0
            min      0
            low      0
            high     0
            spanned  0
            present  0
            managed  0
            protection: (0, 0, 0, 1024)
    Node 1, zone  Movable
      pages free     32722
            min      85
            low      117
            high     149
            spanned  32768
            present  32768
            managed  32768
            protection: (0, 0, 0, 0)

    root@test1:/sys/devices/system/node/node1# while true
    do
    echo offline > memory34/state
    echo online_movable > memory34/state
    done

    root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

    and it survived without any unexpected behavior. While this is not
    really a great testing coverage it should exercise the allocation path
    quite a lot.

    Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    build_zonelists gradually builds zonelists from the nearest to the most
    distant node. As we do not know how many populated zones we will have
    in each node, we rely on the _zonerefs to terminate the initialized
    part of the zonelist with a NULL zone. While this is functionally
    correct it is quite suboptimal because we cannot allow updaters to race
    with zonelist users, because they could see an empty zonelist and fail
    the allocation or hit the OOM killer in the worst case.

    We can do much better, though. We can store the node ordering into an
    already existing node_order array and then give this array to
    build_zonelists_in_node_order and do the whole initialization at once.
    zonelist consumers still might see a halfway initialized state but that
    should be much more tolerable because the list will not be empty and
    they would either see some zone twice or skip over some zone(s) in the
    worst case, which shouldn't lead to immediate failures.

    While at it let's simplify build_zonelists_node which is rather
    confusing now. It gets an index into the zoneref array and returns the
    updated index for the next iteration. Let's rename the function to
    build_zonerefs_node to better reflect its purpose and give it the
    zoneref array to update. The function no longer takes the index; it
    just returns the number of added zones so that the caller can advance
    the zoneref array start for the next update.

    This patch alone doesn't introduce any functional change; it is merely
    preparatory work for later changes.
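
    A sketch of the reworked helper described above (simplified, based on
    the description; the exact code may differ):

    static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
    {
    	struct zone *zone;
    	enum zone_type zone_type = MAX_NR_ZONES;
    	int nr_zones = 0;

    	do {
    		zone_type--;
    		zone = pgdat->node_zones + zone_type;
    		if (managed_zone(zone))
    			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
    	} while (zone_type);

    	/* the caller advances its zoneref cursor by the returned count */
    	return nr_zones;
    }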

    Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • try_online_node calls hotadd_new_pgdat which already calls
    build_all_zonelists. So the additional call is redundant. Even though
    hotadd_new_pgdat will only initialize zonelists of the new node this is
    the right thing to do because such a node doesn't have any memory so
    other zonelists would ignore all the zones from this node anyway.

    Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Toshi Kani
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This will also remove a pointless
    zonelists rebuilding, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    __build_all_zonelists reinitializes each online cpu local node for
    CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
    memoryless nodes could gain some memory during memory hotplug and so
    the local node should be changed for CPUs close to such a node. It
    makes less sense to do that unconditionally for a newly created NUMA
    node which is still offline and without any memory.

    Let's also simplify the cpu loop and use for_each_online_cpu instead of
    an explicit cpu_online check for all possible cpus.

    Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • boot_pageset is a boot time hack which gets superseded by normal
    pagesets later in the boot process. It makes zero sense to reinitialize
    it again and again during memory hotplug.

    Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "cleanup zonelists initialization", v1.

    This is aimed at cleaning up the zonelists initialization code we have
    but the primary motivation was bug report [2] which got resolved but the
    usage of stop_machine is just too ugly to live. Most patches are
    straightforward but 3 of them need special consideration.

    Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
    because this is a user visible change. As I argue in the patch
    description I do not think we have a strong usecase for it these days.
    I have kept sysctl in place and warn into the log if somebody tries to
    configure zone lists ordering. If somebody has a real usecase for it we
    can revert this patch but I do not expect anybody will actually notice
    runtime differences. This patch is not strictly needed for the rest but
    it made patch 6 easier to implement.

    Patch 7 removes stop_machine from build_all_zonelists without adding any
    special synchronization between iterators and updater which I _believe_
    is acceptable as explained in the changelog. I hope I am not missing
    anything.

    Patch 8 then removes zonelists_mutex which is kind of ugly as well and
    not really needed AFAICS, but care should be taken when double-checking
    my thinking.

    This patch (of 9):

    Supporting zone ordered zonelists costs us just a lot of code while its
    usefulness is arguable, if existent at all. Mel has already made node
    ordering default on 64b systems. 32b systems are still using
    ZONELIST_ORDER_ZONE because it is considered better to fallback to a
    different NUMA node rather than consume precious lowmem zones.

    This argument is, however, weakened by the fact that the memory reclaim
    has been reworked to be node rather than zone oriented. This means that
    lowmem requests have to skip over all highmem pages on LRUs already and
    so zone ordering doesn't save the reclaim time much. So the only
    advantage of the zone ordering is under a light memory pressure when
    highmem requests do not ever hit into lowmem zones and the lowmem
    pressure doesn't need to reclaim.

    Considering that 32b NUMA systems are rather suboptimal already and it
    is generally advisable to use a 64b kernel on such HW I believe we
    should rather care about the code maintainability and just get rid of
    ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
    somebody tries to set zone ordering either from the kernel command line
    or the sysctl.

    [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
    Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Abdul Haleem
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    This patch adds documentation and a Kconfig entry for the writeback feature.

    Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    This patch enables read IO from the backing device. For the feature, it
    implements two IO read functions to transfer data from backing storage.

    One is an asynchronous IO function and the other is a synchronous one.

    The reason I need synchronous IO is partial writes, which need to
    complete the read IO before overwriting with the partial data.

    We could make the partial IO case asynchronous too, but at the moment I
    don't feel like adding more complexity to support such a rare use case,
    so I want to keep it simple.

    [xieyisheng1@huawei.com: read_from_bdev_async(): return 1 to avoid call page_endio() in zram_rw_page()]
    Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
    Link: http://lkml.kernel.org/r/1498459987-24562-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Yisheng Xie
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    This patch enables write IO to transfer data to the backing device. For
    that, it implements a write_to_bdev function which creates a new bio
    and chains it with the parent bio to make the parent bio asynchronous.

    For rw_page, which doesn't have a parent bio, it submits its own bio
    and handles IO completion via zram_page_end_io.

    Also, this patch defines a new flag, ZRAM_WB, to mark written pages for
    later read IO.

    [xieyisheng1@huawei.com: fix typo in comment]
    Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
    Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Yisheng Xie
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    For upcoming asynchronous IO like writeback, zram_rw_page should be
    aware of whether the requested IO was completed, or submitted
    successfully, or failed with an error.

    To that end, zram_bvec_rw now has three return values.

    -errno: returns error number
    0: IO request is done synchronously
    1: IO request is issued successfully.
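
    A sketch of how a caller such as zram_rw_page() might act on these
    values (illustrative only; the surrounding variable names and exact
    signature are assumptions):

    	ret = zram_bvec_rw(zram, &bv, index, offset, is_write);
    	if (ret < 0)			/* -errno: the request failed */
    		err = ret;
    	else if (ret == 0)		/* completed synchronously */
    		page_endio(page, is_write, 0);
    	/* ret == 1: submitted; the IO completion path ends the page later */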

    Link: http://lkml.kernel.org/r/1498459987-24562-7-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    With a backing device, zram needs to manage the backing device's free
    space.

    This patch adds bitmap logic to manage the free space, which is very
    naive. However, it should be simple enough considering the frequency
    of incompressible pages in zram.
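
    A minimal sketch of the kind of bitmap bookkeeping this adds (function
    and field names are illustrative, not necessarily those in the patch):

    static unsigned long alloc_block_bdev(struct zram *zram)
    {
    	unsigned long blk_idx = 1;	/* skip 0: it means "no backing slot" */
    retry:
    	blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, blk_idx);
    	if (blk_idx == zram->nr_pages)
    		return 0;		/* backing device is full */
    	if (test_and_set_bit(blk_idx, zram->bitmap))
    		goto retry;		/* lost the race, try the next bit */
    	return blk_idx;
    }

    static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
    {
    	clear_bit(blk_idx, zram->bitmap);
    }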

    Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    For the writeback feature, the user should set up a backing device
    before zram starts working.

    This patch enables the interface via /sys/block/zramX/backing_dev.

    Currently it supports a block device only, but it could be enhanced to
    support a file as well.
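
    Typical setup might look like this (the device name is only an
    example; as noted above, the backing device must be configured before
    the zram device starts working):

    # echo /dev/sdb1 > /sys/block/zram0/backing_dev
    # cat /sys/block/zram0/backing_dev
    /dev/sdb1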

    Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    zram_decompress_page's name is not proper because it doesn't decompress
    if the page was a dedup hit or was stored without compression.

    Use a more abstract name that is consistent with the write-path
    function __zram_bvec_write.

    Link: http://lkml.kernel.org/r/1498459987-24562-4-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    zram_compress does several things: compression, entry allocation and
    limit checking. I did it that way just for readability but it hurts
    modularization. :(

    So this patch removes the zram_compress function and inlines it in
    __zram_bvec_write for upcoming patches.

    Link: http://lkml.kernel.org/r/1498459987-24562-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "writeback incompressible pages to storage", v1.

    zram is useful for memory saving with compressible pages but sometimes
    the workload can change and the system can end up with lots of
    incompressible pages, which is very harmful for zram.

    This patch series supports a writeback feature for zram so an admin can
    set up a block device and, with it, zram can save memory by writing out
    incompressible pages (1/4 comp ratio) once it finds them, instead of
    keeping them in memory.

    Patches [1-3] are just clean up and [4-8] are step-by-step feature
    enablement. [4-8] are logically not bisectable (ie, logical unit
    separation), although I tried to keep everything compiling without
    breakage; I think it is better for review this way.

    This patch (of 9):

    __zram_bvec_write has some duplicated logic for zram metadata handling
    of same_page|compressed_page. This patch aims to clean it up without a
    behavior change.

    [xieyisheng1@huawei.com: fix compr_data_size stat]
    Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
    Link: http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1498459987-24562-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Signed-off-by: Yisheng Xie
    Reviewed-by: Sergey Senozhatsky
    Cc: Juneho Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Historically we have enforced that any kernel zone (e.g. ZONE_NORMAL)
    has to precede the Movable zone in the physical memory range. The
    purpose of the movable zone is, however, not bound to any physical
    memory restriction. It merely defines a class of migratable and
    reclaimable memory.

    There are users (e.g. CMA) who might want to reserve specific physical
    memory ranges for their own purpose. Moreover our pfn walkers have to
    be prepared for zones overlapping in the physical range already because
    we do support interleaving NUMA nodes and therefore zones can interleave
    as well. This means we can allow each memory block to be associated
    with a different zone.

    Loosen the current onlining semantics and allow an explicit onlining
    type on any memblock. That means that online_{kernel,movable} will be
    allowed regardless of the physical address of the memblock, as long as
    it is offline of course. This might result in the movable zone
    overlapping with other kernel zones. Default onlining then becomes a
    bit tricky but still sensible. echo online > memoryXY/state will
    online the given block to

    1) the default zone if the given range is outside of any zone
    2) the enclosing zone if such a zone doesn't interleave with
    any other zone
    3) the default zone if more zones interleave for this range

    where the default zone is the movable zone only if movable_node is
    enabled, otherwise it is a kernel zone.

    Here is an example of the semantics (movable_node is not present but it
    works in an analogous way). We start with the following memblocks, all
    of them offline:

    memory34/valid_zones:Normal Movable
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Normal Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    Now, we online block 34 in default mode and block 37 as movable

    root@test1:/sys/devices/system/node/node1# echo online > memory34/state
    root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal Movable
    memory40/valid_zones:Normal Movable
    memory41/valid_zones:Normal Movable

    As we can see, all other blocks can still be onlined both into the
    Normal and Movable zones, and Normal is the default because the Movable
    zone spans only block 37 now.

    root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Movable Normal
    memory39/valid_zones:Movable Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Now the default zone for blocks 37-41 has changed because movable zone
    spans that range.

    root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
    memory34/valid_zones:Normal
    memory35/valid_zones:Normal Movable
    memory36/valid_zones:Normal Movable
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal Movable
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable Normal
    memory41/valid_zones:Movable

    Note that the block 39 now belongs to the zone Normal and so block38
    falls into Normal by default as well.

    For completeness:

    root@test1:/sys/devices/system/node/node1# for i in memory[34]?
    do
    echo online > $i/state 2>/dev/null
    done

    memory34/valid_zones:Normal
    memory35/valid_zones:Normal
    memory36/valid_zones:Normal
    memory37/valid_zones:Movable
    memory38/valid_zones:Normal
    memory39/valid_zones:Normal
    memory40/valid_zones:Movable
    memory41/valid_zones:Movable

    Implementation-wise the change is quite straightforward. We can get rid
    of allow_online_pfn_range altogether. online_pages allows only offline
    nodes already. The original default_zone_for_pfn will become
    default_kernel_zone_for_pfn. New default_zone_for_pfn implements the
    above semantic. zone_for_pfn_range is slightly reorganized to implement
    kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
    catch all default behavior.

    Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
    hotadded memory to zones until online") we used to allow to change the
    valid zone types of a memory block if it is adjacent to a different zone
    type.

    This fact was reflected in memoryNN/valid_zones by the ordering of
    printed zones. The first one was default (echo online > memoryNN/state)
    and the other one could be onlined explicitly by online_{movable,kernel}.

    This behavior was removed by the said patch and as such the ordering was
    not all that important. In most cases a kernel zone would be default
    anyway. The only exception is movable_node handled by "mm,
    memory_hotplug: support movable_node for hotpluggable nodes".

    Let's reintroduce this behavior again because a later patch will remove
    the zone overlap restriction and so the user will be allowed to online
    a kernel resp. movable block regardless of its placement. The original
    behavior will then become significant again because it would be
    non-trivial for users to see what the default zone to online into is.

    Implementation is really simple. Pull the zone selection out of
    move_pfn_range into a zone_for_pfn_range helper and use it in
    show_valid_zones to display the zone for default onlining and then both
    kernel and movable if they are allowed. The default online zone is not
    duplicated.

    Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Reza Arbab
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Commit 9adb62a5df9c ("mm/hotplug: correctly setup fallback zonelists
    when creating new pgdat") tries to build the correct zonelist for a
    newly added node, while it is not necessary to rebuild it for already
    existing nodes.

    In build_zonelists(), it will iterate over nodes with memory. For a
    newly added node, it will not have memory until node_states_set_node()
    is called in online_pages().

    This patch avoids rebuilding the zonelists for already existing nodes.

    build_zonelists_node() uses managed_zone(zone) checks, so it should not
    include empty zones anyway. So effectively we avoid some pointless work
    under stop_machine().

    [akpm@linux-foundation.org: tweak comment text]
    [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
    Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jiang Liu
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    shrink_slab() allows us to report back the number of objects we
    successfully scanned (out of the target shrinkctl->nr_to_scan). As we
    report the number of pages owned by each GEM object as a separate item
    to the shrinker, we cannot precisely control the number of shrinker
    objects we scan on each pass; and indeed may free more than requested.
    If we fail to tell the shrinker about the number of objects we process,
    it will continue to hold a grudge against us as any objects left
    unscanned are added to the next reclaim -- and so we will keep on
    "unfairly" shrinking our own slab in comparison to other slabs.

    Link: http://lkml.kernel.org/r/20170822135325.9191-2-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson
    Cc: Joonas Lahtinen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
    Some shrinkers may only be able to free a bunch of objects at a time,
    and so free more than the requested nr_to_scan in one pass, whilst
    other shrinkers may find themselves unable to scan as many objects as
    they counted, and so underreport.

    Account for the extra freed/scanned objects against the total number of
    objects we intend to scan, otherwise we may end up penalising the slab
    far more than intended. Similarly, we want to add the underperforming
    scan to the deferred pass so that we try harder and harder in future
    passes.

    Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Joonas Lahtinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
    Add an assertion similar to the "fasttop" check in the GNU C Library
    allocator as part of the SLAB_FREELIST_HARDENED feature. An object
    added to a singly linked freelist should not point to itself. That
    helps to detect some double free errors (e.g. CVE-2017-2636) without
    slub_debug and KASAN.
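
    The check amounts to something like this in SLUB's set_freepointer()
    (a simplified sketch of the assertion; the freelist pointer
    obfuscation described in the next entry is left out):

    static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
    {
    	void **freeptr_addr = (void **)((unsigned long)object + s->offset);

    #ifdef CONFIG_SLAB_FREELIST_HARDENED
    	/* naive double-free detection: an object must not point to itself */
    	BUG_ON(object == fp);
    #endif

    	*freeptr_addr = fp;
    }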

    Link: http://lkml.kernel.org/r/1502468246-1262-1-git-send-email-alex.popov@linux.com
    Signed-off-by: Alexander Popov
    Acked-by: Christoph Lameter
    Cc: Kees Cook
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Paul E McKenney
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Nicolas Pitre
    Cc: Rik van Riel
    Cc: Tycho Andersen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Popov
     
  • This SLUB free list pointer obfuscation code is modified from Brad
    Spengler/PaX Team's code in the last public patch of grsecurity/PaX
    based on my understanding of the code. Changes or omissions from the
    original code are mine and don't reflect the original grsecurity/PaX
    code.

    This adds a per-cache random value to SLUB caches that is XORed with
    their freelist pointer address and value. This adds nearly zero
    overhead and frustrates the very common heap overflow exploitation
    method of overwriting freelist pointers.

    A recent example of the attack is written up here:

    http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit

    and there is a section dedicated to the technique in the book "A Guide
    to Kernel Exploitation: Attacking the Core".

    This is based on patches by Daniel Micay, and refactored to minimize the
    use of #ifdef.
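
    The core of it reduces to XORing the stored pointer with a per-cache
    random value and the pointer's own storage address, roughly like this
    (sketch; field names are assumptions):

    static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
    				 unsigned long ptr_addr)
    {
    #ifdef CONFIG_SLAB_FREELIST_HARDENED
    	/* the transform is its own inverse: applying it twice recovers ptr */
    	return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
    #else
    	return ptr;
    #endif
    }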

    With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
    run times:

    before:
    mean 10.11882499999999999995
    variance .03320378329145728642
    stdev .18221905304181911048

    after:
    mean 10.12654000000000000014
    variance .04700556623115577889
    stdev .21680767106160192064

    The difference gets lost in the noise, but if the above is to be taken
    literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.

    Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
    Signed-off-by: Kees Cook
    Suggested-by: Daniel Micay
    Cc: Rik van Riel
    Cc: Tycho Andersen
    Cc: Alexander Popov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • - free_kmem_cache_nodes() frees the cache node before nulling out a
    reference to it

    - init_kmem_cache_nodes() publishes the cache node before initializing
    it

    Neither of these matters at runtime because the cache nodes cannot be
    looked up by any other thread. But it's neater and more consistent to
    reorder these.
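
    For the first case, the reordering is essentially (sketch):

    	for_each_kmem_cache_node(s, node, n) {
    		s->node[node] = NULL;			/* drop the reference first */
    		kmem_cache_free(kmem_cache_node, n);	/* then free the node */
    	}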

    Link: http://lkml.kernel.org/r/20170707083408.40410-1-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
    Clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     
  • The function is never called outside of fs/ocfs2/acl.c.

    Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • There is code duplication between sec_name() and sech_name(). Simplify
    sec_name() by re-using sech_name(). Also, move them up to remove the
    forward declaration of sec_name().

    Link: http://lkml.kernel.org/r/1502248721-22009-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Kees Cook
    Cc: Nicholas Piggin
    Cc: Jessica Yu
    Cc: Chris Metcalf
    Cc: Heinrich Schuchardt
    Cc: Ingo Molnar
    Cc: Ard Biesheuvel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • dax_pmd_insert_mapping() contains the following code:

    	pfn_t pfn;

    	if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
    		goto fallback;
    	/* ... */
    fallback:
    	trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);

    When the condition in the if statement fails, the function calls
    trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.

    This issue has been found while building the kernel with clang. The
    compiler reported:

    fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
    whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
    if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/dax.c:1310:60: note: uninitialized use occurs here
    trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
    ^~~

    Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Reviewed-by: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss