08 Oct, 2016

40 commits

    Until now, if a page_ext user wanted its own field on page_ext, the
    field had to be defined in struct page_ext by hard-coding. This
    wastes memory in the following situation.

    struct page_ext {
    #ifdef CONFIG_A
            int a;
    #endif
    #ifdef CONFIG_B
            int b;
    #endif
    };

    Assume that the kernel is built with both CONFIG_A and CONFIG_B.
    Even if we enable feature A and don't enable feature B at runtime,
    each entry of struct page_ext takes two ints rather than one. This is
    an undesirable result, so this patch tries to fix it.

    To solve the above problem, this patch implements support for extra
    space allocation at runtime. When a user's need() callback returns
    true, its extra memory requirement is added to the entry size of
    page_ext, and the offset of that user's extra memory within each
    entry is returned. With this offset, the user can access its extra
    space without hard-coding a field in struct page_ext.
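
    Concretely, the descriptor each page_ext user registers plausibly
    ends up looking like the following (a sketch inferred from the
    description above; feature_a_enabled and the feature-A user are
    hypothetical):

    struct page_ext_operations {
            size_t offset;          /* set at runtime: offset of this
                                       user's data within each entry */
            size_t size;            /* extra space the user requests */
            bool (*need)(void);     /* allocate the space at all? */
            void (*init)(void);
    };

    static bool feature_a_enabled;  /* assumed runtime knob */

    static bool need_feature_a(void)
    {
            return feature_a_enabled;
    }

    /* reserve one u64 per page only when feature A is enabled */
    struct page_ext_operations feature_a_ops = {
            .size = sizeof(u64),
            .need = need_feature_a,
    };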

    This patch implements only the infrastructure. A following patch will
    use it for page_owner, which is the only user having its own fields
    on page_ext.

    Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    Here, 'offset' means the entry index in the page_ext array. A
    following patch will use 'offset' for the field offset within each
    entry, so rename the current 'offset' to prevent confusion.

    Link: http://lkml.kernel.org/r/1471315879-32294-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    There is no reason for the page_owner-specific function to reside in
    vmstat.c, so move it to page_owner.c.

    Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    What debug_pagealloc does is just map/unmap pages in the page table.
    Basically, it doesn't need additional memory space to track anything.
    But with the guard page feature, it requires additional memory to
    distinguish whether a page is a guard page or not. Guard pages are
    only used when debug_guardpage_minorder is non-zero, so this patch
    removes the additional memory allocation (page_ext) if
    debug_guardpage_minorder is zero.

    This saves memory if we use debug_pagealloc without guard pages.
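
    A plausible shape for the guard-page user's need() callback under
    this change (a sketch; the helper names follow the existing
    debug_guardpage accessors):

    static bool need_debug_guardpage(void)
    {
            /* no debug_pagealloc, no guard pages, no page_ext needed */
            if (!debug_pagealloc_enabled())
                    return false;

            /* guard pages are only used with a non-zero minorder */
            if (!debug_guardpage_minorder())
                    return false;

            return true;
    }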

    Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "Reduce memory waste by page extension user".

    This patchset tries to reduce the memory wasted by page extension
    users.

    The first case is architecture-supported debug_pagealloc. It doesn't
    require additional memory if guard pages aren't used; 8 bytes per
    page are saved in this case.

    The second case is related to the page owner feature. Until now, if
    page_ext users wanted their own fields on page_ext, the fields had to
    be defined in struct page_ext by hard-coding. That has the following
    problem.

    struct page_ext {
    #ifdef CONFIG_A
            int a;
    #endif
    #ifdef CONFIG_B
            int b;
    #endif
    };

    Assume that the kernel is built with both CONFIG_A and CONFIG_B.
    Even if we enable feature A and don't enable feature B at runtime,
    each entry of struct page_ext takes two ints rather than one. That is
    undesirable waste, so this patchset tries to reduce it. With this
    patchset, we can save 20 bytes per page dedicated to the page owner
    feature in some configurations.

    This patch (of 6):

    We can make the code cleaner by moving the decision condition for
    set_page_guard() into set_page_guard() itself. It helps code
    readability. There is no functional change.
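
    A sketch of the resulting shape (the exact conditions folded in are
    assumptions based on the guard-page description above):
    set_page_guard() reports whether a guard page was actually set, so
    callers no longer open-code the decision.

    static inline bool set_page_guard(struct zone *zone, struct page *page,
                                      unsigned int order, int migratetype)
    {
            /* the decision conditions callers used to open-code */
            if (!debug_guardpage_enabled())
                    return false;
            if (order >= debug_guardpage_minorder())
                    return false;

            /* ... mark the page as a guard page ... */
            return true;
    }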

    Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused
    by excessive pageout activity during reclaim. Too many pages could be
    put under writeback, so LRUs would be full of unreclaimable pages
    until the IO completed, and in turn the OOM killer could be invoked.

    There have been some important changes in the reclaim path since
    then, though. Writers are throttled by balance_dirty_pages when
    initiating buffered IO, and later, under memory pressure, direct
    reclaim is throttled by wait_iff_congested if the node is considered
    congested by dirty pages on LRUs and the underlying bdi is congested
    by the queued IO. Kswapd is throttled as well if it encounters pages
    marked for immediate reclaim or under writeback, which signals that
    there are too many pages under writeback already. Finally,
    should_reclaim_retry does congestion_wait if the reclaim cannot make
    any progress and there are too many dirty/writeback pages.

    Another important aspect is that we do not issue any IO from the
    direct reclaim context anymore. Under a heavy parallel load this
    could queue a lot of IO which would be very scattered and thus
    inefficient, and would just make the problem worse.

    These three mechanisms should throttle and keep the amount of IO in a
    steady state even under heavy IO and memory pressure, so yet another
    throttling point doesn't really seem helpful. Quite the contrary:
    Mikulas Patocka has reported that swap backed by dm-crypt doesn't
    work properly, because the swapout IO cannot make sufficient progress
    as the writeout path depends on the dm_crypt worker, which has to
    allocate memory to perform the encryption. In order to guarantee
    forward progress it relies on the mempool allocator. mempool_alloc(),
    however, prefers to use the underlying (usually page) allocator
    before it grabs objects from the pool. Such an allocation can dive
    into memory reclaim and consequently into throttle_vm_writeout. If
    there are too many dirty pages or pages under writeback, it will get
    throttled even though it is in fact a flusher meant to clear pending
    pages.

    kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
    Workqueue: kcryptd kcryptd_crypt [dm_crypt]
    Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

    Let's just drop throttle_vm_writeout altogether. It is not very
    helpful anymore.

    I have tried to test a potential writeback IO runaway similar to the
    one described in the original patch which introduced it [1]: a small
    virtual machine (512MB RAM, 4 CPUs, 2G of swap space and a disk image
    on a rather slow NFS in sync mode on the host) with 8 parallel
    writers, each writing 1G worth of data. As soon as the pagecache
    fills up and direct reclaim kicks in, I start an anon memory consumer
    in a loop (allocating 300M and exiting after populating it) in the
    background to make the memory pressure even stronger and to disrupt
    the steady state of the IO. Direct reclaim is throttled because of
    the congestion, and kswapd hits congestion_wait due to nr_immediate,
    but throttle_vm_writeout never triggers the sleep throughout the
    test. Dirty+writeback stay close to nr_dirty_threshold, with some
    fluctuations caused by the anon consumer.

    [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
    Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikulas Patocka
    Cc: Marcelo Tosatti
    Cc: NeilBrown
    Cc: Ondrej Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    On x86_64, MAX_ORDER_NR_PAGES is usually 4M and a pageblock is
    usually 2M, so we only set one pageblock's migratetype in
    deferred_free_range() if the pfn is aligned to MAX_ORDER_NR_PAGES.
    That leaves blocks with an uninitialized migratetype; as "cat
    /proc/pagetypeinfo" shows, almost half the blocks end up Unmovable.

    We also missed freeing the last block in deferred_init_memmap(),
    which causes a memory leak.

    Fixes: ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
    Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Taku Izumi
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
    Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
    option") rewrote the calculation of node spanned pages. But when we
    have a movable node, the node's spanned pages are counted twice.
    That's because we have an empty Normal zone: its present pages are
    zero, but its spanned pages are not.

    e.g.
    Zone ranges:
    DMA [mem 0x0000000000001000-0x0000000000ffffff]
    DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
    Normal [mem 0x0000000100000000-0x0000007c7fffffff]
    Movable zone start for each node
    Node 1: 0x0000001080000000
    Node 2: 0x0000002080000000
    Node 3: 0x0000003080000000
    Node 4: 0x0000003c80000000
    Node 5: 0x0000004c80000000
    Node 6: 0x0000005c80000000
    Early memory node ranges
    node 0: [mem 0x0000000000001000-0x000000000009ffff]
    node 0: [mem 0x0000000000100000-0x000000007552afff]
    node 0: [mem 0x000000007bd46000-0x000000007bd46fff]
    node 0: [mem 0x000000007bdcd000-0x000000007bffffff]
    node 0: [mem 0x0000000100000000-0x000000107fffffff]
    node 1: [mem 0x0000001080000000-0x000000207fffffff]
    node 2: [mem 0x0000002080000000-0x000000307fffffff]
    node 3: [mem 0x0000003080000000-0x0000003c7fffffff]
    node 4: [mem 0x0000003c80000000-0x0000004c7fffffff]
    node 5: [mem 0x0000004c80000000-0x0000005c7fffffff]
    node 6: [mem 0x0000005c80000000-0x0000006c7fffffff]
    node 7: [mem 0x0000006c80000000-0x0000007c7fffffff]

    node1:
    Normal, start=0x1080000, present=0x0, spanned=0x1000000
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000

    After this patch, the problem is fixed.

    node1:
    Normal, start=0x0, present=0x0, spanned=0x0
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000

    Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
    Signed-off-by: Xishi Qiu
    Cc: Taku Izumi
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
    compaction_ready() is used during direct reclaim for costly-order
    allocations to skip reclaim for zones where compaction should be
    attempted instead. It combines the standard compaction_suitable()
    check with its own watermark check, based on the high watermark plus
    an extra gap, and the result is confusing at best.

    This patch attempts to better structure and document the checks
    involved. First, compaction_suitable() can determine that the
    allocation should either succeed already, or that compaction doesn't
    have enough free pages to proceed. The third possibility is that
    compaction has enough free pages, but we still decide to reclaim
    first - unless we are already above the high watermark plus gap. This
    does not mean that reclaim will actually reach this watermark in a
    single attempt; rather, it is an over-reclaim protection. So document
    the code as such. The check for compaction_deferred() is removed
    completely, as it in fact had no proper role here.

    The result after this patch is mainly less confusing code. We also
    skip some over-reclaim in cases where the allocation should already
    succeed.

    Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    __compaction_suitable() checks the low watermark plus a compact_gap()
    gap to decide if there's enough free memory to perform compaction.
    Then __isolate_free_page() uses a low watermark check to decide if a
    particular free page can be isolated. In the latter case, using the
    low watermark is needlessly pessimistic, as the free page isolations
    are only temporary. For __compaction_suitable() the higher watermark
    makes sense for high-order allocations, where more free pages
    increase the chance of success and we can typically fail with some
    order-0 fallback when the system is struggling to reach that
    watermark. But for low-order allocations, forming the page should not
    be that hard, so using the low watermark here might just prevent
    compaction from even trying, and eventually lead to the OOM killer
    even if we are above the min watermarks.

    So after this patch, we use the min watermark for non-costly orders
    in __compaction_suitable(), and for all orders in
    __isolate_free_page().
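
    The resulting check in __compaction_suitable() presumably takes this
    shape (a sketch using the standard watermark accessors):

    /* min watermark for non-costly orders, low watermark otherwise */
    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                    low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);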

    [vbabka@suse.cz: clarify __isolate_free_page() comment]
    Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    __compaction_suitable() checks the low watermark plus a compact_gap()
    gap to decide if there's enough free memory to perform compaction.
    This check uses the direct compactor's alloc_flags, but that's wrong,
    since these flags are not applicable to freepage isolation.

    For example, alloc_flags may indicate access to memory reserves,
    making compaction proceed and then fail the watermark check during
    isolation.

    A similar problem exists for ALLOC_CMA, which may be part of
    alloc_flags but is not applied during freepage isolation. In this
    case, however, it makes sense to use ALLOC_CMA in both
    __compaction_suitable() and __isolate_free_page(), since there's
    actually nothing preventing the freepage scanner from isolating from
    CMA pageblocks, on the assumption that a page that could be migrated
    once by compaction can also be migrated later by a CMA allocation.
    Thus we should count pages in CMA pageblocks when considering
    compaction suitability and when isolating freepages.

    To sum up, this patch should remove some false positives from
    __compaction_suitable(), and allow compaction to proceed when free pages
    required for compaction reside in the CMA pageblocks.

    Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Compaction uses a watermark gap of (2UL << order) pages in various
    places, and it's not immediately obvious why. Abstract it through a
    compact_gap() wrapper to create a single place with a thorough
    explanation.
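
    The wrapper itself is presumably a trivial inline; the value is
    unchanged and only the explanation is centralized (the comment
    wording here is a sketch, not the patch's):

    /*
     * compact_gap() - free pages beyond the watermark that compaction
     * needs as working space: room for both the pages being migrated
     * and the free targets they are migrated to.
     */
    static inline unsigned long compact_gap(unsigned int order)
    {
            return 2UL << order;
    }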

    [vbabka@suse.cz: clarify the comment of compact_gap()]
    Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    __compact_finished() uses the low watermark in a check that has to
    pass if direct compaction is to finish and the allocation should
    succeed. This is too pessimistic, as the allocation will typically
    use the min watermark. It may happen that during compaction we drop
    below the low watermark (due to parallel activity), but still form
    the target high-order page. By checking against the low watermark, we
    might needlessly continue compaction.

    Similarly, __compaction_suitable() uses the low watermark in a check
    of whether the allocation can succeed without compaction. Again, this
    is unnecessarily pessimistic.

    After this patch, these checks will use the direct compactor's
    alloc_flags to determine the watermark, which is effectively the min
    watermark.

    Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    During the reclaim/compaction loop, it's desirable to get a
    definitive answer from an unsuccessful compaction so we can either
    fail the allocation or invoke the OOM killer. However, heuristics
    such as deferred compaction or the pageblock skip bits can cause
    compaction to skip parts of zones or whole zones and lead to
    premature OOMs, failures or excessive reclaim/compaction retries.

    To remedy this, we introduce a new direct compaction priority called
    COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

    - ignore deferred compaction status for a zone
    - ignore pageblock skip hints
    - ignore cached scanner positions and scan the whole zone

    The new priority should eventually get picked up by
    should_compact_retry(); this should improve success rates for costly
    allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
    reduce some corner-case OOMs for non-costly allocations.
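
    For reference, the priority scale plausibly ends up like this (a
    sketch; the MIN_COMPACT_PRIORITY alias is the one mentioned in the
    tags below):

    enum compact_priority {
            COMPACT_PRIO_SYNC_FULL,         /* new: ignore all heuristics */
            MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
            COMPACT_PRIO_SYNC_LIGHT,
            COMPACT_PRIO_ASYNC,
    };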

    Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
    [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
    Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Joonsoo has reminded me that in a later patch changing watermark
    checks throughout compaction I forgot to update the checks in
    try_to_compact_pages() and kcompactd_do_work(). Closer inspection,
    however, shows that they are now redundant in the success case,
    because compact_zone() now reliably reports it with COMPACT_SUCCESS.
    So effectively the checks just repeat (a subset of) checks that have
    just passed. So instead of checking watermarks again, just test the
    return value.

    Note it's also possible that compaction would declare failure, e.g.
    because its find_suitable_fallback() is stricter than a simple
    watermark check, and the watermark check we are removing would still
    succeed. After this patch that is no longer possible, which is
    arguably better, because for long-term fragmentation avoidance we
    should rather try a different zone than allocate with an unsuitable
    fallback. If compaction of all zones fails and the allocation is
    important enough, it will retry and succeed anyway.

    Also remove the stray "bool success" variable from kcompactd_do_work().

    Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    COMPACT_PARTIAL has historically meant that compaction returned after
    doing some work without fully compacting a zone. It didn't, however,
    distinguish whether compaction terminated because it succeeded in
    creating the requested high-order page. This has changed recently,
    and now we only return COMPACT_PARTIAL when compaction thinks it
    succeeded, or when the high-order watermark check in
    compaction_suitable() passes and no compaction needs to be done.

    So at this point we can make the return value clearer by renaming it to
    COMPACT_SUCCESS. The next patch will remove some redundant tests for
    success where compaction just returned COMPACT_SUCCESS.

    Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since kswapd compaction moved to kcompactd, compact_pgdat() is not
    called anymore, so we remove it. The only caller of __compact_pgdat()
    is compact_node(), so we merge them and remove code that was only
    reachable from kswapd.

    Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "make direct compaction more deterministic")

    This is mostly a followup to Michal's oom detection rework, which
    highlighted the need for direct compaction to provide better feedback
    in the reclaim/compaction loop, so that it can reliably recognize
    when compaction cannot make further progress and the allocation
    should invoke the OOM killer or fail. We've discussed this at LSF/MM
    [1], where I proposed expanding the async/sync migration mode used in
    compaction into more general "priorities". This patchset adds one new
    priority that just overrides all the heuristics and makes compaction
    fully scan all zones. I don't currently think that we need more
    fine-grained priorities, but we'll see. Other than that, there are
    some smaller fixes and cleanups, mainly related to the THP-specific
    hacks.

    I've tested this with stress-highalloc in GFP_KERNEL order-4 and
    THP-like order-9 scenarios. There's some improvement in the
    compaction stats for order-4, which is likely due to the better
    watermark handling. In the previous version I reported mostly noise
    wrt compaction stats, and decreased direct reclaim - now the reclaim
    shows no difference. I believe this is due to the less aggressive
    compaction priority increase in patch 6.

    "before" is an mmotm tree prior to the 4.7 release plus the first
    part of the series that was sent and merged separately:

                                       before       after
    order-4:
    Compaction stalls                   27216       30759
    Compaction success                  19598       25475
    Compaction failures                  7617        5283
    Page migrate success               370510      464919
    Page migrate failure                25712       27987
    Compaction pages isolated          849601     1041581
    Compaction migrate scanned      143146541   101084990
    Compaction free scanned         208355124   144863510
    Compaction cost                      1403        1210

    order-9:
    Compaction stalls                    7311        7401
    Compaction success                   1634        1683
    Compaction failures                  5677        5718
    Page migrate success               194657      183988
    Page migrate failure                 4753        4170
    Compaction pages isolated          498790      456130
    Compaction migrate scanned         565371      524174
    Compaction free scanned           4230296     4250744
    Compaction cost                       215         203

    [1] https://lwn.net/Articles/684611/

    This patch (of 11):

    A recent patch added a whole_zone flag that compaction sets when
    scanning starts from the zone boundary, in order to report that the
    zone has been fully scanned in one attempt. For allocations that want
    to try really hard or cannot fail, we will want to introduce a mode
    where scanning the whole zone is guaranteed regardless of the cached
    positions.

    This patch reuses the whole_zone flag so that if it is already passed
    to compaction as true, the cached scanner positions are ignored.
    Employing the flag in the reclaim/compaction loop will be done in the
    next patch. This patch, however, converts compaction invoked from
    userspace via procfs to use the flag. Before this patch, the cached
    positions were first reset to zone boundaries and then read back from
    struct zone, so there was a window where a parallel compaction could
    replace the reset values, making the manual compaction less
    effective. Using the flag instead of performing the reset is more
    robust.
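
    A sketch of the manual-compaction setup under this scheme (the field
    names assume struct compact_control as used elsewhere in this
    series):

    struct compact_control cc = {
            .order = -1,            /* compact fully, no target order */
            .mode = MIGRATE_SYNC,
            .ignore_skip_hint = true,
            .whole_zone = true,     /* don't read cached positions */
    };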

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Attempt to demystify the task_will_free_mem() loop.

    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    fls_long() causes a doubled alignment requirement in
    __get_vm_area_node() if the size parameter is a power of 2 and
    VM_IOREMAP is set in the flags parameter; for example,
    size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000.

    get_count_order_long() is implemented and can be used instead of
    fls_long() to fix the bug; for example, size=0x10000 ->
    get_count_order_long(0x10000)=16 -> align=0x10000.
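
    A sketch of get_count_order_long() consistent with the two examples
    above: it returns the exact order for powers of two and rounds up
    otherwise.

    static inline int get_count_order_long(unsigned long l)
    {
            if (l == 0UL)
                    return -1;
            else if (l & (l - 1UL))         /* not a power of two */
                    return (int)fls_long(l);
            else                            /* power of two: exact */
                    return (int)fls_long(l) - 1;
    }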

    [akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
    [zijun_hu@zoho.com: fixes]
    Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
    [akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
    [akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
    [zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
    Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
    Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
    Signed-off-by: zijun_hu
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: David Rientjes
    Signed-off-by: zijun_hu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
    When selecting an oom victim, we use the same heuristic for both
    memory cgroup and global oom. The only difference is the scope of
    tasks from which to select the victim. So we could just export an
    iterator over all memcg tasks and keep all oom-related logic in
    oom_kill.c, but instead we duplicate pieces of it in memcontrol.c,
    reusing some initially private functions of oom_kill.c in order to
    not duplicate all of it. That looks ugly and error-prone, because any
    modification of select_bad_process must also be propagated to
    mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom-heuristic-related code
    private to oom_kill.c and make oom_kill.c use exported memcg
    functions when it's really necessary (like in the case of iterating
    over memcg tasks).

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    The extern struct variable ocfs2_inode_cache is not defined. It was
    meant to be ocfs2_inode_cachep, defined in super.c, I think.
    Fortunately it is not used anywhere now, so there is no actual
    impact. Clean it up to fix this mistake.

    Link: http://lkml.kernel.org/r/57E1E49D.8050503@huawei.com
    Signed-off-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work
    and thus it doesn't require execution ordering. Hence, alloc_workqueue
    has been used to replace the deprecated create_singlethread_workqueue
    instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure.

    Since there are fixed number of work items, explicit concurrency
    limit is unnecessary here.
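
    The conversion presumably boils down to a one-line change of this
    shape (the "dlm_wq" name string is an assumption):

    /* was: dlm->dlm_worker = create_singlethread_workqueue(wq_name); */
    dlm->dlm_worker = alloc_workqueue("dlm_wq", WQ_MEM_RECLAIM, 0);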

    Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "ocfs2_wq" queues multiple work items viz
    &osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
    &osb->osb_truncate_log_wq which require strict execution ordering. Hence,
    an ordered dedicated workqueue has been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under memory
    pressure because the workqueue is being used on a memory reclaim path.
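
    Here the ordered variant is presumably used instead (again, the name
    string is an assumption):

    osb->ocfs2_wq = alloc_ordered_workqueue("ocfs2_wq", WQ_MEM_RECLAIM);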

    Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "o2net_wq" queues multiple work items viz
    &old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which
    require strict execution ordering. Hence, an ordered dedicated
    workqueue has been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under memory
    pressure.

    Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
  • The workqueue "user_dlm_worker" queues a single work item
    &lockres->l_work per user_lock_res instance and so it doesn't require
    execution ordering. Hence, alloc_workqueue has been used to replace the
    deprecated create_singlethread_workqueue instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure.

    Since there is a fixed number of work items, an explicit concurrency
    limit is unnecessary here.

    Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar
    Acked-by: Tejun Heo
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bhaktipriya Shridhar
     
    Though the time_before and time_after family of functions was nicely
    extended to support jiffies64, so that the interface would be
    consistent, the before/after-jiffies functions were never extended in
    the same way. This commit brings the interface to parity between
    jiffies and jiffies64, which is quite convenient.
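
    Presumably the new helpers mirror the existing jiffies ones, along
    these lines (a sketch following the time_is_* naming convention):

    #define time_is_before_jiffies64(a) time_after64(get_jiffies_64(), a)
    #define time_is_after_jiffies64(a)  time_before64(get_jiffies_64(), a)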

    Link: http://lkml.kernel.org/r/20160929033319.12188-1-Jason@zx2c4.com
    Signed-off-by: Jason A. Donenfeld
    Cc: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Linus Torvalds

    Jason A. Donenfeld
     
    Use the assert_spin_locked() macro instead of hand-made BUG_ON
    statements.
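
    That is, roughly (obj->lock is a hypothetical spinlock field used
    for illustration):

    /* hand-made assertion: */
    BUG_ON(!spin_is_locked(&obj->lock));
    /* becomes: */
    assert_spin_locked(&obj->lock);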

    Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Suggested-by: Heiner Kallweit
    Reviewed-by: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    When freeing permission events via fsnotify_destroy_event(), the
    warning WARN_ON(!list_empty(&event->list)); may falsely trigger.

    This is because although fanotify_get_response() saw event->response
    set, there is nothing to make sure the current CPU also sees the
    removal of the event from the list. Add proper locking around the
    WARN_ON() to avoid the false warning.

    Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz
    Reported-by: Miklos Szeredi
    Signed-off-by: Jan Kara
    Reviewed-by: Lino Sanfilippo
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    Fanotify code has its own lock (access_lock) to protect the list of
    events waiting for a response from userspace.

    However, this is somewhat awkward, as the same list_head in the event
    is protected by notification_lock when it is part of the notification
    queue and by access_lock when it is part of the fanotify private
    queue, which makes reliable checks in the generic code difficult. So
    make fanotify use the same lock - notification_lock - for protecting
    its private event list.

    Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Lino Sanfilippo
    Cc: Miklos Szeredi
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • notification_mutex is used to protect the list of pending events. As such
    there's no reason to use a sleeping lock for it. Convert it to a
    spinlock.

    [jack@suse.cz: fixed version]
    Link: http://lkml.kernel.org/r/1474031567-1831-1-git-send-email-jack@suse.cz
    Link: http://lkml.kernel.org/r/1473797711-14111-5-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Lino Sanfilippo
    Tested-by: Guenter Roeck
    Cc: Miklos Szeredi
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    fsnotify_flush_notify() and fanotify_release() destroy notification
    events while holding notification_mutex.

    Destroying a fanotify event includes a path_put() call, which may end
    up calling into a filesystem to delete an inode if we happen to be
    the last holder of a dentry reference which happens to be the last
    holder of an inode reference.

    That in turn may violate lock ordering for some filesystems, since
    notification_mutex is also acquired e.g. during write when generating
    a fanotify event.

    Also, this is the only thing that forces notification_mutex to be a
    sleeping lock. So drop notification_mutex before destroying a
    notification event.

    Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Miklos Szeredi
    Cc: Lino Sanfilippo
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Pull i2c updates from Wolfram Sang:
    "Here is the 4.9 pull request from I2C including:

    - centralized error messages when registering to the core
    - improved lockdep annotations to prevent false positives
    - DT support for muxes, gates, and arbitrators
    - bus speeds can now be obtained from ACPI
    - i2c-octeon got refactored and now supports ThunderX SoCs, too
    - i2c-tegra and i2c-designware got a bigger bunch of updates
    - a couple of standard driver fixes and improvements"

    * 'i2c/for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (71 commits)
    i2c: axxia: disable clks in case of failure in probe
    i2c: octeon: thunderx: Limit register access retries
    i2c: uniphier-f: fix misdetection of incomplete STOP condition
    gpio: pca953x: variable 'id' was used twice
    i2c: i801: Add support for Kaby Lake PCH-H
    gpio: pca953x: fix an incorrect lockdep warning
    i2c: add a warning to i2c_adapter_depth()
    lockdep: make MAX_LOCKDEP_SUBCLASSES unconditionally visible
    i2c: export i2c_adapter_depth()
    i2c: rk3x: Fix variable 'min_total_ns' unused warning
    i2c: rk3x: Fix sparse warning
    i2c / ACPI: Do not touch an I2C device if it belongs to another adapter
    i2c: octeon: Fix high-level controller status check
    i2c: octeon: Avoid sending STOP during recovery
    i2c: octeon: Fix set SCL recovery function
    i2c: rcar: add support for r8a7796 (R-Car M3-W)
    i2c: imx: make bus recovery through pinctrl optional
    i2c: meson: add gxbb compatible string
    i2c: uniphier-f: set the adapter to master mode when probing
    i2c: uniphier-f: avoid WARN_ON() of clk_disable() in failure path
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "The usual rocket science from the trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    tracing/syscalls: fix multiline in error message text
    lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
    doc: vfs: fix fadvise() sycall name
    x86/entry: spell EBX register correctly in documentation
    securityfs: fix securityfs_create_dir comment
    irq: Fix typo in tracepoint.xml

    Linus Torvalds
     
  • Pull livepatching updates from Jiri Kosina:

    - fix for patching modules that contain .altinstructions or
    .parainstructions sections, from Jessica Yu

    - make TAINT_LIVEPATCH a per-module flag (so that it's immediately
    clear which module caused the taint), from Josh Poimboeuf

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch/module: make TAINT_LIVEPATCH module-specific
    Documentation: livepatch: add section about arch-specific code
    livepatch/x86: apply alternatives and paravirt patches after relocations
    livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks

    Linus Torvalds
     
  • Pull HID updates from Jiri Kosina:

    - Integrated Sensor Hub support (Cherrytrail+) from Srinivas Pandruvada

    - Big cleanup of the Wacom driver; namely it's now using devres and
    the standardized LED API so that libinput doesn't need to have root
    access any more, with a substantial amount of other cleanups
    piggy-backing on top. All this from Benjamin Tissoires

    - Report descriptor parsing would now ignore any out-of-range System
    controls in case of the application actually being System Control.
    This fixes quite a few issues with several devices, and allows us to
    remove a few ->report_fixup callbacks. From Benjamin Tissoires

    - ... a lot of other assorted small fixes and device ID additions

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (76 commits)
    HID: add missing \n to end of dev_warn messages
    HID: alps: fix multitouch cursor issue
    HID: hid-logitech: Documentation updates/corrections
    HID: hid-logitech: Improve Wingman Formula Force GP support
    HID: hid-logitech: Rewrite of descriptor for all DF wheels
    HID: hid-logitech: Compute combined pedals value
    HID: hid-logitech: Add combined pedal support Logitech wheels
    HID: hid-logitech: Introduce control for combined pedals feature
    HID: sony: Update copyright and add Dualshock 4 rate control note
    HID: sony: Defer the initial USB Sixaxis output report
    HID: sony: Relax duplicate checking for USB-only devices
    Revert "HID: microsoft: fix invalid rdesc for 3k kbd"
    HID: alps: fix error return code in alps_input_configured()
    HID: alps: fix stick device not working after resume
    HID: support for keyboard - Corsair STRAFE
    HID: alps: Fix memory leak
    HID: uclogic: Add support for UC-Logic TWHA60 v3
    HID: uclogic: Override constant descriptors
    HID: uclogic: Support UGTizer GP0610 partially
    HID: uclogic: Add support for several more tablets
    ...

    Linus Torvalds
     
  • Pull PCI updates from Bjorn Helgaas:
    "Summary of PCI changes for the v4.9 merge window:

    Enumeration:
    - microblaze: Add multidomain support for procfs (Bharat Kumar Gogada)

    Resource management:
    - Ignore requested alignment for PROBE_ONLY and fixed resources (Yongji Xie)
    - Ignore requested alignment for VF BARs (Yongji Xie)

    PCI device hotplug:
    - Make core explicitly non-modular (Paul Gortmaker)

    PCIe native device hotplug:
    - Rename pcie_isr() locals for clarity (Bjorn Helgaas)
    - Return IRQ_NONE when we can't read interrupt status (Bjorn Helgaas)
    - Remove unnecessary guard (Bjorn Helgaas)
    - Clean up dmesg "Slot(%s)" messages (Bjorn Helgaas)
    - Remove useless pciehp_get_latch_status() calls (Bjorn Helgaas)
    - Clear attention LED on device add (Keith Busch)
    - Allow exclusive userspace control of indicators (Keith Busch)
    - Process all hotplug events before looking for new ones (Mayurkumar Patel)
    - Don't re-read Slot Status when queuing hotplug event (Mayurkumar Patel)
    - Don't re-read Slot Status when handling surprise event (Mayurkumar Patel)
    - Make explicitly non-modular (Paul Gortmaker)

    Power management:
    - Afford direct-complete to devices with non-standard PM (Lukas Wunner)
    - Query platform firmware for device power state (Lukas Wunner)
    - Recognize D3cold in pci_update_current_state() (Lukas Wunner)
    - Avoid unnecessary resume after direct-complete (Lukas Wunner)
    - Make explicitly non-modular (Paul Gortmaker)

    Virtualization:
    - Mark Atheros AR9580 to avoid bus reset (Maik Broemme)
    - Check for pci_setup_device() failure in pci_iov_add_virtfn() (Po Liu)

    MSI:
    - Enable PCI_MSI_IRQ_DOMAIN support for ARC (Joao Pinto)

    AER:
    - Remove aerdriver.nosourceid kernel parameter (Bjorn Helgaas)
    - Remove aerdriver.forceload kernel parameter (Bjorn Helgaas)
    - Fix aer_probe() kernel-doc comment (Cao jin)
    - Add bus flag to skip source ID matching (Jon Derrick)
    - Avoid memory allocation in interrupt handling path (Jon Derrick)
    - Cache capability position (Keith Busch)
    - Make explicitly non-modular (Paul Gortmaker)
    - Remove duplicate AER severity translation (Tyler Baicar)
    - Send correct severity to calculate AER severity (Tyler Baicar)

    Precision Time Measurement:
    - Add Precision Time Measurement (PTM) support (Jonathan Yong)
    - Add PTM clock granularity information (Bjorn Helgaas)
    - Add pci_enable_ptm() for drivers to enable PTM on endpoints (Bjorn Helgaas)

    Generic host bridge driver:
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
    - Make explicitly non-modular (Paul Gortmaker)

    Altera host bridge driver:
    - Remove redundant platform_get_resource() return value check (Bjorn Helgaas)
    - Poll for link training status after retraining the link (Ley Foon Tan)
    - Rework config accessors for use without a struct pci_bus (Ley Foon Tan)
    - Move retrain from fixup to altera_pcie_host_init() (Ley Foon Tan)
    - Make MSI explicitly non-modular (Paul Gortmaker)
    - Make explicitly non-modular (Paul Gortmaker)
    - Relax device number checking to allow SR-IOV (Po Liu)

    ARM Versatile host bridge driver:
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)

    Axis ARTPEC-6 host bridge driver:
    - Drop __init from artpec6_add_pcie_port() (Niklas Cassel)

    Freescale i.MX6 host bridge driver:
    - Make explicitly non-modular (Paul Gortmaker)

    Intel VMD host bridge driver:
    - Add quirk for AER to ignore source ID (Jon Derrick)
    - Allocate IRQ lists with correct MSI-X count (Jon Derrick)
    - Convert to use pci_alloc_irq_vectors() API (Jon Derrick)
    - Eliminate vmd_vector member from list type (Jon Derrick)
    - Eliminate index member from IRQ list (Jon Derrick)
    - Synchronize with RCU freeing MSI IRQ descs (Keith Busch)
    - Request userspace control of PCIe hotplug indicators (Keith Busch)
    - Move VMD driver to drivers/pci/host (Keith Busch)

    Marvell Aardvark host bridge driver:
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
    - Remove redundant dev_err call in advk_pcie_probe() (Wei Yongjun)

    Microsoft Hyper-V host bridge driver:
    - Use zero-length array in struct pci_packet (Dexuan Cui)
    - Use pci_function_description[0] in struct definitions (Dexuan Cui)
    - Remove the unused 'wrk' in struct hv_pcibus_device (Dexuan Cui)
    - Handle vmbus_sendpacket() failure in hv_compose_msi_msg() (Dexuan Cui)
    - Handle hv_pci_generic_compl() error case (Dexuan Cui)
    - Use list_move_tail() instead of list_del() + list_add_tail() (Wei Yongjun)

    NVIDIA Tegra host bridge driver:
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
    - Remove redundant _data suffix (Thierry Reding)
    - Use of_device_get_match_data() (Thierry Reding)

    Qualcomm host bridge driver:
    - Make explicitly non-modular (Paul Gortmaker)

    Renesas R-Car host bridge driver:
    - Consolidate register space lookup and ioremap (Bjorn Helgaas)
    - Don't disable/unprepare clocks on prepare/enable failure (Geert Uytterhoeven)
    - Add multi-MSI support (Grigory Kletsko)
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
    - Fix some checkpatch warnings (Sergei Shtylyov)
    - Try increasing PCIe link speed to 5 GT/s at boot (Sergei Shtylyov)

    Rockchip host bridge driver:
    - Add DT bindings for Rockchip PCIe controller (Shawn Lin)
    - Add Rockchip PCIe controller support (Shawn Lin)
    - Improve the deassert sequence of four reset pins (Shawn Lin)
    - Fix wrong transmitted FTS count (Shawn Lin)
    - Increase the Max Credit update interval (Rajat Jain)

    Samsung Exynos host bridge driver:
    - Make explicitly non-modular (Paul Gortmaker)

    ST Microelectronics SPEAr13xx host bridge driver:
    - Make explicitly non-modular (Paul Gortmaker)

    Synopsys DesignWare host bridge driver:
    - Return data directly from dw_pcie_readl_rc() (Bjorn Helgaas)
    - Exchange viewport of `MEMORYs' and `CFGs/IOs' (Dong Bo)
    - Check LTSSM training bit before deciding link is up (Jisheng Zhang)
    - Move link wait definitions to .c file (Joao Pinto)
    - Wait for iATU enable (Joao Pinto)
    - Add iATU Unroll feature (Joao Pinto)
    - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
    - Make explicitly non-modular (Paul Gortmaker)
    - Relax device number checking to allow SR-IOV (Po Liu)
    - Keep viewport fixed for IO transaction if num_viewport > 2 (Pratyush Anand)
    - Remove redundant platform_get_resource() return value check (Wei Yongjun)

    TI DRA7xx host bridge driver:
    - Make explicitly non-modular (Paul Gortmaker)

    TI Keystone host bridge driver:
    - Propagate request_irq() failure (Wei Yongjun)

    Xilinx AXI host bridge driver:
    - Keep both legacy and MSI interrupt domain references (Bharat Kumar Gogada)
    - Clear interrupt register for invalid interrupt (Bharat Kumar Gogada)
    - Clear correct MSI set bit (Bharat Kumar Gogada)
    - Dispose of MSI virtual IRQ (Bharat Kumar Gogada)
    - Make explicitly non-modular (Paul Gortmaker)
    - Relax device number checking to allow SR-IOV (Po Liu)

    Xilinx NWL host bridge driver:
    - Expand error logging (Bharat Kumar Gogada)
    - Enable all MSI interrupts using MSI mask (Bharat Kumar Gogada)
    - Make explicitly non-modular (Paul Gortmaker)

    Miscellaneous:
    - Drop CONFIG_KEXEC_CORE ifdeffery (Lukas Wunner)
    - portdrv: Make explicitly non-modular (Paul Gortmaker)
    - Make DPC explicitly non-modular (Paul Gortmaker)"

    * tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (105 commits)
    x86/PCI: VMD: Move VMD driver to drivers/pci/host
    PCI: rockchip: Fix wrong transmitted FTS count
    PCI: rockchip: Improve the deassert sequence of four reset pins
    PCI: rockchip: Increase the Max Credit update interval
    PCI: rcar: Try increasing PCIe link speed to 5 GT/s at boot
    PCI/AER: Fix aer_probe() kernel-doc comment
    PCI: Ignore requested alignment for VF BARs
    PCI: Ignore requested alignment for PROBE_ONLY and fixed resources
    PCI: Avoid unnecessary resume after direct-complete
    PCI: Recognize D3cold in pci_update_current_state()
    PCI: Query platform firmware for device power state
    PCI: Afford direct-complete to devices with non-standard PM
    PCI/AER: Cache capability position
    PCI/AER: Avoid memory allocation in interrupt handling path
    x86/PCI: VMD: Request userspace control of PCIe hotplug indicators
    PCI: pciehp: Allow exclusive userspace control of indicators
    ACPI / APEI: Send correct severity to calculate AER severity
    PCI/AER: Remove duplicate AER severity translation
    x86/PCI: VMD: Synchronize with RCU freeing MSI IRQ descs
    x86/PCI: VMD: Eliminate index member from IRQ list
    ...

    Linus Torvalds
     
  • Pull VFIO updates from Alex Williamson:
    - comment fixes (Wei Jiangang)
    - static symbols (Baoyou Xie)
    - FLR virtualization (Alex Williamson)
    - catching INTx enabling after MSI/X teardown (Alex Williamson)
    - update to pci_alloc_irq_vectors helpers (Christoph Hellwig)

    * tag 'vfio-v4.9-rc1' of git://github.com/awilliam/linux-vfio:
    vfio_pci: use pci_alloc_irq_vectors
    vfio-pci: Disable INTx after MSI/X teardown
    vfio-pci: Virtualize PCIe & AF FLR
    vfio: platform: mark symbols static where possible
    vfio/pci: Fix typos in comments

    Linus Torvalds
     
  • Pull MD updates from Shaohua Li:
    "This update includes:

    - new AVX512 instruction based raid6 gen/recovery algorithm

    - a couple of md-cluster related bug fixes

    - fix a potential deadlock

    - set nonrotational bit for raid array with SSD

    - set correct max_hw_sectors for raid5/6, which hopefully can improve
    performance a little bit

    - other minor fixes"

    * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
    md: set rotational bit
    raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
    raid5: handle register_shrinker failure
    raid5: fix to detect failure of register_shrinker
    md: fix a potential deadlock
    md/bitmap: fix wrong cleanup
    raid5: allow arbitrary max_hw_sectors
    lib/raid6: Add AVX512 optimized xor_syndrome functions
    lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
    lib/raid6: Add AVX512 optimized recovery functions
    lib/raid6: Add AVX512 optimized gen_syndrome functions
    md-cluster: make resync lock also could be interruptted
    md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
    md-cluster: convert the completion to wait queue
    md-cluster: protect md_find_rdev_nr_rcu with rcu lock
    md-cluster: clean related infos of cluster
    md: changes for MD_STILL_CLOSED flag
    md-cluster: remove some unnecessary dlm_unlock_sync
    md-cluster: use FORCEUNLOCK in lockres_free
    md-cluster: call md_kick_rdev_from_array once ack failed

    Linus Torvalds