21 May, 2016

20 commits

    wait_iff_congested has been used to throttle the allocator before it
    retried another round of direct reclaim, to allow the writeback to make
    some progress and prevent reclaim from looping over dirty/writeback
    pages without making any progress.

    We used to do congestion_wait before commit 0e093d99763e ("writeback: do
    not sleep on the congestion queue if there are no congested BDIs or if
    significant congestion is not being encountered in the current zone")
    but that led to undesirable stalls and sleeping for the full timeout
    even when the BDI wasn't congested. Hence wait_iff_congested was used
    instead.

    But it seems that even wait_iff_congested doesn't work as expected. We
    might have a small file LRU list with all pages dirty/writeback while
    the BDI is not congested, so this ends up being just a cond_resched and
    can trigger a premature OOM.

    This patch replaces the unconditional wait_iff_congested with a
    congestion_wait which is executed only if we _know_ that the last round
    of direct reclaim didn't make any progress and dirty+writeback pages
    make up more than half of the reclaimable pages in the zone which might
    be usable for our target allocation. This shouldn't reintroduce the
    stalls fixed by 0e093d99763e because congestion_wait is called only
    when we are getting hopeless and sleeping is a better choice than OOM
    with many pages under IO.

    We have to preserve the logic introduced by commit 373ccbe59270 ("mm,
    vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
    progress") in __alloc_pages_slowpath now that wait_iff_congested is not
    used there anymore. As the only remaining user of wait_iff_congested is
    shrink_inactive_list, we can remove the WQ-specific short sleep from
    wait_iff_congested because that sleep needs to be done only once in
    the allocation retry cycle.
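
    Below is a minimal, illustrative sketch of the throttling decision
    described above (the structure and the helper below are stand-ins, not
    the exact mm/page_alloc.c code): sleep on congestion_wait only when the
    last reclaim round made no progress and more than half of the
    reclaimable pages are dirty or under writeback.

    #include <stdbool.h>

    struct zone_counters {                /* stand-in for zone vmstat counters */
            unsigned long reclaimable;    /* pages reclaim could free */
            unsigned long dirty;          /* NR_FILE_DIRTY analogue */
            unsigned long writeback;      /* NR_WRITEBACK analogue */
    };

    static bool should_throttle(const struct zone_counters *z,
                                unsigned long did_some_progress)
    {
            if (did_some_progress)
                    return false;   /* reclaim is still making headway */

            /* too much memory stuck behind IO: waiting beats a premature OOM */
            return 2 * (z->dirty + z->writeback) > z->reclaimable;
    }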

    [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
    Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_slowpath has traditionally relied on direct reclaim and
    its did_some_progress result as an indicator that it makes sense to
    retry the allocation rather than declaring OOM. shrink_zones had to
    rely on zone_reclaimable if shrink_zone didn't make any progress to
    prevent a premature OOM killer invocation - the LRU might be full of
    dirty or writeback pages and direct reclaim cannot clean those up.

    zone_reclaimable allows rescanning the reclaimable lists several times
    and restarting if a page is freed. This is really subtle behavior and
    it might lead to a livelock when a single freed page keeps the
    allocator looping but the current task is not able to allocate that
    single page. The OOM killer would be more appropriate than looping
    without any progress for an unbounded amount of time.

    This patch changes the OOM detection logic and pulls it out of
    shrink_zone, which is too low a level to be appropriate for any high
    level decision such as OOM, which is a per-zonelist property. It is
    __alloc_pages_slowpath which knows how many attempts have been made and
    what progress they achieved so far, therefore it is the more
    appropriate place to implement this logic.

    The new heuristic is implemented in should_reclaim_retry helper called
    from __alloc_pages_slowpath. It tries to be more deterministic and
    easier to follow. It builds on an assumption that retrying makes sense
    only if the currently reclaimable memory + free pages would allow the
    current allocation request to succeed (as per __zone_watermark_ok) at
    least for one zone in the usable zonelist.

    This alone wouldn't be sufficient, though, because the writeback might
    get stuck and reclaimable pages might be pinned for a really long time
    or even depend on the current allocation context. Therefore there is a
    backoff mechanism implemented which reduces the reclaim target after
    each reclaim round without any progress. This means that we should
    eventually converge to only NR_FREE_PAGES as the target and fail on the
    wmark check and proceed to OOM. The backoff is simple and linear,
    dropping 1/16 of the reclaimable pages for each round without any
    progress. We are optimistic and reset the counter after successful
    reclaim rounds.
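
    A minimal sketch of the backoff described above, with illustrative
    names rather than the exact should_reclaim_retry() code: each round
    without progress shaves another 1/16 off the reclaimable estimate
    until only free pages are left to satisfy the watermark check.

    #include <stdbool.h>

    static bool should_retry_sketch(unsigned long free, unsigned long reclaimable,
                                    unsigned long watermark, int no_progress_loops)
    {
            unsigned long target = reclaimable;

            if (no_progress_loops >= 16)
                    return false;   /* fully backed off: give up and go OOM */

            /* back off linearly: drop 1/16 of the reclaimable pages per
             * unproductive round, converging towards free pages only */
            target -= target * no_progress_loops / 16;

            /* retry only while the optimistic estimate can pass the wmark */
            return free + target > watermark;
    }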

    Costly high order pages mostly preserve their semantics: those without
    __GFP_REPEAT fail right away while those which have the flag set will
    back off after the amount of reclaimable pages reaches the equivalent
    of the requested order. The only difference is that if there was no
    progress during the reclaim we rely on the zone watermark check. This
    is a more logical thing to do than the previous 1<<order retries.

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Compaction can provide a wide variety of feedback to the caller. Much
    of it is implementation specific and the caller of the compaction
    (especially the page allocator) shouldn't be bound to the specifics of
    the current implementation.

    This patch abstracts the feedback into three basic types (a usage
    sketch follows the list):
    - compaction_made_progress - compaction was active and made some
      progress.
    - compaction_failed - compaction failed and further attempts to
      invoke it would most probably fail and therefore it is not
      worth retrying
    - compaction_withdrawn - compaction wasn't invoked for
      implementation specific reasons. In the current implementation
      it means that the compaction was deferred, contended or the
      page scanners met too early without any progress. Retrying is
      still worthwhile.
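
    Here is a small hedged sketch of how a caller might consume the three
    categories through the helpers named above (it assumes the
    compaction_made_progress/compaction_failed/compaction_withdrawn
    predicates from <linux/compaction.h> introduced by this patch; the
    retry policy itself is purely illustrative, not the page allocator's
    actual logic):

    /* illustrative classification of compaction feedback for a caller */
    enum compact_outcome { RETRY_COMPACTION, RETRY_RECLAIM, GIVE_UP };

    static enum compact_outcome classify(enum compact_result result)
    {
            if (compaction_made_progress(result))
                    return RETRY_COMPACTION;   /* it is working, keep going */
            if (compaction_withdrawn(result))
                    return RETRY_RECLAIM;      /* deferred/contended: try later */
            /* compaction_failed(result): further attempts are pointless */
            return GIVE_UP;
    }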

    [vbabka@suse.cz: do not change thp back off behavior]
    [akpm@linux-foundation.org: fix typo in comment, per Hillf]
    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_direct_compact communicates potential back off by two
    variables:
    - deferred_compaction tells that the compaction returned
    COMPACT_DEFERRED
    - contended_compaction is set when there is contention on the
      zone->lock or zone->lru_lock locks

    __alloc_pages_slowpath then backs off for THP allocation requests to
    prevent long stalls. This is rather messy and it would be much
    cleaner to return a single compact result value and hide all the nasty
    details into __alloc_pages_direct_compact.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • compaction_result will be used as the primary feedback channel for
    compaction users. At the same time try_to_compact_pages (and
    potentially others) assume a certain ordering where a more specific
    feedback takes precedence.

    This gets a bit awkward when we have conflicting feedback from
    different zones, e.g. one zone returning COMPACT_COMPLETE, meaning the
    full zone has been scanned without any outcome, while another returns
    COMPACT_PARTIAL, i.e. it made some progress. The caller should get
    COMPACT_PARTIAL because that
    means that the compaction still can make some progress. The same
    applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED.

    Reorder COMPACT_PARTIAL to be the largest value so that the larger the
    value, the more progress has been made (see the sketch below).
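
    A short illustration of why this ordering helps (the enum values below
    are stand-ins ordered per the description, not copied from the real
    header): combining per-zone feedback then becomes a simple max().

    enum compact_result_sketch {
            SKETCH_SKIPPED,         /* nothing was even attempted */
            SKETCH_COMPLETE,        /* whole zone scanned, no outcome */
            SKETCH_PARTIAL,         /* some progress: the best feedback */
    };

    static enum compact_result_sketch
    combine(enum compact_result_sketch a, enum compact_result_sketch b)
    {
            /* the most optimistic zone wins, e.g. PARTIAL beats COMPLETE */
            return a > b ? a : b;
    }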

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • COMPACT_COMPLETE now means that the migration and free scanners met.
    This is not very useful information if somebody just wants to use this
    feedback and make decisions based on it. The current caller might be a
    poor guy who just happened to scan a tiny portion of the zone and that
    could be the reason no suitable pages were compacted. Make sure we
    distinguish the full and partial zone walks.

    Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
    and be optimistic in retrying.

    The existing users of COMPACT_COMPLETE are conservatively changed to
    use COMPACT_PARTIAL_SKIPPED as well, but some of them should probably
    be reconsidered and defer the compaction only for COMPACT_COMPLETE
    with the new semantics.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • try_to_compact_pages() can currently return COMPACT_SKIPPED even when
    compaction is deferred for some zone, just because zone DMA is skipped
    in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
    basically unusable for the page allocator as a feedback mechanism.

    Make sure we distinguish those two states properly and switch their
    ordering in the enum. This means that COMPACT_SKIPPED will be returned
    only when all eligible zones are skipped.

    As a result, COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
    will be more precise and we will bail out rather than reclaim.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The compiler is complaining after "mm, compaction: change COMPACT_
    constants into enum"

    mm/compaction.c: In function `compact_zone':
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
    switch (ret) {
    ^
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
    mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]

    compaction_suitable is allowed to return only COMPACT_PARTIAL,
    COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
    impossible. Put a VM_BUG_ON to catch an impossible return value.
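
    The shape of such a fix might look like the sketch below (illustrative
    only, not the exact compact_zone() hunk; it assumes the kernel's
    enum compact_result values and the VM_BUG_ON() macro): handle the three
    values the contract allows and assert on anything else.

    static enum compact_result handle_suitable(enum compact_result ret)
    {
            switch (ret) {
            case COMPACT_PARTIAL:   /* enough memory is free already */
            case COMPACT_SKIPPED:   /* not enough order-0 pages to compact */
                    return ret;
            case COMPACT_CONTINUE:  /* go ahead with the actual compaction */
                    break;
            default:
                    VM_BUG_ON(1);   /* the contract above rules this out */
            }
            return COMPACT_CONTINUE;
    }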

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Compaction code is doing weird dances between COMPACT_FOO -> int ->
    unsigned long

    But there doesn't seem to be any reason for that. None of the functions
    which return/use one of those constants expects any other value, so it
    really makes sense to define an enum for them and make it clear that no
    other values are expected.

    This is a pure cleanup and shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Motivation:
    As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
    communicate the reclaim progress is rather dubious. I tend to agree:
    not only is it really obscure, it is also not hard to imagine cases
    where a single page freed in the loop keeps all the reclaimers looping
    without making any progress because their gfp_mask wouldn't allow them
    to get that page anyway (e.g. a single GFP_ATOMIC alloc and free loop).
    This is rather rare so it doesn't happen in practice, but the current
    logic is rather obscure, hard to follow and also non-deterministic.

    This is an attempt to make the OOM detection more deterministic and
    easier to follow because each reclaimer basically tracks its own
    progress, which is implemented at the page allocator layer rather than
    spread out between the allocator and the reclaim code. More details on
    the implementation are in the first patch.

    I have tested several different scenarios but it should be clear that
    testing the OOM killer in a representative way is quite hard. There is
    usually a tiny gap between almost OOM and full blown OOM which is often
    time sensitive. Anyway, I have tested the following 2 scenarios and I
    would appreciate suggestions for more to test.

    Testing environment: a virtual machine with 2G of RAM and 2 CPUs
    without any swap to make the OOM more deterministic.

    1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
    file size, removing the files and starting over again) running in
    parallel for 10s to build up a lot of dirty pages, then 100 parallel
    mem_eaters (anon private populated mmap which waits until it gets a
    signal) with 80M each are started.

    This causes an OOM flood of course and I have compared both patched
    and unpatched kernels. The test is considered finished after there
    are no OOM conditions detected. This should tell us whether there are
    any excessive kills or whether some of them were premature (e.g. due
    to dirty pages):

    I have performed two runs this time each after a fresh boot.

    * base kernel
    $ grep "Out of memory:" base-oom-run1.log | wc -l
    78
    $ grep "Out of memory:" base-oom-run2.log | wc -l
    78

    $ grep "Kill process" base-oom-run1.log | tail -n1
    [ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
    $ grep "Kill process" base-oom-run2.log | tail -n1
    [ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

    $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
    min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
    $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
    min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

    $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
    1
    $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
    3

    * patched kernel
    $ grep "Out of memory:" patched-oom-run1.log | wc -l
    78
    $ grep "Out of memory:" patched-oom-run2.log | wc -l
    77

    $ grep "Kill process" patched-oom-run1.log | tail -n1
    [ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
    $ grep "Kill process" patched-oom-run2.log | tail -n1
    [ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

    $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
    min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
    $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
    min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

    $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
    2
    $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
    3

    The patched kernel ran noticeably longer while invoking the OOM killer
    the same number of times. This means that the original implementation
    is much more aggressive and triggers the OOM killer sooner. The free
    pages stats show that neither kernel went OOM too early most of the
    time, though. I guess the difference is in the backoff: retries without
    any progress sleep for a while if there is memory under writeback or
    dirty, which is highly likely considering the parallel IO.
    Both kernels have seen races where the zone wasn't marked
    unreclaimable and we still hit the OOM killer. This is most likely a
    race where a task managed to exit between the last allocation attempt
    and the OOM killer invocation.

    2) 2 writers again, running for 10s, and then 10 mem_eaters to consume
    as much memory as possible without triggering the OOM killer. This
    required a lot of tuning but I've considered 3 consecutive runs in
    three different boots without OOM a success.

    * base kernel
    size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

    * patched kernel
    size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

    That means 40M more memory was usable without triggering the OOM killer
    (the per-eater headroom subtracted from MemFree/10 dropped from 16M to
    12M, i.e. 4M more for each of the 10 mem_eaters). The base kernel
    sometimes managed to handle the same size as the patched one but it
    wasn't consistent and failed in at least one of the 3 runs. This seems
    like a minor improvement.

    I was also testing __GFP_REPEAT costly requests (hugetlb) with
    fragmented memory and under memory pressure. The results are in patch
    11 where the logic is implemented. In short, I can see a huge
    improvement there.

    I am certainly interested in other use cases as well as any feedback,
    especially about those which require higher order requests.

    This patch (of 14):

    While playing with the oom detection rework [1] I have noticed that my
    heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
    where the reclaim hasn't made any progress but did_some_progress didn't
    reflect that, and compaction_suitable was backing off because no zone
    was above low wmark + 1 << order.

    It turned out that this is in fact a long standing bug in
    compaction_ready, which ignores the requested_highidx and does the
    watermark check for classzone_idx 0. This succeeds for zone DMA most
    of the time as the zone is mostly unused because of lowmem protection.
    As a result costly high order allocations always report successful
    progress even when there was none. This wasn't a problem so far
    because these allocations usually fail quite early or retry only a few
    times with __GFP_REPEAT, but this will change after a later patch in
    this series, so make sure not to lie about the progress and propagate
    requested_highidx down to compaction_ready and use it for both the
    watermark check and compaction_suitable to fix this issue.
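
    The gist of the fix might be sketched as below (illustrative only, not
    the exact mm/vmscan.c hunk; it assumes the kernel helpers
    zone_watermark_ok(), low_wmark_pages() and compaction_suitable() with
    their signatures from that era): both the watermark check and the
    compaction_suitable() call take the classzone index of the original
    request instead of a hardcoded 0.

    static bool compaction_ready_sketch(struct zone *zone, int order,
                                        int requested_highidx)
    {
            unsigned long watermark = low_wmark_pages(zone) + (1UL << order);

            /* before the fix, the classzone index here was always 0 */
            if (!zone_watermark_ok(zone, 0, watermark, requested_highidx, 0))
                    return false;

            return compaction_suitable(zone, order, 0, requested_highidx)
                                            != COMPACT_SKIPPED;
    }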

    [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
    [2] https://lkml.org/lkml/2015/10/12/808
    [3] https://lkml.org/lkml/2015/10/13/597

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The inactive file list should still be large enough to contain readahead
    windows and freshly written file data, but it no longer is the only
    source for detecting multiple accesses to file pages. The workingset
    refault measurement code causes recently evicted file pages that get
    accessed again after a shorter interval to be promoted directly to the
    active list.

    With that mechanism in place, we can afford to (on a larger system)
    dedicate more memory to the active file list, so we can actually cache
    more of the frequently used file pages in memory, and not have them
    pushed out by streaming writes, once-used streaming file reads, etc.

    This can help things like database workloads, where only half the page
    cache can currently be used to cache the database working set. This
    patch automatically increases that fraction on larger systems, using the
    same ratio that has already been used for anonymous memory.

    [hannes@cmpxchg.org: cgroup-awareness]
    Signed-off-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Andres observed that his database workload is struggling with the
    transaction journal creating pressure on frequently read pages.

    Access patterns like transaction journals frequently write the same
    pages over and over, but in the majority of cases those pages are never
    read back. There are no caching benefits to be had for those pages, so
    activating them and having them put pressure on pages that do benefit
    from caching is a bad choice.

    Leave page activations to read accesses and don't promote pages based on
    writes alone.

    It could be said that partially written pages do contain cache-worthy
    data, because even if *userspace* does not access the unwritten part,
    the kernel still has to read it from the filesystem for correctness.
    However, a counter argument is that these pages enjoy at least *some*
    protection over other inactive file pages through the writeback cache,
    in the sense that dirty pages are written back with a delay and cache
    reclaim leaves them alone until they have been written back to disk.
    Should that turn out to be insufficient and we see increased read IO
    from partial writes under memory pressure, we can always go back and
    update grab_cache_page_write_begin() to take (pos, len) so that it can
    tell partial writes from pages that don't need partial reads. But for
    now, keep it simple.

    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a follow-up to

    http://www.spinics.net/lists/linux-mm/msg101739.html

    where Andres reported his database workingset being pushed out by the
    minimum size enforcement of the inactive file list - currently 50% of
    cache - as well as repeatedly written file pages that are never actually
    read.

    Two changes fell out of the discussions. The first change observes that
    pages that are only ever written don't benefit from caching beyond what
    the writeback cache does for partial page writes, and so we shouldn't
    promote them to the active file list where they compete with pages whose
    cached data is actually accessed repeatedly. This change comes in two
    patches - one for in-cache write accesses and one for refaults triggered
    by writes, neither of which should promote a cache page.

    Second, with the refault detection we don't need to set 50% of the cache
    aside for used-once cache anymore since we can detect frequently used
    pages even when they are evicted between accesses. We can allow the
    active list to be bigger and thus protect a bigger workingset that isn't
    challenged by streamers. Depending on the access patterns, this can
    increase major faults during workingset transitions for better
    performance during stable phases.

    This patch (of 3):

    When rewriting a page, the data in that page is replaced with new data.
    This means that evicting something else from the active file list, in
    order to cache data that will be replaced by something else, is likely
    to be a waste of memory.

    It is better to save the active list for frequently read pages, because
    reads actually use the data that is in the page.

    This patch ignores partial writes, because it is unclear whether the
    complexity of identifying those is worth any potential performance gain
    obtained from better caching pages that see repeated partial writes at
    large enough intervals to not get caught by the use-twice promotion code
    used for the inactive file list.

    Signed-off-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reported-by: Andres Freund
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Pull MFD updates from Lee Jones:
    "New Drivers:
    - Add new driver for MAXIM MAX77620/MAX20024 PMIC
    - Add new driver for Hisilicon HI665X PMIC

    New Device Support:
    - Add support for AXP809 in axp20x-rsb
    - Add support for Power Supply in axp20x

    New core features:
    - devm_mfd_* managed resources

    Fix-ups:
    - Remove unused code (da9063-irq, wm8400-core, tps6105x,
    smsc-ece1099, twl4030-power)
    - Improve clean-up in error path (intel_quark_i2c_gpio)
    - Explicitly include headers (syscon.h)
    - Allow building as modules (max77693)
    - Use IS_ENABLED() instead of rolling your own (dm355evm_msp,
    wm8400-core)
    - DT adaptions (axp20x, hi655x, arizona, max77620)
    - Remove CLK_IS_ROOT flag (intel-lpss, intel_quark)
    - Move to gpiochip API (asic3, dm355evm_msp, htc-egpio, htc-i2cpld,
    sm501, tc6393xb, tps65010, ucb1x00, vexpress)
    - Make use of devm_mfd_* calls (act8945a, as3711, atmel-hlcdc,
    bcm590xx, hi6421-pmic-core, lp3943, menf21bmc, mt6397, rdc321x,
    rk808, rn5t618, rt5033, sky81452, stw481x, tps6507x, tps65217,
    wm8400)

    Bug Fixes:
    - Fix ACPI child matching (mfd-core)
    - Fix start-up ordering issues (mt6397-core, arizona-core)
    - Fix forgotten register state on resume (intel-lpss)
    - Fix Clock related issues (twl6040)
    - Fix scheduling whilst atomic (omap-usb-tll)
    - Kconfig changes (vexpress)"

    * tag 'mfd-for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
    mfd: hi655x: Add MFD driver for hi655x
    mfd: ab8500-debugfs: Trivial fix of spelling mistake on "between"
    mfd: vexpress: Add !ARCH_USES_GETTIMEOFFSET dependency
    mfd: Add device-tree binding doc for PMIC MAX77620/MAX20024
    mfd: max77620: Add core driver for MAX77620/MAX20024
    mfd: arizona: Add defines for GPSW values that can be used from DT
    mfd: omap-usb-tll: Fix scheduling while atomic BUG
    mfd: wm5110: ARIZONA_CLOCK_CONTROL should be volatile
    mfd: axp20x: Add a cell for the ac power_supply part of the axp20x PMICs
    mfd: intel_soc_pmic_core: Terminate panel control GPIO lookup table correctly
    mfd: wl1273-core: Use devm_mfd_add_devices() for mfd_device registration
    mfd: tps65910: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
    mfd: sec: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
    mfd: rc5t583: Use devm_mfd_add_devices and devm_request_threaded_irq
    mfd: max77686: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
    mfd: as3722: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
    mfd: twl4030-power: Remove driver path in file comment
    MAINTAINERS: Add entry for X-Powers AXP family PMIC drivers
    mfd: smsc-ece1099: Remove unnecessarily remove callback
    mfd: Use IS_ENABLED(CONFIG_FOO) instead of checking FOO || FOO_MODULE
    ...

    Linus Torvalds
     
  • Pull HSI updates from Sebastian Reichel:

    - merge omap-ssi and omap-ssi-port modules

    - fix omap-ssi module reloading

    - add DVFS support to omap-ssi

    * tag 'hsi-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
    HSI: omap-ssi: move omap_ssi_port_update_fclk
    HSI: omap-ssi: include pinctrl header files
    HSI: omap-ssi: add COMMON_CLK dependency
    HSI: omap-ssi: add clk change support
    HSI: omap_ssi: built omap_ssi and omap_ssi_port into one module
    HSI: omap_ssi: fix removal of port platform device
    HSI: omap_ssi: make sure probe stays available
    HSI: omap_ssi: fix module unloading
    HSI: omap_ssi_port: switch to gpiod API

    Linus Torvalds
     
  • Pull fbdev updates from Tomi Valkeinen:

    - imxfb: fix lcd power up

    - small fixes and cleanups

    * tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
    fbdev: Use IS_ENABLED() instead of checking for built-in or module
    efifb: Don't show the mapping VA
    video: AMBA CLCD: Remove unncessary include in amba-clcd.c
    fbdev: ssd1307fb: Fix charge pump setting
    Documentation: fb: fix spelling mistakes
    fbdev: fbmem: implement error handling in fbmem_init()
    fbdev: sh_mipi_dsi: remove driver
    video: fbdev: imxfb: add some error handling
    video: fbdev: imxfb: fix semantic of .get_power and .set_power
    video: fbdev: omap2: Remove deprecated regulator_can_change_voltage() usage

    Linus Torvalds
     
  • Pull crypto fix from Herbert Xu:
    "Fix a regression that causes sha-mb to crash"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: sha1-mb - make sha1_x8_avx2() conform to C function ABI

    Linus Torvalds
     
  • The newly added nps irqchip driver causes build warnings on ARM64.

    include/soc/nps/common.h: In function 'nps_host_reg_non_cl':
    include/soc/nps/common.h:148:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

    As the driver is only used on ARC, we don't need to see it without
    COMPILE_TEST elsewhere, and we can avoid the warnings by only building
    on 32-bit architectures even with CONFIG_COMPILE_TEST.

    Acked-by: Marc Zyngier
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Vineet Gupta
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Pull powerpc updates from Michael Ellerman:
    "Highlights:
    - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
    - Live patching support for ppc64le (also merged via livepatching.git)

    Various cleanups & minor fixes from:
    - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
    Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
    Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
    Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
    Bauermann, Valentin Rothberg, Vipin K Parashar.

    General:
    - Update LMB associativity index during DLPAR add/remove from Nathan
    Fontenot
    - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
    - Add support for userspace Power9 copy/paste from Chris Smart
    - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
    - Add mask of possible MMU features from Michael Ellerman

    PCI:
    - Enable pass through of NVLink to guests from Alexey Kardashevskiy
    - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
    - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
    - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
    - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
    from Guilherme G Piccoli
    - Remove the dependency on EEH struct in DDW mechanism from Guilherme
    G Piccoli

    selftests:
    - Test cp_abort during context switch from Chris Smart
    - Add several tests for transactional memory support from Rashmica
    Gupta

    perf:
    - Add support for sampling interrupt register state from Anju T
    - Add support for unwinding perf-stackdump from Chandan Kumar

    cxl:
    - Configure the PSL for two CAPI ports on POWER8NVL from Philippe
    Bergheaud
    - Allow initialization on timebase sync failures from Frederic Barrat
    - Increase timeout for detection of AFU mmio hang from Frederic
    Barrat
    - Handle num_of_processes larger than can fit in the SPA from Ian
    Munsie
    - Ensure PSL interrupt is configured for contexts with no AFU IRQs
    from Ian Munsie
    - Add kernel API to allow a context to operate with relocate disabled
    from Ian Munsie
    - Check periodically the coherent platform function's state from
    Christophe Lombard

    Freescale:
    - Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
    an erratum workaround, and a kconfig dependency fix."

    * tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
    powerpc/86xx: Fix PCI interrupt map definition
    powerpc/86xx: Move pci1 definition to the include file
    powerpc/fsl: Fix build of the dtb embedded kernel images
    powerpc/fsl: Fix rcpm compatible string
    powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
    powerpc/fsl-pci: Add a workaround for PCI 5 errata
    powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
    powerpc/powernv/npu: Add PE to PHB's list
    powerpc/powernv: Fix insufficient memory allocation
    powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
    Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
    powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
    powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
    powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
    powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
    Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
    powerpc/powernv/npu: Enable NVLink pass through
    powerpc/powernv/npu: Rework TCE Kill handling
    powerpc/powernv/npu: Add set/unset window helpers
    powerpc/powernv/ioda2: Export debug helper pe_level_printk()
    ...

    Linus Torvalds
     
  • Pull ARM updates from Russell King:
    "Changes included in this pull request:

    - revert pxa2xx-flash back to using ioremap_cached() and switch
    memremap() to use arch_memremap_wb()

    - remove pci=firmware command line argument handling

    - remove unnecessary arm_dma_set_mask() implementation, the generic
    implementation will do for ARM

    - removal of the ARM kallsyms "hack" to work around mode switching
    veneers and vectors located below PAGE_OFFSET

    - tidy up build system output a little

    - add L2 cache power management DT bindings

    - remove duplicated local_irq_disable() in reboot paths

    - handle AMBA primecell devices better at registration time with PM
    domains (needed for Samsung SoCs)

    - ARM specific preparation to support Keystone II kexec"

    * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
    ARM: 8567/1: cache-uniphier: activate ways for secondary CPUs
    ARM: 8570/2: Documentation: devicetree: Add PL310 PM bindings
    ARM: 8569/1: pl2x0: Add OF control of cache power management
    ARM: 8568/1: reboot: remove duplicated local_irq_disable()
    ARM: 8566/1: drivers: amba: properly handle devices with power domains
    ARM: provide arm_has_idmap_alias() helper
    ARM: kexec: remove 512MB restriction on kexec crashdump
    ARM: provide improved virt_to_idmap() functionality
    ARM: kexec: fix crashkernel= handling
    ARM: 8557/1: specify install, zinstall, and uinstall as PHONY targets
    ARM: 8562/1: suppress "include/generated/mach-types.h is up to date."
    ARM: 8553/1: kallsyms: remove --page-offset command line option
    ARM: 8552/1: kallsyms: remove special lower address limit for CONFIG_ARM
    ARM: 8555/1: kallsyms: ignore ARM mode switching veneers
    ARM: 8548/1: dma-mapping: remove arm_dma_set_mask()
    ARM: 8554/1: kernel: pci: remove pci=firmware command line parameter handling
    ARM: memremap: implement arch_memremap_wb()
    memremap: add arch specific hook for MEMREMAP_WB mappings
    mtd: pxa2xx-flash: switch back from memremap to ioremap_cached
    ARM: reintroduce ioremap_cached() for creating cached I/O mappings

    Linus Torvalds
     

20 May, 2016

20 commits

  • Merge updates from Andrew Morton:

    - fsnotify fix

    - poll() timeout fix

    - a few scripts/ tweaks

    - debugobjects updates

    - the (small) ocfs2 queue

    - Minor fixes to kernel/padata.c

    - Maybe half of the MM queue

    * emailed patches from Andrew Morton: (117 commits)
    mm, page_alloc: restore the original nodemask if the fast path allocation failed
    mm, page_alloc: uninline the bad page part of check_new_page()
    mm, page_alloc: don't duplicate code in free_pcp_prepare
    mm, page_alloc: defer debugging checks of pages allocated from the PCP
    mm, page_alloc: defer debugging checks of freed pages until a PCP drain
    cpuset: use static key better and convert to new API
    mm, page_alloc: inline pageblock lookup in page free fast paths
    mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
    mm, page_alloc: pull out side effects from free_pages_check
    mm, page_alloc: un-inline the bad part of free_pages_check
    mm, page_alloc: check multiple page fields with a single branch
    mm, page_alloc: remove field from alloc_context
    mm, page_alloc: avoid looking up the first zone in a zonelist twice
    mm, page_alloc: shortcut watermark checks for order-0 pages
    mm, page_alloc: reduce cost of fair zone allocation policy retry
    mm, page_alloc: shorten the page allocator fast path
    mm, page_alloc: check once if a zone has isolated pageblocks
    mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
    mm, page_alloc: simplify last cpupid reset
    mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
    ...

    Linus Torvalds
     
  • The page allocator fast path uses either the requested nodemask or
    cpuset_current_mems_allowed if cpusets are enabled. If the allocation
    context allows watermarks to be ignored then it can also ignore memory
    policies. However, on entering the allocator slowpath the nodemask may
    still be cpuset_current_mems_allowed and the policies are enforced.
    This patch resets the nodemask appropriately before entering the
    slowpath.
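
    A tiny self-contained sketch of the idea (the names are stand-ins, not
    the real __alloc_pages_nodemask() code): remember the caller's nodemask
    and restore it before the slow path, because the fast path may have
    substituted the cpuset mask.

    #include <stddef.h>

    struct alloc_ctx { const unsigned long *nodemask; };    /* stand-in */

    static void *allocate(struct alloc_ctx *ac,
                          const unsigned long *caller_mask,
                          const unsigned long *cpuset_mask,
                          void *(*fastpath)(struct alloc_ctx *),
                          void *(*slowpath)(struct alloc_ctx *))
    {
            void *page;

            /* fast path: fall back to the cpuset mask when none was given */
            ac->nodemask = caller_mask ? caller_mask : cpuset_mask;
            page = fastpath(ac);
            if (page)
                    return page;

            /* slow path: honour the caller's original policy again */
            ac->nodemask = caller_mask;
            return slowpath(ac);
    }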

    Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mel Gorman
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Bad pages should be rare so the code handling them doesn't need to be
    inline for performance reasons. Move it to a separate function which
    returns void. This also assumes that the initial page_expected_state()
    result will match the result of the thorough check, i.e. the page
    doesn't become "good" in the meantime. This matches the expectations
    already in place in free_pages_check().

    !DEBUG_VM bloat-o-meter:

    add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
    function old new delta
    check_new_page_bad - 134 +134
    get_page_from_freelist 3468 3194 -274

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The new free_pcp_prepare() function shares a lot of code with
    free_pages_prepare(), which makes this a maintenance risk when some
    future patch modifies only one of them. We should be able to achieve
    the same effect (skipping free_pages_check() from !DEBUG_VM configs) by
    adding a parameter to free_pages_prepare() and making it inline, so the
    checks (and the order != 0 parts) are eliminated from the call from
    free_pcp_prepare().
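
    A self-contained sketch of the technique (illustrative stand-in names,
    not the real mm/page_alloc.c functions): a single inline helper takes a
    compile-time-constant flag, so the compiler drops the expensive checks
    from the per-cpu free path while only one copy of the shared code is
    maintained.

    struct page_stub { unsigned long flags; };              /* stand-in */

    static int page_looks_sane(const struct page_stub *p)   /* stand-in check */
    {
            return p->flags == 0;
    }

    static inline int prepare_free(struct page_stub *p, int check)
    {
            if (check && !page_looks_sane(p))   /* compiled away when check == 0 */
                    return 0;
            p->flags = 0;                       /* shared bookkeeping */
            return 1;
    }

    /* per-cpu freeing path: the constant 0 lets the checks be eliminated */
    static int free_pcp_prepare_sketch(struct page_stub *p)
    {
            return prepare_free(p, 0);
    }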

    !DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already
    inlining free_pages_prepare() and the elimination seems to work as
    expected.

    DEBUG_VM bloat-o-meter:

    add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257)
    function old new delta
    __free_pages_ok 297 1060 +763
    free_hot_cold_page 480 752 +272
    free_pages_prepare 778 - -778

    Here inlining didn't occur before, and added some code, but it's ok for
    a debug option.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every page allocated checks a number of page fields for validity. This
    catches corruption bugs of pages that are already freed but it is
    expensive. This patch weakens the debugging check by checking PCP pages
    only when the PCP lists are being refilled. All compound pages are
    checked. This potentially avoids debugging checks entirely if the PCP
    lists are never emptied and refilled so some corruption issues may be
    missed. Full checking requires DEBUG_VM.

    With the two deferred debugging patches applied, the impact to a page
    allocator microbenchmark is

    4.6.0-rc3 4.6.0-rc3
    inline-v3r6 deferalloc-v3r7
    Min alloc-odr0-1 344.00 ( 0.00%) 317.00 ( 7.85%)
    Min alloc-odr0-2 248.00 ( 0.00%) 231.00 ( 6.85%)
    Min alloc-odr0-4 209.00 ( 0.00%) 192.00 ( 8.13%)
    Min alloc-odr0-8 181.00 ( 0.00%) 166.00 ( 8.29%)
    Min alloc-odr0-16 168.00 ( 0.00%) 154.00 ( 8.33%)
    Min alloc-odr0-32 161.00 ( 0.00%) 148.00 ( 8.07%)
    Min alloc-odr0-64 158.00 ( 0.00%) 145.00 ( 8.23%)
    Min alloc-odr0-128 156.00 ( 0.00%) 143.00 ( 8.33%)
    Min alloc-odr0-256 168.00 ( 0.00%) 154.00 ( 8.33%)
    Min alloc-odr0-512 178.00 ( 0.00%) 167.00 ( 6.18%)
    Min alloc-odr0-1024 186.00 ( 0.00%) 174.00 ( 6.45%)
    Min alloc-odr0-2048 192.00 ( 0.00%) 180.00 ( 6.25%)
    Min alloc-odr0-4096 198.00 ( 0.00%) 184.00 ( 7.07%)
    Min alloc-odr0-8192 200.00 ( 0.00%) 188.00 ( 6.00%)
    Min alloc-odr0-16384 201.00 ( 0.00%) 188.00 ( 6.47%)
    Min free-odr0-1 189.00 ( 0.00%) 180.00 ( 4.76%)
    Min free-odr0-2 132.00 ( 0.00%) 126.00 ( 4.55%)
    Min free-odr0-4 104.00 ( 0.00%) 99.00 ( 4.81%)
    Min free-odr0-8 90.00 ( 0.00%) 85.00 ( 5.56%)
    Min free-odr0-16 84.00 ( 0.00%) 80.00 ( 4.76%)
    Min free-odr0-32 80.00 ( 0.00%) 76.00 ( 5.00%)
    Min free-odr0-64 78.00 ( 0.00%) 74.00 ( 5.13%)
    Min free-odr0-128 77.00 ( 0.00%) 73.00 ( 5.19%)
    Min free-odr0-256 94.00 ( 0.00%) 91.00 ( 3.19%)
    Min free-odr0-512 108.00 ( 0.00%) 112.00 ( -3.70%)
    Min free-odr0-1024 115.00 ( 0.00%) 118.00 ( -2.61%)
    Min free-odr0-2048 120.00 ( 0.00%) 125.00 ( -4.17%)
    Min free-odr0-4096 123.00 ( 0.00%) 129.00 ( -4.88%)
    Min free-odr0-8192 126.00 ( 0.00%) 130.00 ( -3.17%)
    Min free-odr0-16384 126.00 ( 0.00%) 131.00 ( -3.97%)

    Note that the free paths for large numbers of pages is impacted as the
    debugging cost gets shifted into that path when the page data is no
    longer necessarily cache-hot.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every page free checks a number of page fields for validity. This
    catches premature frees and corruptions but it is also expensive. This
    patch weakens the debugging check by checking PCP pages at the time they
    are drained from the PCP list. This will trigger the bug but the site
    that freed the corrupt page will be lost. To get the full context, a
    kernel rebuild with DEBUG_VM is necessary.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • An important function for cpusets is cpuset_node_allowed(), which
    optimizes on the fact if there's a single root CPU set, it must be
    trivially allowed. But the check "nr_cpusets()
    Signed-off-by: Mel Gorman
    Acked-by: Zefan Li
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The function call overhead of get_pfnblock_flags_mask() is measurable in
    the page free paths. This patch uses an inlined version that is faster.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The original count is never reused so it can be removed.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A check without side-effects should be easier to maintain. It also
    removes the duplicated cpupid and flags reset done in the !DEBUG_VM
    variant of both free_pcp_prepare() and then bulkfree_pcp_prepare().
    Finally, it enables the next patch.

    It shouldn't result in new branches, thanks to inlining of the check.

    !DEBUG_VM bloat-o-meter:

    add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27)
    function old new delta
    __free_pages_ok 748 739 -9
    free_pcppages_bulk 1403 1385 -18

    DEBUG_VM:

    add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28)
    function old new delta
    free_pages_prepare 806 778 -28

    This is also slightly faster because cpupid information is not set on
    tail pages so we can avoid resets there.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • From: Vlastimil Babka

    !DEBUG_VM size and bloat-o-meter:

    add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-370 (-246)
    function old new delta
    free_pages_check_bad - 124 +124
    free_pcppages_bulk 1288 1171 -117
    __free_pages_ok 948 695 -253

    DEBUG_VM:

    add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-214 (-90)
    function old new delta
    free_pages_check_bad - 124 +124
    free_pages_prepare 1112 898 -214

    [akpm@linux-foundation.org: fix whitespace]
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Every page allocated or freed is checked for sanity to avoid corruptions
    that are difficult to detect later. A bad page could be due to a number
    of fields. Instead of using multiple branches, this patch combines
    multiple fields into a single branch check. A detailed per-field check
    is only necessary if that combined check fails.

    4.6.0-rc2 4.6.0-rc2
    initonce-v1r20 multcheck-v1r20
    Min alloc-odr0-1 359.00 ( 0.00%) 348.00 ( 3.06%)
    Min alloc-odr0-2 260.00 ( 0.00%) 254.00 ( 2.31%)
    Min alloc-odr0-4 214.00 ( 0.00%) 213.00 ( 0.47%)
    Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
    Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
    Min alloc-odr0-32 165.00 ( 0.00%) 166.00 ( -0.61%)
    Min alloc-odr0-64 162.00 ( 0.00%) 162.00 ( 0.00%)
    Min alloc-odr0-128 161.00 ( 0.00%) 160.00 ( 0.62%)
    Min alloc-odr0-256 170.00 ( 0.00%) 169.00 ( 0.59%)
    Min alloc-odr0-512 181.00 ( 0.00%) 180.00 ( 0.55%)
    Min alloc-odr0-1024 190.00 ( 0.00%) 188.00 ( 1.05%)
    Min alloc-odr0-2048 196.00 ( 0.00%) 194.00 ( 1.02%)
    Min alloc-odr0-4096 202.00 ( 0.00%) 199.00 ( 1.49%)
    Min alloc-odr0-8192 205.00 ( 0.00%) 202.00 ( 1.46%)
    Min alloc-odr0-16384 205.00 ( 0.00%) 203.00 ( 0.98%)

    Again, the benefit is marginal but avoiding excessive branches is
    important. Ideally the paths would not have to check these conditions
    at all but regrettably abandoning the tests would make use-after-free
    bugs much harder to detect.
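
    A self-contained sketch of the single-branch idea (the field names are
    illustrative stand-ins for the relevant struct page members): OR the
    fields that must all be clean into one word, so the common good-page
    case costs a single compare and the detailed diagnostics run only in
    the rare bad case.

    struct page_fields {                /* stand-in for struct page bits */
            unsigned long mapping;
            unsigned long refcount;
            unsigned long bad_flags;
    };

    static int page_is_bad(const struct page_fields *p)
    {
            /* one branch for the overwhelmingly common all-clean case */
            if (!(p->mapping | p->refcount | p->bad_flags))
                    return 0;

            /* rare: something is off, do the expensive per-field report */
            return 1;   /* a full check would report which field is bad */
    }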

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The classzone_idx can be inferred from preferred_zoneref so remove the
    unnecessary field and save stack space.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The allocator fast path looks up the first usable zone in a zonelist and
    then get_page_from_freelist does the same job in the zonelist iterator.
    This patch preserves the necessary information.

    4.6.0-rc2 4.6.0-rc2
    fastmark-v1r20 initonce-v1r20
    Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
    Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
    Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
    Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
    Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
    Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
    Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
    Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
    Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
    Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
    Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
    Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
    Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
    Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
    Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)

    The benefit is negligible and the results are within the noise but each
    cycle counts.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Watermarks have to be checked on every allocation including the number
    of pages being allocated and whether reserves can be accessed. The
    reserves only matter if memory is limited and the free_pages adjustment
    only applies to high-order pages. This patch adds a shortcut for
    order-0 pages that avoids numerous calculations if there is plenty of
    free memory yielding the following performance difference in a page
    allocator microbenchmark;

    4.6.0-rc2 4.6.0-rc2
    optfair-v1r20 fastmark-v1r20
    Min alloc-odr0-1 380.00 ( 0.00%) 364.00 ( 4.21%)
    Min alloc-odr0-2 273.00 ( 0.00%) 262.00 ( 4.03%)
    Min alloc-odr0-4 227.00 ( 0.00%) 214.00 ( 5.73%)
    Min alloc-odr0-8 196.00 ( 0.00%) 186.00 ( 5.10%)
    Min alloc-odr0-16 183.00 ( 0.00%) 173.00 ( 5.46%)
    Min alloc-odr0-32 173.00 ( 0.00%) 165.00 ( 4.62%)
    Min alloc-odr0-64 169.00 ( 0.00%) 161.00 ( 4.73%)
    Min alloc-odr0-128 169.00 ( 0.00%) 159.00 ( 5.92%)
    Min alloc-odr0-256 180.00 ( 0.00%) 168.00 ( 6.67%)
    Min alloc-odr0-512 190.00 ( 0.00%) 180.00 ( 5.26%)
    Min alloc-odr0-1024 198.00 ( 0.00%) 190.00 ( 4.04%)
    Min alloc-odr0-2048 204.00 ( 0.00%) 196.00 ( 3.92%)
    Min alloc-odr0-4096 209.00 ( 0.00%) 202.00 ( 3.35%)
    Min alloc-odr0-8192 213.00 ( 0.00%) 206.00 ( 3.29%)
    Min alloc-odr0-16384 214.00 ( 0.00%) 206.00 ( 3.74%)
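
    A hedged sketch of the order-0 shortcut described above (illustrative
    names and structure, not the exact __zone_watermark_ok() fast path):
    when free memory is comfortably above the watermark plus the lowmem
    reserve for the zone index, skip the per-order free list accounting.

    static int watermark_fast_sketch(unsigned long free_pages,
                                     unsigned int order,
                                     unsigned long mark,
                                     unsigned long lowmem_reserve,
                                     int (*full_check)(void))
    {
            /* order-0: one addition and compare covers the common case */
            if (order == 0 && free_pages > mark + lowmem_reserve)
                    return 1;

            /* high-order request or tight memory: do the full calculation */
            return full_check();
    }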

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fair zone allocation policy is not without cost but it can be
    reduced slightly. This patch removes an unnecessary local variable,
    checks the likely conditions of the fair zone policy first, uses a bool
    instead of a flags check and falls through when a remote node is
    encountered instead of doing a full restart. The benefit is marginal
    but it's there

    4.6.0-rc2 4.6.0-rc2
    decstat-v1r20 optfair-v1r20
    Min alloc-odr0-1 377.00 ( 0.00%) 380.00 ( -0.80%)
    Min alloc-odr0-2 273.00 ( 0.00%) 273.00 ( 0.00%)
    Min alloc-odr0-4 226.00 ( 0.00%) 227.00 ( -0.44%)
    Min alloc-odr0-8 196.00 ( 0.00%) 196.00 ( 0.00%)
    Min alloc-odr0-16 183.00 ( 0.00%) 183.00 ( 0.00%)
    Min alloc-odr0-32 175.00 ( 0.00%) 173.00 ( 1.14%)
    Min alloc-odr0-64 172.00 ( 0.00%) 169.00 ( 1.74%)
    Min alloc-odr0-128 170.00 ( 0.00%) 169.00 ( 0.59%)
    Min alloc-odr0-256 183.00 ( 0.00%) 180.00 ( 1.64%)
    Min alloc-odr0-512 191.00 ( 0.00%) 190.00 ( 0.52%)
    Min alloc-odr0-1024 199.00 ( 0.00%) 198.00 ( 0.50%)
    Min alloc-odr0-2048 204.00 ( 0.00%) 204.00 ( 0.00%)
    Min alloc-odr0-4096 210.00 ( 0.00%) 209.00 ( 0.48%)
    Min alloc-odr0-8192 213.00 ( 0.00%) 213.00 ( 0.00%)
    Min alloc-odr0-16384 214.00 ( 0.00%) 214.00 ( 0.00%)

    The benefit is marginal at best, but one of the most important gains,
    avoiding a second search when falling back to another node, is not
    triggered by this particular test, so the benefit for some corner cases
    is understated.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The page allocator fast path checks page multiple times unnecessarily.
    This patch avoids all the slowpath checks if the first allocation
    attempt succeeds.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When bulk freeing pages from the per-cpu lists the zone is checked for
    isolated pageblocks on every release. This patch checks it once per
    drain.

    [mgorman@techsingularity.net: fix locking race, per Vlastimil]
    Signed-off-by: Mel Gorman
    Signed-off-by: Vlastimil Babka
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_HARDWALL only has meaning in the context of cpusets but the fast
    path always applies the flag on the first attempt. Move the
    manipulations into the cpuset paths where they will be masked by a
    static branch in the common case.

    With the other micro-optimisations in this series combined, the impact
    on a page allocator microbenchmark is

    4.6.0-rc2 4.6.0-rc2
    decstat-v1r20 micro-v1r20
    Min alloc-odr0-1 381.00 ( 0.00%) 377.00 ( 1.05%)
    Min alloc-odr0-2 275.00 ( 0.00%) 273.00 ( 0.73%)
    Min alloc-odr0-4 229.00 ( 0.00%) 226.00 ( 1.31%)
    Min alloc-odr0-8 199.00 ( 0.00%) 196.00 ( 1.51%)
    Min alloc-odr0-16 186.00 ( 0.00%) 183.00 ( 1.61%)
    Min alloc-odr0-32 179.00 ( 0.00%) 175.00 ( 2.23%)
    Min alloc-odr0-64 174.00 ( 0.00%) 172.00 ( 1.15%)
    Min alloc-odr0-128 172.00 ( 0.00%) 170.00 ( 1.16%)
    Min alloc-odr0-256 181.00 ( 0.00%) 183.00 ( -1.10%)
    Min alloc-odr0-512 193.00 ( 0.00%) 191.00 ( 1.04%)
    Min alloc-odr0-1024 201.00 ( 0.00%) 199.00 ( 1.00%)
    Min alloc-odr0-2048 206.00 ( 0.00%) 204.00 ( 0.97%)
    Min alloc-odr0-4096 212.00 ( 0.00%) 210.00 ( 0.94%)
    Min alloc-odr0-8192 215.00 ( 0.00%) 213.00 ( 0.93%)
    Min alloc-odr0-16384 216.00 ( 0.00%) 214.00 ( 0.93%)

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The current reset unnecessarily clears flags and makes pointless
    calculations.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman