07 Nov, 2015

17 commits

  • Andrew stated the following:

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged but the clarity might motivate
    some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The primary purpose of watermarks is to ensure that reclaim can always
    make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
    These assume that order-0 allocations are all that is necessary for
    forward progress.

    High-order watermarks serve a different purpose. Kswapd had no high-order
    awareness before they were introduced
    (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
    particularly important when there were high-order atomic requests. The
    watermarks both gave kswapd awareness and made a reserve for those atomic
    requests.

    There are two important side-effects of this. The most important is that
    a non-atomic high-order request can fail even though free pages are
    available and the order-0 watermarks are ok. The second is that
    high-order watermark checks are expensive as the free list counts up to
    the requested order must be examined.

    With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
    have high-order watermarks. Kswapd and compaction still need high-order
    awareness which is handled by checking that at least one suitable
    high-order page is free.
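
    In code terms, the remaining high-order awareness reduces to roughly the
    following free-list scan (a hedged sketch using buddy-allocator names such
    as free_area and MIGRATE_PCPTYPES; the real patch folds this into
    __zone_watermark_ok() rather than a separate helper):

    /* Hedged sketch: after order-0 watermarks pass, only check that at
     * least one suitably sized page is free. */
    static bool has_suitable_free_page(struct zone *z, unsigned int order)
    {
            unsigned int o;
            int mt;

            for (o = order; o < MAX_ORDER; o++) {
                    struct free_area *area = &z->free_area[o];

                    if (!area->nr_free)
                            continue;

                    /* Any unreserved migratetype with a free block will do */
                    for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
                            if (!list_empty(&area->free_list[mt]))
                                    return true;
            }
            return false;
    }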

    With the patch applied, there was little difference in the allocation
    failure rates as the atomic reserves are small relative to the number of
    allocation attempts. The expected impact is that there will never be an
    allocation failure report that shows suitable pages on the free lists.

    The one potential side-effect of this is that in a vanilla kernel, the
    watermark checks may have kept a free page for an atomic allocation. Now,
    we are 100% relying on the HighAtomic reserves and an early allocation to
    have allocated them. If the first high-order atomic allocation is after
    the system is already heavily fragmented then it'll fail.

    [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • High-order watermark checking exists for two reasons -- kswapd high-order
    awareness and protection for high-order atomic requests. Historically the
    kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
    high-order free pages for as long as possible. This patch introduces
    MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
    allocations on demand and avoids using those blocks for order-0
    allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

    A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
    allocation request steals a pageblock but limits the total number to 1% of
    the zone. Callers that speculatively abuse atomic allocations for
    long-lived high-order allocations to access the reserve will quickly fail.
    Note that SLUB is currently not such an abuser as it reclaims at least
    once. It is possible that the pageblock stolen has few suitable
    high-order pages and will need to steal again in the near future but there
    would need to be strong justification to search all pageblocks for an
    ideal candidate.

    The pageblocks are unreserved if an allocation fails after a direct
    reclaim attempt.

    The watermark checks account for the reserved pageblocks when the
    allocation request is not a high-order atomic allocation.

    The reserved pageblocks can not be used for order-0 allocations. This may
    allow temporary wastage until a failed reclaim reassigns the pageblock.
    This is deliberate as the intent of the reservation is to satisfy a
    limited number of atomic high-order short-lived requests if the system
    requires them.
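
    A hedged sketch of the reservation path described above (function and
    field names are assumptions based on the description; the real code also
    takes zone->lock and tolerates racy checks):

    /* Cap highatomic pageblocks at roughly 1% of the zone, creating
     * them on demand when a high-order atomic allocation steals one. */
    static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
    {
            unsigned long max_managed = zone->managed_pages / 100;

            if (zone->nr_reserved_highatomic >= max_managed)
                    return;

            if (get_pageblock_migratetype(page) != MIGRATE_HIGHATOMIC) {
                    zone->nr_reserved_highatomic += pageblock_nr_pages;
                    set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
                    move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
            }
    }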

    The stutter benchmark was used to evaluate this but while it was running
    there was a systemtap script that randomly allocated between 1 high-order
    page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
    is much larger than the potential reserve and it does not attempt to be
    realistic. It is intended to stress random high-order allocations from an
    unknown source, show that there is a reduction in failures without
    introducing an anomaly where atomic allocations are more reliable than
    regular allocations. The amount of memory reserved varied throughout the
    workload as reserves were created and reclaimed under memory pressure.
    The allocation failures once the workload warmed up were as follows:

    4.2-rc5-vanilla          70%
    4.2-rc5-atomic-reserve   56%

    The failure rate was also measured while building multiple kernels. The
    failure rate was 14% but is 6% with this patch applied.

    Overall, this is a small reduction but the reserves are small relative to
    the number of allocation requests. In early versions of the patch, the
    failure rate reduced by a much larger amount but that required much larger
    reserves and perversely made atomic allocations seem more reliable than
    regular allocations.

    [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MIGRATE_RESERVE preserves an old property of the buddy allocator that
    existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
    tended to remain contiguous until the only alternative was to fail the
    allocation. At the time it was discovered that high-order atomic
    allocations relied on this property so MIGRATE_RESERVE was introduced. A
    later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
    deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
    Note that this patch in isolation may look like a false regression if
    someone was bisecting high-order atomic allocation failures.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zonelist cache (zlc) was introduced to skip over zones that were
    recently known to be full. This avoided expensive operations such as the
    cpuset checks, watermark calculations and zone_reclaim. The situation
    today is different and the complexity of zlc is harder to justify.

    1) The cpuset checks are no-ops unless a cpuset is active and in general
    are a lot cheaper.

    2) zone_reclaim is now disabled by default and I suspect that was a large
    source of the cost that zlc wanted to avoid. When it is enabled, it's
    known to be a major source of stalling when nodes fill up and it's
    unwise to hit every other user with the overhead.

    3) Watermark checks are expensive to calculate for high-order
    allocation requests. Later patches in this series will reduce the cost
    of the watermark checking.

    4) The most important issue is that in the current implementation it
    is possible for a failed THP allocation to mark a zone full for order-0
    allocations and cause a fallback to remote nodes.

    The last issue could be addressed with additional complexity but as the
    benefit of zlc is questionable, it is better to remove it. If stalls due
    to zone_reclaim are ever reported then an alternative would be to
    introduce deferring logic based on a timeout inside zone_reclaim itself
    and leave the page allocator fast paths alone.

    The impact on page-allocator microbenchmarks is negligible as they don't
    hit the paths where the zlc comes into play. Most page-reclaim related
    workloads showed no noticeable difference as a result of the removal.

    The impact was noticeable in a workload called "stutter". One part uses a
    lot of anonymous memory, a second measures mmap latency and a third copies
    a large file. In an ideal world the latency application would not notice
    the mmap latency. On a 2-node machine the results of this patch are

    stutter
                                 4.3.0-rc1             4.3.0-rc1
                                  baseline              nozlc-v4
    Min          mmap        20.9243 (  0.00%)       20.7716 (  0.73%)
    1st-qrtle    mmap        22.0612 (  0.00%)       22.0680 ( -0.03%)
    2nd-qrtle    mmap        22.3291 (  0.00%)       22.3809 ( -0.23%)
    3rd-qrtle    mmap        25.2244 (  0.00%)       25.2396 ( -0.06%)
    Max-90%      mmap        48.0995 (  0.00%)       28.3713 ( 41.02%)
    Max-93%      mmap        52.5557 (  0.00%)       36.0170 ( 31.47%)
    Max-95%      mmap        55.8173 (  0.00%)       47.3163 ( 15.23%)
    Max-99%      mmap        67.3781 (  0.00%)       70.1140 ( -4.06%)
    Max          mmap     24447.6375 (  0.00%)    12915.1356 ( 47.17%)
    Mean         mmap        33.7883 (  0.00%)       27.7944 ( 17.74%)
    Best99%Mean  mmap        27.7825 (  0.00%)       25.2767 (  9.02%)
    Best95%Mean  mmap        26.3912 (  0.00%)       23.7994 (  9.82%)
    Best90%Mean  mmap        24.9886 (  0.00%)       23.2251 (  7.06%)
    Best50%Mean  mmap        22.0157 (  0.00%)       22.0261 ( -0.05%)
    Best10%Mean  mmap        21.6705 (  0.00%)       21.6083 (  0.29%)
    Best5%Mean   mmap        21.5581 (  0.00%)       21.4611 (  0.45%)
    Best1%Mean   mmap        21.3079 (  0.00%)       21.1631 (  0.68%)

    Note that the maximum stall latency went from 24 seconds to 12, which is
    still bad but an improvement. The mileage varies considerably: a 2-node
    machine in an earlier test went from 494 seconds to 47 seconds, and a
    4-node machine that tested an earlier version of this patch went from a
    worst-case stall time of 6 seconds to 67ms. The nature of the benchmark
    is inherently unpredictable as it is hammering the system, and the
    mileage will vary between machines.

    There is a secondary impact with potentially more direct reclaim because
    zones are now being considered instead of being skipped by zlc. In this
    particular test run it did not occur so will not be described. However,
    in at least one test the following was observed

    1. Direct reclaim rates were higher. This was likely due to direct reclaim
    being entered instead of the zlc disabling a zone and busy looping.
    Busy looping may have the effect of allowing kswapd to make more
    progress and in some cases may be better overall. If this is found then
    the correct action is to put direct reclaimers to sleep on a waitqueue
    and allow kswapd to make forward progress. Busy looping on the zlc is even
    worse than when the allocator used to blindly call congestion_wait().

    2. There was higher swap activity as direct reclaim was active.

    3. Direct reclaim efficiency was lower. This is related to 1 as more
    scanning activity also encountered more pages that could not be
    immediately reclaimed.

    In that case, the direct page scan and reclaim rates are noticeable but
    it is not considered a problem for a few reasons:

    1. The test is primarily concerned with latency. The mmap attempts are also
    faulted which means there are THP allocation requests. The ZLC could
    cause zones to be disabled causing the process to busy loop instead
    of reclaiming. This looks like elevated direct reclaim activity but
    it's the correct action to take based on what processes requested.

    2. The test hammers reclaim and compaction heavily. The number of successful
    THP faults is highly variable but affects the reclaim stats. It's not a
    realistic or reasonable measure of page reclaim activity.

    3. No other page-reclaim intensive workload that was tested showed a problem.

    4. If a workload is identified that benefited from the busy looping then it
    should be fixed by having direct reclaimers sleep on a wait queue until
    woken by kswapd instead of busy looping. We had this class of problem before
    when congestion_wait() with a fixed timeout was a brain-damaged decision
    but happened to benefit some workloads.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Historically, callers cleared __GFP_WAIT to signal that they were in
    atomic context and could not sleep. Now it is possible to distinguish
    between true atomic context and callers that are merely unwilling to
    sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still
    be woken. As clearing __GFP_WAIT behaves differently, there is a risk
    that people will clear the wrong flags. This patch renames __GFP_WAIT to
    __GFP_RECLAIM to clearly indicate what it does -- setting it allows all
    reclaim activity, clearing it prevents any.
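
    In flag terms, the rename amounts to something like the following (a
    sketch consistent with the description; the ___GFP_* spellings mirror
    gfp.h conventions):

    /* Setting __GFP_RECLAIM grants both reclaim rights; clearing it
     * removes both. */
    #define __GFP_RECLAIM \
            ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))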

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
    allocation flags. There is only one user of this flag combination now and
    there appears to be no reason why Lustre had to be protected from reclaim
    stalls. As none of the sites appear to be atomic, this patch simply
    deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or
    GFP_NOIO as appropriate.

    Signed-off-by: Mel Gorman
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
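
    The gfpflags_allow_blocking() helper mentioned above reduces to a
    one-line test; a hedged sketch:

    /* Blocking is possible iff direct reclaim is allowed. */
    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
            return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }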

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • This patch redefines which GFP bits are used for specifying mobility and
    the order of the migrate types. Once redefined it's possible to convert
    GFP flags to a migrate type with a simple mask and shift. The only
    downside is that readers of OOM kill messages and allocation failures may
    have been used to the existing values but scripts/gfp-translate will help.
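
    A hedged sketch of the mask-and-shift conversion (the macro names are
    assumptions; scripts/gfp-translate decodes the raw values):

    static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
    {
            /* The mobility bits are now contiguous, so a mask and shift
             * yield the migrate type directly. */
            return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
    }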

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is a seqcounter that protects against spurious allocation failures
    when a task is changing the allowed nodes in a cpuset. There is no need
    to check the seqcounter until a cpuset exists.
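
    A hedged sketch of the short-circuit (cpusets_enabled() is an assumed
    helper for "a cpuset exists"):

    static inline unsigned int read_mems_allowed_begin(void)
    {
            /* No cpusets, no retry cookie needed */
            if (!cpusets_enabled())
                    return 0;

            return read_seqcount_begin(&current->mems_allowed_seq);
    }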

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • File-backed pages that will be immediately written are balanced between
    zones. This heuristic tries to avoid having a single zone filled with
    recently dirtied pages but the checks are unnecessarily expensive. Move
    consider_zone_balanced into the alloc_context instead of checking bitmaps
    multiple times. The patch also gives the parameter a more meaningful
    name.
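
    A hedged sketch of the idea (the field name is an assumption standing in
    for the "more meaningful name" the patch introduces):

    struct alloc_context {
            struct zonelist *zonelist;
            nodemask_t *nodemask;
            /* ...existing members... */
            bool spread_dirty_pages;        /* computed once per allocation */
    };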

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Overall, the intent of this series is to remove the zonelist cache which
    was introduced to avoid high overhead in the page allocator. Once this is
    done, it is necessary to reduce the cost of watermark checks.

    The series starts with minor micro-optimisations.

    Next it notes that GFP flags that affect watermark checks are abused.
    __GFP_WAIT historically identified callers that could not sleep and could
    access reserves. This was later abused to identify callers that simply
    prefer to avoid sleeping and have other options. A patch distinguishes
    between atomic callers, high-priority callers and those that simply wish
    to avoid sleep.

    The zonelist cache has been around for a long time but it is of dubious
    merit with a lot of complexity and some issues that are explained. The
    most important issue is that a failed THP allocation can cause a zone to
    be treated as "full". This potentially causes unnecessary stalls, reclaim
    activity or remote fallbacks. The issues could be fixed but it's not
    worth it. The series places a small number of other micro-optimisations
    on top before examining the GFP flags that affect the watermark checks.

    High-order watermarks enforcement can cause high-order allocations to fail
    even though pages are free. The watermark checks both protect high-order
    atomic allocations and make kswapd aware of high-order pages but there is
    a much better way that can be handled using migrate types. This series
    uses page grouping by mobility to reserve pageblocks for high-order
    allocations with the size of the reservation depending on demand. kswapd
    awareness is maintained by examining the free lists. By patch 12 in this
    series, there are no high-order watermark checks while preserving the
    properties that motivated the introduction of the watermark checks.

    This patch (of 10):

    No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
    removes the unnecessary parameter.
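
    A hedged reconstruction of the resulting prototype (the alloc_flags
    parameter is simply dropped):

    bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
                                unsigned long mark, int classzone_idx);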

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce is_sysrq_oom helper function indicating oom kill triggered
    by sysrq to improve readability.

    No functional changes.
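
    A hedged sketch of the helper (sysrq-triggered OOM kills are marked with
    order == -1):

    static inline bool is_sysrq_oom(struct oom_control *oc)
    {
            /* sysrq forces an OOM kill with the magic order -1 */
            return oc->order == -1;
    }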

    Signed-off-by: Yaowei Bai
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Pull backlight updates from Lee Jones:
    "New Device Support
    - None

    New Functionality:
    - None

    Core Frameworks:
    - Reject legacy PWM request for device defined in DT

    Fix-ups:
    - Remove unnecessary MODULE_ALIAS(); adp8860_bl, adp8870_bl
    - Simplify code: pm8941-wled
    - Supply default-brightness logic; pm8941-wled

    Bug Fixes:
    - Clean up OF node; 88pm860x_bl
    - Ensure struct is zeroed; lp855x_bl"

    * tag 'backlight-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
    backlight: pm8941-wled: Add default-brightness property
    backlight: pm8941-wled: Fix ptr_ret.cocci warnings
    backlight: pwm: Reject legacy PWM request for device defined in DT
    backlight: 88pm860x_bl: Add missing of_node_put
    backlight: adp8870: Remove unnecessary MODULE_ALIAS()
    backlight: adp8860: Remove unnecessary MODULE_ALIAS()
    backlight: lp855x: Make sure props struct is zeroed

    Linus Torvalds
     
  • Commit b158b69a3765 ("mfd: rtsx: Simplify function return logic")
    removed the use of the 'err' variable, but left the variable itself
    around, resulting in gcc quite reasonably warning:

    drivers/mfd/rtsx_pcr.c: In function ‘rtsx_pci_set_pull_ctl’:
    drivers/mfd/rtsx_pcr.c:565:6: warning: unused variable ‘err’ [-Wunused-variable]
    int err;
    ^

    Get rid of the unused variable, and avoid the new warning.

    Cc: Javier Martinez Canillas
    Cc: Lee Jones
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull MFD updates from Lee Jones:
    "New Device Support:
    - Add support for 88pm860; 88pm80x
    - Add support for 24c08 EEPROM; at24
    - Add support for Broxton Whiskey Cove; intel*
    - Add support for RTS522A; rts5227
    - Add support for I2C devices; intel_quark_i2c_gpio

    New Functionality:
    - Add microphone support; arizona
    - Add general purpose switch support; arizona
    - Add fuel-gauge support; da9150-core
    - Add shutdown support; sec-core
    - Add charger support; tps65217
    - Add flexible serial communication unit support; atmel-flexcom
    - Add power button support; axp20x
    - Add led-flash support; rt5033

    Core Frameworks:
    - Supply a generic macro for defining Regmap IRQs
    - Rework ACPI child device matching

    Fix-ups:
    - Use Regmap to access registers; tps6105x
    - Use DEFINE_RES_IRQ_NAMED() macro; da9150
    - Re-arrange device registration order; intel_quark_i2c_gpio
    - Allow OF matching; cros_ec_i2c, atmel-hlcdc, hi6421-pmic, max8997, sm501
    - Handle deferred probe; twl6040
    - Improve accuracy of headphone detect; arizona
    - Unnecessary MODULE_ALIAS() removal; bcm590xx, rt5033
    - Remove unused code; htc-i2cpld, arizona, pcf50633-irq, sec-core
    - Simplify code; kempld, rts5209, da903x, lm3533, da9052, arizona
    - Remove #iffery; arizona
    - DT binding adaptions; many

    Bug Fixes:
    - Fix possible NULL pointer dereference; wm831x, tps6105x
    - Fix 64bit bug; intel_soc_pmic_bxtwc
    - Fix signedness issue; arizona"

    * tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
    bindings: mfd: s2mps11: Add documentation for s2mps15 PMIC
    mfd: sec-core: Remove unused s2mpu02-rtc and s2mpu02-clk children
    extcon: arizona: Add extcon specific device tree binding document
    MAINTAINERS: Add binding docs for Cirrus Logic/Wolfson Arizona devices
    mfd: arizona: Remove bindings covered in new subsystem specific docs
    mfd: rt5033: Add RT5033 Flash led sub device
    mfd: lpss: Add Intel Broxton PCI IDs
    mfd: lpss: Add Broxton ACPI IDs
    mfd: arizona: Signedness bug in arizona_runtime_suspend()
    mfd: axp20x: Add a cell for the power button part of the, axp288 PMICs
    mfd: dt-bindings: Document pulled down WRSTBI pin on S2MPS1X
    mfd: sec-core: Disable buck voltage reset on watchdog falling edge
    mfd: sec-core: Dump PMIC revision to find out the HW
    mfd: arizona: Use correct type ID for device tree config
    mfd: arizona: Remove use of codec build config #ifdefs
    mfd: arizona: Simplify adding subdevices
    mfd: arizona: Downgrade type mismatch messages to dev_warn
    mfd: arizona: Factor out checking of jack detection state
    mfd: arizona: Factor out DCVDD isolation control
    mfd: Make TPS6105X select REGMAP_I2C
    ...

    Linus Torvalds
     
  • It turns out that we still have issues with the EFI memory map that ends
    up polluting our kernel page tables with writable executable pages.

    That will get sorted out, but in the meantime let's not make the scary
    complaint about them be on by default. The code is useful for
    developers, but not ready for end user testing yet.

    Acked-by: Borislav Petkov
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Nov, 2015

23 commits

  • …linux-platform-drivers-x86

    Pull x86 platform driver update from Darren Hart:
    "Various toshiba hotkey and keyboard related fixes and a new WMI
    driver. Several intel_scu_ipc cleanups and a locking fix. A
    smattering of small single fixes across various platforms.

    I was asked to pick up an OLPC cleanup as the driver appeared
    unmaintained and it seemed similar to what is maintained in
    platform/drivers/x86. I have included the patch and an update to the
    MAINTAINERS file.

    toshiba_acpi:
    - Initialize hotkey_event_type variable
    - Remove unneeded u32 variables from *setup_keyboard
    - Add 0x prefix to available_kbd_modes_show function
    - Change default Hotkey enabling value
    - Unify hotkey enabling functions

    toshiba-wmi:
    - Toshiba WMI Hotkey Driver

    intel_scu_ipc:
    - Protect dev member assignment on ->remove()
    - Switch to use module_pci_driver() macro
    - Convert to use struct device *
    - Propagate pointer to struct intel_scu_ipc_dev
    - Fix error path by turning to devm_* / pcim_*

    acer-wmi:
    - remove threeg and interface sysfs interfaces

    OLPC:
    - Use %*ph specifier instead of passing direct values

    MAINTAINERS:
    - Add drivers/platform/olpc to drivers/platform/x86

    sony-laptop:
    - Fix handling sony_nc_hotkeys_decode result

    intel_mid_powerbtn:
    - Remove misuse of IRQF_NO_SUSPEND flag

    compal-laptop:
    - Add charge control limit

    asus-wmi:
    - restore kbd led level after resume"

    * tag 'platform-drivers-x86-v4.4-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
    toshiba_acpi: Initialize hotkey_event_type variable
    intel_scu_ipc: Protect dev member assignment on ->remove()
    intel_scu_ipc: Switch to use module_pci_driver() macro
    intel_scu_ipc: Convert to use struct device *
    intel_scu_ipc: Propagate pointer to struct intel_scu_ipc_dev
    intel_scu_ipc: Fix error path by turning to devm_* / pcim_*
    acer-wmi: remove threeg and interface sysfs interfaces
    OLPC: Use %*ph specifier instead of passing direct values
    MAINTAINERS: Add drivers/platform/olpc to drivers/platform/x86
    platform/x86: Toshiba WMI Hotkey Driver
    sony-laptop: Fix handling sony_nc_hotkeys_decode result
    intel_mid_powerbtn: Remove misuse of IRQF_NO_SUSPEND flag
    compal-laptop: Add charge control limit
    asus-wmi: restore kbd led level after resume
    toshiba_acpi: Remove unneeded u32 variables from *setup_keyboard
    toshiba_acpi: Add 0x prefix to available_kbd_modes_show function
    toshiba_acpi: Change default Hotkey enabling value
    toshiba_acpi: Unify hotkey enabling functions

    Linus Torvalds
     
  • Pull powerpc updates from Michael Ellerman:

    - Kconfig: remove BE-only platforms from LE kernel build from Boqun
    Feng
    - Refresh ps3_defconfig from Geoff Levand
    - Emit GNU & SysV hashes for the vdso from Michael Ellerman
    - Define an enum for the bolted SLB indexes from Anshuman Khandual
    - Use a local to avoid multiple calls to get_slb_shadow() from Michael
    Ellerman
    - Add gettimeofday() benchmark from Michael Neuling
    - Avoid link stack corruption in __get_datapage() from Michael Neuling
    - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
    K.V
    - Add ppc64le_defconfig from Michael Ellerman
    - pseries: extract of_helpers module from Andy Shevchenko
    - Correct string length in pseries_of_derive_parent() from Nathan
    Fontenot
    - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
    - Shorten irq_chip name for the SIU from Christophe Leroy
    - Wait 1s for secondaries to enter OPAL during kexec from Samuel
    Mendoza-Jonas
    - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
    - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
    Colin Ian King
    - Disable hugepd for 64K page size, from Aneesh Kumar K.V
    - Differentiate between hugetlb and THP during page walk from Aneesh
    Kumar K.V
    - Make PCI non-optional for pseries from Michael Ellerman
    - Individual System V IPC system calls from Sam Bobroff
    - Add selftest of unmuxed IPC calls from Michael Ellerman
    - discard .exit.data at runtime from Stephen Rothwell
    - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
    Gortmaker
    - Use of_get_next_parent to simplify code from Christophe Jaillet
    - Paginate some xmon output from Sam Bobroff
    - Add some more elements to the xmon PACA dump from Michael Ellerman
    - Allow the tm-syscall selftest to build with old headers from Michael
    Ellerman
    - Run EBB selftests only on POWER8 from Denis Kirjanov
    - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
    Ellerman
    - Avoid reference to potentially freed memory in prom.c from Christophe
    Jaillet
    - Quieten boot wrapper output with run_cmd from Geoff Levand
    - EEH fixes and cleanups from Gavin Shan
    - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
    - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
    Ellerman
    - Fix section mismatch warning in msi_bitmap_alloc() from Denis
    Kirjanov
    - Fix ps3-lpm white space from Rudhresh Kumar J
    - Fix ps3-vuart null dereference from Colin King
    - nvram: Add missing kfree in error path from Christophe Jaillet
    - nvram: Fix function name in some errors messages, from Christophe
    Jaillet
    - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
    Koskinen
    - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
    - cxl: Free virtual PHB when removing from Andrew Donnellan
    - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
    Michael Ellerman
    - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
    with O= from Michael Ellerman
    - Freescale updates from Scott: Highlights include 64-bit book3e
    kexec/kdump support, a rework of the qoriq clock driver, device tree
    changes including qoriq fman nodes, support for a new 85xx board, and
    some fixes.
    - MPC5xxx updates from Anatolij: Highlights include a driver for
    MPC512x LocalPlus Bus FIFO with its device tree binding
    documentation, mpc512x device tree updates and some minor fixes.

    * tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
    powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
    powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
    powerpc/pseries: Correct string length in pseries_of_derive_parent()
    powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
    powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
    powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
    powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
    powerpc: handle error case in cpm_muram_alloc()
    powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
    powerpc/book3e-64: Enable kexec
    powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
    powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
    powerpc/book3e-64/kexec: Enable SMP release
    powerpc/book3e-64/kexec: create an identity TLB mapping
    powerpc/book3e-64: Don't limit paca to 256 MiB
    powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
    powerpc/book3e: support CONFIG_RELOCATABLE
    powerpc/booke64: Fix args to copy_and_flush
    powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
    powerpc/e6500: kexec: Handle hardware threads
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton: (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • This fixes a bug from commit f3f86e33dc3d ("vfs: Fix pathological
    performance case for __alloc_fd()").

    v2: refactor to share fd bitmap copying code
    Signed-off-by: Eric Biggers
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Test the mmap() flag, and the mlockall() flag. These tests ensure that
    pages are not faulted in until they are accessed, that the pages are
    unevictable once faulted in, and that VMA splitting and merging works with
    the new VM flag. The second test ensures that mlock limits are respected.
    Note that the limit test needs to be run as a normal user.

    Also add tests to use the new mlock2 family of system calls.

    [treding@nvidia.com: fix mlock2-tests for 32-bit architectures]
    [treding@nvidia.com: ensure the mlock2 syscall number can be found]
    [treding@nvidia.com: use the right arguments for main()]
    Signed-off-by: Eric B Munson
    Cc: Shuah Khan
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Stephen Rothwell
    Signed-off-by: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The previous patch introduced a flag that specified pages in a VMA should
    be placed on the unevictable LRU, but they should not be made present when
    the area is created. This patch adds the ability to set this state via
    the new mlock system calls.

    We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
    MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
    MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
    When used with MCL_CURRENT, all current mappings will be marked with
    VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
    will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
    MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
    marked with VM_LOCKED | VM_LOCKONFAULT.

    Prior to this patch, mlockall() will unconditionally clear the
    mm->def_flags any time it is called without MCL_FUTURE. This behavior is
    maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
    followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
    new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
    in either mlockall() invocation.

    munlock() will unconditionally clear both VMA flags. munlockall()
    unconditionally clears both VMA flags on all VMAs and in the
    mm->def_flags field.
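
    A hedged userspace usage sketch (error handling omitted; MLOCK_ONFAULT
    is defined locally in case the libc headers predate this series, and
    SYS_mlock2 requires headers that know about the new syscall):

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01              /* value from the uapi header */
    #endif

    int main(void)
    {
            size_t len = 64UL << 20;        /* 64 MiB, mostly untouched */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Lock pages only as they are faulted in */
            syscall(SYS_mlock2, buf, len, MLOCK_ONFAULT);

            /* Alternatively, for every current and future mapping:
             * mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT); */
            return 0;
    }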

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (probably applies to other statistical or
    graphical models as well). For the security example, any application
    transacting in data that cannot be swapped out (credit card data,
    medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • mlock() allows a user to control page out of program memory, but this
    comes at the cost of faulting in the entire mapping when it is allocated.
    For large mappings where the entire area is not necessary this is not
    ideal. Instead of forcing all locked pages to be present when they are
    allocated, this set creates a middle ground. Pages are marked to be
    placed on the unevictable LRU (locked) when they are first used, but they
    are not faulted in by the mlock call.

    This series introduces a new mlock() system call that takes a flags
    argument along with the start address and size. This flags argument gives
    the caller the ability to request memory be locked in the traditional way,
    or to be locked after the page is faulted in. A new MCL flag is added to
    mirror the lock on fault behavior from mlock() in mlockall().

    There are two main use cases that this set covers. The first is the
    security focussed mlock case. A buffer is needed that cannot be written
    to swap. The maximum size is known, but on average the memory used is
    significantly less than this maximum. With lock on fault, the buffer is
    guaranteed to never be paged out without consuming the maximum size every
    time such a buffer is created.

    The second use case is focussed on performance. Portions of a large file
    are needed and we want to keep the used portions in memory once accessed.
    This is the case for large graphical models where the path through the
    graph is not known until run time. The entire graph is unlikely to be
    used in a given invocation, but once a node has been used it needs to stay
    resident for further processing. Given these constraints we have a number
    of options. We can potentially waste a large amount of memory by mlocking
    the entire region (this can also cause a significant stall at startup as
    the entire file is read in). We can mlock every page as we access them
    without tracking if the page is already resident but this introduces large
    overhead for each access. The third option is mapping the entire region
    with PROT_NONE and using a signal handler for SIGSEGV to
    mprotect(PROT_READ) and mlock() the needed page. Doing this page at a
    time adds a significant performance penalty. Batching can be used to
    mitigate this overhead, but in order to safely avoid trying to mprotect
    pages outside of the mapping, the boundaries of each mapping to be used in
    this way must be tracked and available to the signal handler. This is
    precisely what the mm system in the kernel should already be doing.

    For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
    mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is
    created not when the pages are faulted in. For mlockall(MCL_ONFAULT) the
    user is charged as if MCL_FUTURE was used. This decision was made to keep
    the accounting checks out of the page fault path.

    To illustrate the benefit of this set I wrote a test program that mmaps a
    5 GB file filled with random data and then makes 15,000,000 accesses to
    random addresses in that mapping. The test program was run 20 times for
    each setup. Results are reported for two program portions, setup and
    execution. The setup phase is calling mmap and optionally mlock on the
    entire region. For most experiments this is trivial, but it highlights
    the cost of faulting in the entire region. Results are averages across
    the 20 runs in milliseconds.

    mmap with mlock(MLOCK_LOCKED) on entire range:
    Setup avg: 8228.666
    Processing avg: 8274.257

    mmap with mlock(MLOCK_LOCKED) before each access:
    Setup avg: 0.113
    Processing avg: 90993.552

    mmap with PROT_NONE and signal handler and batch size of 1 page:
    With the default value in max_map_count, this gets ENOMEM as I attempt
    to change the permissions, after upping the sysctl significantly I get:
    Setup avg: 0.058
    Processing avg: 69488.073

    mmap with PROT_NONE and signal handler and batch size of 8 pages:
    Setup avg: 0.068
    Processing avg: 38204.116

    mmap with PROT_NONE and signal handler and batch size of 16 pages:
    Setup avg: 0.044
    Processing avg: 29671.180

    mmap with mlock(MLOCK_ONFAULT) on entire range:
    Setup avg: 0.189
    Processing avg: 17904.899

    The signal handler in the batch cases faulted in memory in two steps to
    avoid having to know the start and end of the faulting mapping. The first
    step covers the page that caused the fault as we know that it will be
    possible to lock. The second step speculatively tries to mlock and
    mprotect the batch size - 1 pages that follow. There may be a clever way
    to avoid this without having the program track each mapping to be covered
    by this handler in a globally accessible structure, but I could not find
    it. It should be noted that with a large enough batch size this two step
    fault handler can still cause the program to crash if it reaches far
    beyond the end of the mapping.

    These results show that if the developer knows that a majority of the
    mapping will be used, it is better to try and fault it in at once,
    otherwise mlock(MLOCK_ONFAULT) is significantly faster.

    The performance cost of these patches are minimal on the two benchmarks I
    have tested (stream and kernbench). The following are the average values
    across 20 runs of stream and 10 runs of kernbench after a warmup run whose
    results were discarded.

    Avg throughput in MB/s from stream using 1000000 element arrays
    Test                 4.2-rc1    4.2-rc1+lock-on-fault
    Copy:               10,566.5                   10,421
    Scale:                10,685                 10,503.5
    Add:                12,044.1                 11,814.2
    Triad:              12,064.8                 11,846.3

    Kernbench optimal load
                         4.2-rc1    4.2-rc1+lock-on-fault
    Elapsed Time          78.453                   78.991
    User Time            64.2395                  65.2355
    System Time           9.7335                   9.7085
    Context Switches     22211.5                  22412.1
    Sleeps               14965.3                  14956.1

    This patch (of 6):

    Extending the mlock system call is very difficult because it currently
    does not take a flags argument. A later patch in this set will extend
    mlock to support a middle ground between pages that are locked and faulted
    in immediately and unlocked pages. To pave the way for the new system
    call, the code needs some reorganization so that all the actual entry
    points do is check input and translate it to VMA flags.
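
    A hedged sketch of the refactored shape, where each entry point merely
    translates its input into VM flags for a shared implementation:

    /* do_mlock() holds the common logic; the syscalls become wrappers. */
    SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
    {
            return do_mlock(start, len, VM_LOCKED);
    }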

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Michael Kerrisk
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Jonathan Corbet
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
    Currently we already taint the kernel in some cases. E.g. if we hit some
    bug in slub memory we call object_err() which will taint the kernel with
    the TAINT_BAD_PAGE flag. But for other kinds of bugs the kernel is left
    untainted.

    Always taint with TAINT_BAD_PAGE if KASAN finds a bug. This is useful
    for automated testing.
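
    The change is essentially one line in the report path; a hedged sketch:

    /* Taint on every KASAN report so test automation can notice. */
    add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);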

    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Reviewed-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • It's recommended to have slub's user tracking enabled with CONFIG_KASAN,
    because:

    a) User tracking disables slab merging, which improves
    detection of out-of-bounds accesses.
    b) User tracking metadata acts as a redzone, which also improves
    detection of out-of-bounds accesses.
    c) User tracking provides additional information about the object.
    This information helps to understand bugs.

    Currently it is not enabled by default. Besides recompiling the kernel
    with KASAN and reinstalling it, users also have to change the boot
    cmdline, which is not very handy.

    Enable slub user tracking by default with KASAN=y, since there is no good
    reason to not do this.
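
    A hedged sketch of the default (variable and flag names follow mm/slub.c
    conventions but are assumptions here):

    #ifdef CONFIG_KASAN
    static int slub_debug = SLAB_STORE_USER;        /* user tracking on */
    #else
    static int slub_debug;
    #endif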

    [akpm@linux-foundation.org: little fixes, per David]
    Signed-off-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Use IS_ALIGNED() to determine whether the shadow spans two bytes. It
    generates less code and is more readable. Also add some comments in the
    shadow check functions.
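
    The underlying test is tiny; a hedged sketch (the helper name is invented
    for illustration):

    /* An 8-byte access needs a second shadow byte only when the address
     * is not aligned to the 8-byte shadow granularity. */
    static bool shadow_spans_two_bytes(unsigned long addr)
    {
            return !IS_ALIGNED(addr, KASAN_SHADOW_SCALE_SIZE);
    }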

    Signed-off-by: Xishi Qiu
    Acked-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
    The current KASAN code cannot find the following out-of-bounds bugs:

    char *ptr;
    ptr = kmalloc(8, GFP_KERNEL);
    memset(ptr+7, 0, 2);
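    /* ptr[7] is the last valid byte; ptr[8] is one byte out of bounds */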

    the cause of the problem is a type conversion error in the
    *memory_is_poisoned_n* function. This patch fixes that.

    Signed-off-by: Wang Long
    Acked-by: Andrey Ryabinin
    Cc: Vladimir Murzin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • Add some out of bounds testcases to test_kasan module.
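
    A hedged sketch of what such a testcase looks like (the size and names
    are illustrative, not the exact additions):

    static noinline void __init kmalloc_oob_right(void)
    {
            char *ptr = kmalloc(17, GFP_KERNEL);

            if (!ptr)
                    return;

            ptr[17] = 'x';  /* one byte past the end; KASAN should report */
            kfree(ptr);
    }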

    Signed-off-by: Wang Long
    Acked-by: Andrey Ryabinin
    Cc: Vladimir Murzin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • Update the reference to the kasan prototype repository on github, since it
    was renamed.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Move KASAN_SANITIZE in arch/x86/boot/Makefile above the comment
    related to SVGA_MODE, since the comment refers to 'the next line'.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • We decided to use KASAN as the short name of the tool and
    KernelAddressSanitizer as the full one. Update log messages according to
    that.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Makes KASAN accurately determine the type of the bad access. If the shadow
    byte value is in the [0, KASAN_SHADOW_SCALE_SIZE) range we can look at
    the next shadow byte to determine the type of the access.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
    Update the names of the bad access types to better reflect the type of
    the access that happened and make these error types "literals" that can
    be used for classification and deduplication in scripts.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Each access with address lower than
    kasan_shadow_to_mem(KASAN_SHADOW_START) is reported as user-memory-access.
    This is not always true; the accessed address might not be in user space.
    Fix this by reporting such accesses as null-ptr-derefs or
    wild-memory-accesses.
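
    A hedged sketch of the classification (the boundaries follow the usual
    PAGE_SIZE/TASK_SIZE split):

    const char *bug_type;

    if (addr < PAGE_SIZE)
            bug_type = "null-ptr-deref";
    else if (addr < TASK_SIZE)
            bug_type = "user-memory-access";
    else
            bug_type = "wild-memory-access";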

    There's another reason for this change. For userspace ASan we have a
    bunch of systems that analyze error types for the purpose of
    classification and deduplication. Sooner or later we will apply them to
    KASAN as well. Then clearly and explicitly stated error types will bring
    value.

    Signed-off-by: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Konstantin Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
    When we end up calling kasan_report in real mode, our shadow mapping for
    the spinlock variable will show poisoned. This will result in us calling
    kasan_report_error with the lock_report spin lock held. To prevent this,
    disable kasan reporting while printing a kasan error.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    We can't use generic functions like print_hex_dump to access the kasan
    shadow region. That would require us to set up another kasan shadow
    region for the address passed (the kasan shadow address), and some
    architectures won't be able to do that. Hence make a copy of the shadow
    region row and pass that to the generic functions.
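
    A hedged sketch (shadow_row and SHADOW_BYTES_PER_ROW are assumed names):
    copy the row into a local buffer so print_hex_dump() never touches the
    shadow mapping itself.

    char shadow_buf[SHADOW_BYTES_PER_ROW];

    memcpy(shadow_buf, shadow_row, SHADOW_BYTES_PER_ROW);
    print_hex_dump(KERN_ERR, "", DUMP_PREFIX_NONE,
                   SHADOW_BYTES_PER_ROW, 1, shadow_buf,
                   SHADOW_BYTES_PER_ROW, 0);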

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V