02 Sep, 2019

1 commit

  • Pull turbostat updates from Len Brown:
    "User-space turbostat (and x86_energy_perf_policy) patches.

    They are primarily bug fixes from users"

    * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
    tools/power turbostat: update version number
    tools/power turbostat: Add support for Hygon Fam 18h (Dhyana) RAPL
    tools/power turbostat: Fix caller parameter of get_tdp_amd()
    tools/power turbostat: Fix CPU%C1 display value
    tools/power turbostat: do not enforce 1ms
    tools/power turbostat: read from pipes too
    tools/power turbostat: Add Ice Lake NNPI support
    tools/power turbostat: rename has_hsw_msrs()
    tools/power turbostat: Fix Haswell Core systems
    tools/power turbostat: add Jacobsville support
    tools/power turbostat: fix buffer overrun
    tools/power turbostat: fix file descriptor leaks
    tools/power turbostat: fix leak of file descriptor on error return path
    tools/power turbostat: Make interval calculation per thread to reduce jitter
    tools/power turbostat: remove duplicate pc10 column
    tools/power x86_energy_perf_policy: Fix argument parsing
    tools/power: Fix typo in man page
    tools/power/x86: Enable compiler optimisations and Fortify by default
    tools/power x86_energy_perf_policy: Fix "uninitialized variable" warnings at -O2

    Linus Torvalds
     

01 Sep, 2019

22 commits

  • Today is 19.08.31, at least in some parts of the world.

    Signed-off-by: Len Brown

    Len Brown
     
  • Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD
    Fam 17h (Zen) RAPL") and the commit 3316f99a9f1b68c578c5 ("tools/power
    turbostat: Also read package power on AMD F17h (Zen)") add AMD Fam 17h
    RAPL support.

    Hygon Family 18h (Dhyana) supports RAPL, as indicated by bit 14 of CPUID
    0x80000007 EDX, and has the MSRs RAPL_PWR_UNIT/CORE_ENERGY_STAT/
    PKG_ENERGY_STAT. So add Hygon Dhyana Family 18h support for RAPL.

    Already tested on Hygon multi-node systems and it shows correct per-core
    energy usage and the total package power.

    Signed-off-by: Pu Wen
    Reviewed-by: Calvin Walton
    Signed-off-by: Len Brown

    Pu Wen
     
  • Commit 9392bd98bba760be96ee ("tools/power turbostat: Add support for AMD
    Fam 17h (Zen) RAPL") added a function get_tdp_amd() whose parameter is the
    CPU family. But the rapl_probe_amd() function passes the model instead.
    Fix the caller of get_tdp_amd() to pass the family.

    Cc: # v5.1+
    Signed-off-by: Pu Wen
    Reviewed-by: Calvin Walton
    Signed-off-by: Len Brown

    Pu Wen
     
  • In some cases C1% will show a wrong value when the platform doesn't have
    an MSR for C1 residency.

    For example:
    Core CPU CPU%c1
    - - 100.00
    0 0 100.00
    0 2 100.00
    1 1 100.00
    1 3 100.00

    But adding Busy% will fix this:
    Core CPU Busy% CPU%c1
    - - 99.77 0.23
    0 0 99.77 0.23
    0 2 99.77 0.23
    1 1 99.77 0.23
    1 3 99.77 0.23

    This issue can be reproduced on most of the recent systems including
    Broadwell, Skylake and later.

    This is because if we don't select Busy% or Avg_MHz or Bzy_MHz then
    mperf value will not be read from MSR, so it will be 0. But this
    is required for C1% calculation when MSR for C1 residency is not present.
    Same is true for C3, C6 and C7 column selection.

    So add another define DO_BIC_READ(), which doesn't depend on user
    column selection and use for mperf, C3, C6 and C7 related counters.
    So when there is no platform support for C1 residency counters,
    we still read these counters, if the CPU has support and user selected
    display of CPU%c1.

    Signed-off-by: Srinivas Pandruvada
    Signed-off-by: Len Brown

    Srinivas Pandruvada
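
    The dependency described in this commit can be sketched as follows. This
    is a simplified model, not turbostat's code: it assumes CPU%c1 is derived
    as whatever part of the TSC interval is not covered by mperf (busy cycles)
    or by the C3/C6/C7 residency counters. The function name and the numbers
    are hypothetical.

```python
def c1_percent(tsc, mperf, c3=0, c6=0, c7=0):
    """Derive CPU%c1 as the share of the interval not otherwise accounted for.

    Assumed model: c1 = tsc - mperf - c3 - c6 - c7, clamped at zero because
    the counters are not sampled atomically.
    """
    c1 = max(tsc - mperf - c3 - c6 - c7, 0)
    return 100.0 * c1 / tsc


# mperf was never read (user did not select Busy%), so it stays 0
# and the whole interval is attributed to C1.
print(f"{c1_percent(10000, 0):.2f}")

# mperf is read regardless of column selection, so busy time is subtracted.
print(f"{c1_percent(10000, 9977):.2f}")
```

    With mperf left at 0 the whole interval lands in C1, which matches the
    100.00 column in the first table of the commit message.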
     
  • Turbostat works by taking a snapshot of counters, sleeping, taking another
    snapshot, calculating deltas, and printing out the table.

    The sleep time is controlled via the -i option or by the user sending a
    signal or a character to stdin. In the latter case, turbostat always adds
    a 1 ms sleep before it reads the counters, in order to avoid larger
    imprecisions in the results it prints.

    While the 1 ms delay may be a good idea for a "dumb" user, it is a
    problem for an "aware" user. I do thousands and thousands of measurements
    over a short period of time (like 2 ms), and turbostat unconditionally adds
    1 ms to my interval, so I cannot get what I really need.

    This patch removes the unconditional 1 ms sleep. This is an expert user
    tool, after all, and non-experts are unlikely to ever use it in the
    non-fixed interval mode anyway, so I think it is OK to remove the 1 ms
    delay.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Len Brown

    Artem Bityutskiy
     
  • Commit '47936f944e78 tools/power turbostat: fix printing on input' made
    a valid fix, but it completely disabled piped stdin support, which is
    a valuable use-case. Indeed, if stdin is a pipe, turbostat won't read
    anything from it, so it becomes impossible to get turbostat output at
    user-defined moments instead of at regular intervals.

    There is no reason why this should work for terminals but not for
    pipes. This patch improves the situation. Instead of ignoring pipes, we
    read data from them but gracefully handle the EOF case.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Len Brown

    Artem Bityutskiy
     
  • This enables turbostat utility on Ice Lake NNPI SoC.

    Link: https://lkml.org/lkml/2019/6/5/1034
    Signed-off-by: Rajneesh Bhardwaj
    Signed-off-by: Len Brown

    Rajneesh Bhardwaj
     
  • Perhaps if this more descriptive name had been used,
    then we wouldn't have had the HSW ULT vs HSW CORE bug,
    fixed by the previous commit.

    Signed-off-by: Len Brown

    Len Brown
     
  • turbostat: cpu0: msr offset 0x630 read failed: Input/output error

    because Haswell Core does not have C8-C10.

    Output C8-C10 only on Haswell ULT.

    Fixes: f5a4c76ad7de ("tools/power turbostat: consolidate duplicate model numbers")

    Reported-by: Prarit Bhargava
    Suggested-by: Kosuke Tatsukawa
    Signed-off-by: Len Brown

    Len Brown
     
  • Jacobsville behaves like Denverton.

    Signed-off-by: Zhang Rui
    Signed-off-by: Len Brown

    Zhang Rui
     
  • turbostat can be terminated by a general protection fault on some recent
    hardware which (for example) supports 9 levels of C-states and shows 18
    "tADDED" lines. That bloats the total output and finally causes a buffer
    overrun. So let's extend the buffer to avoid this.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Len Brown

    Naoya Horiguchi
     
  • Fix file descriptor leaks by closing fp before return.

    Addresses-Coverity-ID: 1444591 ("Resource leak")
    Addresses-Coverity-ID: 1444592 ("Resource leak")
    Fixes: 5ea7647b333f ("tools/power turbostat: Warn on bad ACPI LPIT data")
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Prarit Bhargava
    Signed-off-by: Len Brown

    Gustavo A. R. Silva
     
  • Currently the error return path does not close the file fp and leaks
    a file descriptor. Fix this by closing the file.

    Fixes: 5ea7647b333f ("tools/power turbostat: Warn on bad ACPI LPIT data")
    Signed-off-by: Colin Ian King
    Signed-off-by: Len Brown

    Colin Ian King
     
  • Turbostat currently normalizes TSC and other values by dividing by an
    interval. This interval is the delta between the start of one global
    (all counters on all CPUs) sampling and the start of another. However,
    this introduces a lot of jitter into the data.

    In order to reduce jitter, the interval calculation should be based on
    timestamps taken per thread and close to the start of the thread's
    sampling.

    Define a per thread time value to hold the delta between samples taken
    on the thread.

    Use the timestamp taken at the beginning of sampling to calculate the
    delta.

    Move the thread's beginning timestamp to after the CPU migration to
    avoid jitter due to the migration.

    Use the global time delta for the average time delta.

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Len Brown

    Yazen Ghannam
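
    A small numeric sketch of why the per-thread interval reduces jitter. All
    timestamps and the counter rate are hypothetical, and times are in
    microseconds. The thread is sampled a little after each global sampling
    round starts, and the second sample is delayed more than the first.

```python
COUNTS_PER_US = 2  # hypothetical counter rate: 2 increments per microsecond

# Start of each global sampling round (all counters on all CPUs).
global_start_1, global_start_2 = 0, 1_000_000

# Moments at which this particular thread's counters were actually read.
thread_sample_1, thread_sample_2 = 4_000, 1_010_000

# The counter only knows about the time between its own two reads.
count_delta = COUNTS_PER_US * (thread_sample_2 - thread_sample_1)

# Old behaviour: normalize by the global interval.
rate_global = count_delta / (global_start_2 - global_start_1)

# New behaviour: normalize by the interval measured on the thread itself.
rate_per_thread = count_delta / (thread_sample_2 - thread_sample_1)

print(rate_global, rate_per_thread)
```

    The per-thread result recovers the true rate, while the global interval
    over-reports it because the thread's two reads were 6 ms further apart
    than the two round starts.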
     
  • Remove the duplicate pc10 column.

    Fixes: be0e54c4ebbf ("turbostat: Build-in "Low Power Idle" counters support")
    Reported-by: Naoya Horiguchi
    Signed-off-by: Len Brown

    Len Brown
     
  • The -w argument in x86_energy_perf_policy currently triggers an
    unconditional segfault.

    This is because the argument string reads: "+a:c:dD:E:e:f:m:M:rt:u:vw" and
    yet the argument handler expects an argument.

    When parse_optarg_string is called with a null argument, we then proceed to
    crash in strncmp, not horribly friendly.

    The man page describes -w as taking an argument, the long form
    (--hwp-window) is correctly marked as taking a required argument, and the
    code expects it.

    As such, this patch simply marks the short form (-w) as requiring an
    argument.

    Signed-off-by: Zephaniah E. Loss-Cutler-Hull
    Signed-off-by: Len Brown

    Zephaniah E. Loss-Cutler-Hull
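
    The effect of the missing colon can be reproduced with Python's getopt
    module, which follows the same option-string convention as C getopt(). This
    is only an illustration: the option string is shortened to "vw" and the
    argument value "5" is made up. In the C tool the handler receives a NULL
    optarg and crashes, whereas Python hands back an empty string.

```python
import getopt

argv = ["-w", "5"]  # hypothetical command line: -w with a value

# "w" without a trailing colon: -w is parsed as a plain flag,
# so its value is left behind as a positional argument.
opts_broken, rest_broken = getopt.getopt(argv, "vw")

# "w:" with a trailing colon: -w consumes the next word as its argument.
opts_fixed, rest_fixed = getopt.getopt(argv, "vw:")

print(opts_broken, rest_broken)
print(opts_fixed, rest_fixed)
```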
     
  • From context, we mean EPB (Energy Performance Bias).

    Signed-off-by: Matt Lupfer
    Signed-off-by: Len Brown

    Matt Lupfer
     
  • Compiling without optimisations is silly, especially since some
    warnings depend on the optimiser. Use -O2.

    Fortify adds warnings for unchecked I/O (among other things), which
    seems to be a good idea for user-space code. Enable that too.

    Signed-off-by: Ben Hutchings
    Signed-off-by: Len Brown

    Ben Hutchings
     
  • x86_energy_perf_policy first uses __get_cpuid() to check the maximum
    CPUID level and exits if it is too low. It then assumes that later
    calls will succeed (which I think is architecturally guaranteed). It
    also assumes that CPUID works at all (which is not guaranteed on
    x86_32).

    If optimisations are enabled, gcc warns about potentially
    uninitialized variables. Fix this by adding an exit-on-error after
    every call to __get_cpuid() instead of just checking the maximum
    level.

    Signed-off-by: Ben Hutchings
    Signed-off-by: Len Brown

    Ben Hutchings
     
  • Pull i2c fixes from Wolfram Sang:
    "I2C has a bunch of driver fixes and a core improvement to make the
    on-going API transition more robust"

    * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    i2c: mediatek: disable zero-length transfers for mt8183
    i2c: iproc: Stop advertising support of SMBUS quick cmd
    MAINTAINERS: i2c mv64xxx: Update documentation path
    i2c: piix4: Fix port selection for AMD Family 16h Model 30h
    i2c: designware: Synchronize IRQs when unregistering slave client
    i2c: i801: Avoid memory leak in check_acpi_smo88xx_device()
    i2c: make i2c_unregister_device() ERR_PTR safe

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:
    "Small fixes and minor cleanups for tracing:

    - Make exported ftrace function not static

    - Fix NULL pointer dereference in reading probes as they are created

    - Fix NULL pointer dereference in k/uprobe clean up path

    - Various documentation fixes"

    * tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Correct kdoc formats
    ftrace/x86: Remove mcount() declaration
    tracing/probe: Fix null pointer dereference
    tracing: Make exported ftrace_set_clr_event non-static
    ftrace: Check for successful allocation of hash
    ftrace: Check for empty hash and comment the race with registering probes
    ftrace: Fix NULL pointer dereference in t_probe_next()

    Linus Torvalds
     
  • Pull RISC-V fix from Paul Walmsley:
    "One significant fix for 32-bit RISC-V systems:

    Fix the RV32 memory map to prevent userspace from corrupting the
    FIXMAP area. Without this patch, the system can crash very early
    during the boot"

    * tag 'riscv/for-v5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
    RISC-V: Fix FIXMAP area corruption on RV32 systems

    Linus Torvalds
     

31 Aug, 2019

17 commits

  • Pull KVM fixes from Radim Krčmář:
    "PPC:
    - Fix bug which could leave locks held in the host on return to a
    guest.

    x86:
    - Prevent infinitely looping emulation of a failing syscall while
    single stepping.

    - Do not crash the host when nesting is disabled"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: Don't update RIP or do single-step on faulting emulation
    KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when kvm_intel.nested is disabled
    KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling

    Linus Torvalds
     
  • Merge misc mm fixes from Andrew Morton:
    "7 fixes"

    * emailed patches from Andrew Morton :
    mm: memcontrol: fix percpu vmstats and vmevents flush
    mm, memcg: do not set reclaim_state on soft limit reclaim
    mailmap: add aliases for Dmitry Safonov
    mm/z3fold.c: fix lock/unlock imbalance in z3fold_page_isolate
    mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
    mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
    mm: memcontrol: flush percpu slab vmstats on kmem offlining

    Linus Torvalds
     
  • Fix the following kdoc warnings:

    kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single'
    kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single'
    kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single'
    kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer'
    kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo'
    kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo'
    kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch'
    kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch'
    kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch'
    kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk'
    kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk'
    kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk'

    Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com

    Signed-off-by: Jakub Kicinski
    Signed-off-by: Steven Rostedt (VMware)

    Jakub Kicinski
     
  • Commit 562e14f72292 ("ftrace/x86: Remove mcount support") removed the
    support for using mcount, so we could remove the mcount() declaration
    to clean up.

    Link: http://lkml.kernel.org/r/20190826170150.10f101ba@xhacker.debian

    Signed-off-by: Jisheng Zhang
    Signed-off-by: Steven Rostedt (VMware)

    Jisheng Zhang
     
  • BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0
    Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746
    trace_probe_cleanup+0x8d/0xd0
    free_trace_kprobe.part.14+0x15/0x50
    alloc_trace_kprobe+0x23e/0x250

    Link: http://lkml.kernel.org/r/1565220563-980-1-git-send-email-danielliu861@gmail.com

    Fixes: e3dc9f898ef9c ("tracing/probe: Add trace_event_call accesses APIs")
    Signed-off-by: Xinpeng Liu
    Signed-off-by: Steven Rostedt (VMware)

    Xinpeng Liu
     
  • The function ftrace_set_clr_event is declared static and marked
    EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because it was
    decided that the function is part of the API, this commit removes the
    static attribute and adds the declaration to the header.

    Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com

    Fixes: f45d1225adb04 ("tracing: Kernel access to Ftrace instances")
    Reviewed-by: Joe Jin
    Signed-off-by: Denis Efremov
    Signed-off-by: Steven Rostedt (VMware)

    Denis Efremov
     
  • Pull crypto fix from Herbert Xu:
    "Fix a potential crash in the ccp driver"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: ccp - Ignore unconfigured CCP device on suspend/resume

    Linus Torvalds
     
  • Commit dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()") made
    the kfifo code round the number of elements up. That was good for
    __kfifo_alloc(), but it's actually wrong for __kfifo_init().

    The difference? __kfifo_alloc() will allocate the rounded-up number of
    elements, but __kfifo_init() uses an allocation done by the caller. We
    can't just say "use more elements than the caller allocated", and have
    to round down.

    The good news? All the normal cases will be using power-of-two arrays
    anyway, and most users of kfifo's don't use kfifo_init() at all, but one
    of the helper macros to declare a KFIFO that enforce the proper
    power-of-two behavior. But it looks like at least ibmvscsis might be
    affected.

    The bad news? Will Deacon refers to an old thread and points out
    that the memory ordering in kfifo's is questionable. See

    https://lore.kernel.org/lkml/20181211034032.32338-1-yuleixzhang@tencent.com/

    for more.

    Fixes: dfe2a77fd243 ("kfifo: fix kfifo_alloc() and kfifo_init()")
    Reported-by: laokz
    Cc: Stefani Seibold
    Cc: Andrew Morton
    Cc: Dan Carpenter
    Cc: Greg KH
    Cc: Kees Cook
    Cc: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
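
    The rounding argument can be illustrated with a short sketch. The two
    helpers below are plain Python stand-ins written for this example (they
    are not the kernel's implementations), and the element count of 6 is a
    hypothetical caller-provided buffer size.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def rounddown_pow_of_two(n):
    """Largest power of two that is <= n, for n >= 1."""
    return 1 << (n.bit_length() - 1)


caller_elements = 6  # hypothetical buffer size, not a power of two

# Allocating path: the fifo allocates its own storage, so rounding up is safe.
alloc_size = roundup_pow_of_two(caller_elements)

# Init path: the storage belongs to the caller, so the fifo must not
# claim more elements than were provided and has to round down.
init_size = rounddown_pow_of_two(caller_elements)

print(alloc_size, init_size)
```

    Rounding up on the init path would make the fifo index 8 slots in a
    6-element buffer. For sizes that are already a power of two both helpers
    return the input unchanged, which is why the common cases are unaffected.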
     
  • Instead of using raw_cpu_read(), use per_cpu() to read the actual data of
    the corresponding cpu; otherwise we will be reading the data of the
    current cpu once for each online CPU.

    Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
    Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
    Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
    Signed-off-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
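
    A toy model of the bug, with per-CPU data represented as a plain list.
    The counter values, the index of the current CPU and both function names
    are hypothetical and chosen only for illustration.

```python
percpu_counts = [3, 5, 7, 11]  # hypothetical per-CPU counter values
current_cpu = 1                # CPU on which the flush happens to run


def flush_buggy():
    # Mirrors the bug: the loop visits every CPU, but each iteration
    # reads the current CPU's value instead of the visited CPU's value.
    return sum(percpu_counts[current_cpu] for _cpu in range(len(percpu_counts)))


def flush_fixed():
    # Mirrors the fix: each iteration reads the data of the CPU it visits.
    return sum(percpu_counts[cpu] for cpu in range(len(percpu_counts)))


print(flush_buggy(), flush_fixed())
```

    The buggy version returns the current CPU's value multiplied by the
    number of CPUs rather than the true total.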
     
  • Adric Blake has noticed[1] the following warning:

    WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
    [...]
    Call Trace:
    mem_cgroup_shrink_node+0x9b/0x1d0
    mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
    balance_pgdat+0x276/0x540
    kswapd+0x200/0x3f0
    ? wait_woken+0x80/0x80
    kthread+0xfd/0x130
    ? balance_pgdat+0x540/0x540
    ? kthread_park+0x80/0x80
    ret_from_fork+0x35/0x40
    ---[ end trace 727343df67b2398a ]---

    which tells us that soft limit reclaim is about to overwrite the
    reclaim_state configured up in the call chain (kswapd in this case but
    the direct reclaim is equally possible). This means that reclaim stats
    would get misleading once the soft reclaim returns and another reclaim
    is done.

    Fix the warning by dropping set_task_reclaim_state from the soft reclaim
    which is always called with reclaim_state set up.

    [1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com

    Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Adric Blake
    Acked-by: Yafang Shao
    Acked-by: Yang Shi
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • I don't work for Virtuozzo or Samsung anymore and I've noticed that they
    have started sending annoying html email-replies.

    And I prioritize my personal emails over work email box, so while at it
    add an entry for Arista too - so I can reply faster when needed.

    Link: http://lkml.kernel.org/r/20190827220346.11123-1-dima@arista.com
    Signed-off-by: Dmitry Safonov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • Fix lock/unlock imbalance by unlocking *zhdr* before return.

    Addresses Coverity ID 1452811 ("Missing unlock")

    Link: http://lkml.kernel.org/r/20190826030634.GA4379@embeddedor
    Fixes: d776aaa9895e ("mm/z3fold.c: fix race between migration and destruction")
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Cc: Henry Burns
    Cc: Vitaly Wool
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • …h the hierarchical ones"

    Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync
    with the hierarchical ones") effectively decreased the precision of
    per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.

    That's good for displaying in memory.stat, but brings a serious
    regression into the reclaim process.

    One issue I've discovered and debugged is the following:
    lruvec_lru_size() can return 0 instead of the actual number of pages in
    the lru list, preventing the kernel from reclaiming the last remaining
    pages. The result is yet another flood of dying memory cgroups. The
    opposite also happens: scanning an empty lru list is a waste of cpu time.

    Also, inactive_list_is_low() can return incorrect values, preventing the
    active lru from being scanned and freed. It can fail both because the
    sizes of the active and inactive lists are inaccurate, and because the
    number of workingset refaults isn't precise. In other words, the result is
    pretty random.

    I'm not sure if using the approximate number of slab pages in
    count_shadow_number() is acceptable, but the issues described above are
    enough to partially revert the patch.

    Let's keep per-memcg vmstat_local batched (they are only used for
    displaying stats to the userspace), but keep lruvec stats precise. This
    change fixes the dead memcg flooding on my setup.

    Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
    Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Roman Gushchin
     
  • Fixes: 701d678599d0c1 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
    Link: http://lkml.kernel.org/r/201908251039.5oSbEEUT%25lkp@intel.com
    Reported-by: kbuild test robot
    Cc: Sergey Senozhatsky
    Cc: Henry Burns
    Cc: Minchan Kim
    Cc: Shakeel Butt
    Cc: Jonathan Adams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • I've noticed that the "slab" value in memory.stat is sometimes 0, even
    if some children memory cgroups have a non-zero "slab" value. The
    following investigation showed that this is the result of the kmem_cache
    reparenting in combination with the per-cpu batching of slab vmstats.

    At offlining, some vmstat values may remain in the percpu cache without
    being propagated up the cgroup hierarchy. This means that stats
    on ancestor levels are lower than the actual values. Later, when slab
    pages are released, the precise number of pages is subtracted on the
    parent level, making the value negative. We don't show negative values,
    so 0 is printed instead.

    To fix this issue, let's flush percpu slab memcg and lruvec stats on
    memcg offlining. This guarantees that numbers on all ancestor levels
    are accurate and match the actual number of outstanding slab pages.

    Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
    Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
    Signed-off-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • In register_ftrace_function_probe(), we are not checking the return
    value of alloc_and_copy_ftrace_hash(). The subsequent call to
    ftrace_match_records() may end up dereferencing the same. Add a check to
    ensure this doesn't happen.

    Link: http://lkml.kernel.org/r/26e92574f25ad23e7cafa3cf5f7a819de1832cbe.1562249521.git.naveen.n.rao@linux.vnet.ibm.com

    Cc: stable@vger.kernel.org
    Fixes: 1ec3a81a0cf42 ("ftrace: Have each function probe use its own ftrace_ops")
    Signed-off-by: Naveen N. Rao
    Signed-off-by: Steven Rostedt (VMware)

    Naveen N. Rao
     
  • The race between adding a function probe and reading the probes that exist
    is very subtle. It needs a comment. The issue can also happen if the
    probe has the EMPTY_HASH as its func_hash.

    Cc: stable@vger.kernel.org
    Fixes: 7b60f3d876156 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)