07 Sep, 2016

5 commits

  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160818125731.27256-7-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Install the callbacks via the state machine.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Cc: Christoph Lameter
    Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Install the callbacks via the state machine. They are installed at run time but
    relay_prepare_cpu() does not need to be invoked by the boot CPU because
    relay_open() was not yet invoked and there are no pools that need to be created.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20160818125731.27256-3-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Richard Weinberger
     
  • relay essentially needs to maintain a per CPU array of channel buffer
    pointers but it manually creates that array. Instead it's better to use
    the per CPU constructs, provided by the kernel, to allocate & access the
    array of pointers to channel buffers.

    Signed-off-by: Akash Goel
    Reviewed-by: Chris Wilson
    Link: http://lkml.kernel.org/r/1470909140-25919-1-git-send-email-akash.goel@intel.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Akash Goel
     
  • All users are converted to state machine, remove CPU_STARTING and the
    corresponding CPU_DYING.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160818125731.27256-2-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

06 Sep, 2016

1 commit


05 Sep, 2016

1 commit

  • Some compilers are unhappy with the anon union in the state array. Replace
    it with a named union.

    While at it align the state array initializers properly and add the missing
    name tags.

    Fixes: cf392d10b69e "cpu/hotplug: Add multi instance support"
    Reported-by: Ingo Molnar
    Reported-by: Fengguang Wu
    Signed-off-by: Thomas Gleixner
    Cc: rt@linutronix.de

    Thomas Gleixner
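
The union change described above can be illustrated with a small userspace sketch. It is not the kernel's state array: the type names (`struct step`, `single_cb`, `multi_cb`), the step names and `run_step()` are hypothetical, chosen only to show a named union being filled in with designated initializers and name tags.

```c
#include <stddef.h>

/* Hypothetical callback types; the names are illustrative only. */
typedef int (*single_cb)(unsigned int cpu);
typedef int (*multi_cb)(unsigned int cpu, void *instance);

struct step {
    const char *name;            /* name tag for diagnostics */
    union {                      /* named union instead of an anonymous one */
        single_cb single;
        multi_cb  multi;
    } startup;
    int multi_instance;          /* non-zero if .startup.multi is the valid member */
};

/* Example callback: returns cpu + 1 so the call is observable. */
static int demo_startup(unsigned int cpu)
{
    return (int)cpu + 1;
}

/* Every initializer names the union explicitly and carries a name tag. */
static struct step steps[] = {
    [0] = {
        .name           = "offline",
        .startup.single = NULL,
    },
    [1] = {
        .name           = "demo:online",
        .startup.single = demo_startup,
    },
};

/* Invoke the single-instance startup callback of a step, if it has one. */
static int run_step(size_t idx, unsigned int cpu)
{
    struct step *s = &steps[idx];

    return s->startup.single ? s->startup.single(cpu) : 0;
}
```

With the union named, each initializer spells out `.startup.single`, which avoids relying on a compiler's handling of designated initializers for anonymous union members.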
     

03 Sep, 2016

3 commits

  • When cpu_hotplug_enable() is called unbalanced w/o a preceding
    cpu_hotplug_disable() the code emits a warning, but happily decrements the
    disabled counter. This causes the next operations to malfunction.

    Prevent the decrement and just emit a warning.

    Signed-off-by: Lianwei Wang
    Cc: peterz@infradead.org
    Cc: linux-pm@vger.kernel.org
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1465541008-12476-1-git-send-email-lianwei.wang@gmail.com
    Signed-off-by: Thomas Gleixner

    Lianwei Wang
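
The behaviour described above (warn on an unbalanced enable, but leave the counter alone) can be sketched in plain userspace C. The names `hotplug_disabled`, `hotplug_disable()` and `hotplug_enable()` are hypothetical stand-ins and the sketch ignores locking; it is not the kernel code.

```c
#include <stdio.h>

/* Nesting depth of outstanding disable calls (hypothetical stand-in). */
static int hotplug_disabled;

static void hotplug_disable(void)
{
    hotplug_disabled++;
}

/* Returns 0 on success, -1 if the call had no matching disable. */
static int hotplug_enable(void)
{
    if (hotplug_disabled <= 0) {
        /* Unbalanced call: warn, but do not decrement the counter. */
        fprintf(stderr, "warning: unbalanced hotplug enable\n");
        return -1;
    }
    hotplug_disabled--;
    return 0;
}
```

Because the counter never goes negative, a later disable/enable pair still behaves as expected after a stray enable.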
     
  • This patch adds the ability for a given state to have multiple
    instances. Until now all states have a single instance and the startup /
    teardown callback use global variables.
    A few drivers need to perform the same callbacks on multiple
    "instances". Currently we have three drivers in tree which all have a
    global list which they iterate over. With multi instance support they
    don't need their private list and the functionality has been moved into
    core code. Plus we hold the hotplug lock in core so no CPUs come/go
    while instances are registered and we do rollback in error case :)

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/1471024183-12666-3-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
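
A minimal userspace sketch of the multi-instance idea follows, assuming a fixed-size instance array instead of a list and no locking. `struct multi_state`, `state_add_instance()`, `state_bringup()` and the demo callbacks are hypothetical names, not the kernel API; the point is only that the core iterates over the instances and rolls back on error.

```c
#include <stddef.h>

#define MAX_INSTANCES 8

typedef int (*instance_cb)(unsigned int cpu, void *instance);

/* One state with shared callbacks and many registered instances. */
struct multi_state {
    instance_cb startup;
    instance_cb teardown;
    void *instances[MAX_INSTANCES];
    size_t count;
};

static int state_add_instance(struct multi_state *st, void *inst)
{
    if (st->count >= MAX_INSTANCES)
        return -1;
    st->instances[st->count++] = inst;
    return 0;
}

/*
 * Run the startup callback for every instance. If one fails, tear down
 * the instances that were already started, in reverse order.
 */
static int state_bringup(struct multi_state *st, unsigned int cpu)
{
    for (size_t i = 0; i < st->count; i++) {
        int ret = st->startup(cpu, st->instances[i]);

        if (ret) {
            while (i-- > 0)
                st->teardown(cpu, st->instances[i]);
            return ret;
        }
    }
    return 0;
}

/* Demo callbacks: an instance is an int counter; negative values fail. */
static int demo_up(unsigned int cpu, void *instance)
{
    int *v = instance;

    (void)cpu;
    if (*v < 0)
        return -1;
    (*v)++;
    return 0;
}

static int demo_down(unsigned int cpu, void *instance)
{
    int *v = instance;

    (void)cpu;
    (*v)--;
    return 0;
}
```

Keeping the iteration and the rollback in one place is what lets drivers drop their private lists.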
     
  • This is preparation for the following patch.
    This rework here changes the arguments of cpuhp_invoke_callback(). It
    passes now `state' and whether `startup' or `teardown' callback should
    be invoked. The callback then is looked up by the function.

    The following is a cleanup of callers:
    - cpuhp_issue_call() has one argument less
    - struct cpuhp_cpu_state (which is used by the hotplug thread) gets also
    its callback removed. The decision if it is a single callback
    invocation moved to the `single' variable. Also a `bringup' variable
    has been added to distinguish between startup and teardown callback.
    - take_cpu_down() needs to start one step earlier. We always get here
    via CPUHP_TEARDOWN_CPU callback. Before that change cpuhp_ap_states +
    CPUHP_TEARDOWN_CPU pointed to an empty entry because TEARDOWN is saved
    in bp_states for this reason. Now that we use cpuhp_get_step() to
    lookup the state we must explicitly skip it in order not to invoke it
    twice.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/1471024183-12666-2-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
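
The reworked calling convention (pass the state plus a bringup/teardown flag and let the function look up the callback) can be sketched as below. This is a userspace illustration under assumed names: `struct step`, the `STATE_*` values, `invoke_callback()` and the demo callbacks are hypothetical and much simpler than the kernel's cpuhp_invoke_callback().

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*step_cb)(unsigned int cpu);

struct step {
    step_cb startup;
    step_cb teardown;
};

enum { STATE_OFFLINE, STATE_DEMO, STATE_MAX };

/* Records the last callback that ran, so the effect is observable. */
static int log_value;

static int demo_up(unsigned int cpu)
{
    log_value = (int)cpu + 100;
    return 0;
}

static int demo_down(unsigned int cpu)
{
    log_value = -((int)cpu + 100);
    return 0;
}

static struct step steps[STATE_MAX] = {
    [STATE_DEMO] = { demo_up, demo_down },
};

/*
 * The caller passes only the state and the direction; the callback is
 * looked up here. States without a callback are treated as a no-op.
 */
static int invoke_callback(unsigned int cpu, int state, bool bringup)
{
    if (state < 0 || state >= STATE_MAX)
        return -1;

    step_cb cb = bringup ? steps[state].startup : steps[state].teardown;

    return cb ? cb(cpu) : 0;
}
```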
     

02 Sep, 2016

1 commit


29 Aug, 2016

3 commits

  • Pull perf fixes from Thomas Gleixner:
    "A few fixes from the perf department

    - prevent an imbalanced preemption disable in the events teardown code
    - prevent out of bound access in perf userspace
    - make perf tools compile with UCLIBC again
    - a fix for the userspace unwinder utility"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Use this_cpu_ptr() when stopping AUX events
    perf evsel: Do not access outside hw cache name arrays
    tools lib: Reinstate strlcpy() header guard with __UCLIBC__
    perf unwind: Use addr_location::addr instead of ip for entries

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "This lot provides:

    - plug a hotplug race in the new affinity infrastructure
    - a fix for the trigger type of chained interrupts
    - plug a potential memory leak in the core code
    - a few fixes for ARM and MIPS GICs"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mips-gic: Implement activate op for device domain
    irqchip/mips-gic: Cleanup chip and handler setup
    genirq/affinity: Use get/put_online_cpus around cpumask operations
    genirq: Fix potential memleak when failing to get irq pm
    irqchip/gicv3-its: Disable the ITS before initializing it
    irqchip/gicv3: Remove disabling redistributor and group1 non-secure interrupts
    irqchip/gic: Allow self-SGIs for SMP on UP configurations
    genirq: Correctly configure the trigger on chained interrupts

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A few updates for timers & co:

    - prevent a livelock in the timekeeping code when debugging is
    enabled

    - prevent out of bounds access in the timekeeping debug code

    - various fixes in clocksource drivers

    - a new maintainers entry"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function
    drivers/clocksource/pistachio: Fix memory corruption in init
    clocksource/drivers/timer-atmel-pit: Enable mck clock
    clocksource/drivers/pxa: Fix include files for compilation
    MAINTAINERS: Add ARM ARCHITECTED TIMER entry
    timekeeping: Cap array access in timekeeping_debug
    timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING

    Linus Torvalds
     

27 Aug, 2016

4 commits

  • Merge fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton:
    mm: silently skip readahead for DAX inodes
    dax: fix device-dax region base
    fs/seq_file: fix out-of-bounds read
    mm: memcontrol: avoid unused function warning
    mm: clarify COMPACTION Kconfig text
    treewide: replace config_enabled() with IS_ENABLED() (2nd round)
    printk: fix parsing of "brl=" option
    soft_dirty: fix soft_dirty during THP split
    sysctl: handle error writing UINT_MAX to u32 fields
    get_maintainer: quiet noisy implicit -f vcs_file_exists checking
    byteswap: don't use __builtin_bswap*() with sparse

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Here's a set of block fixes for the current 4.8-rc release. This
    contains:

    - a fix for a secure erase regression, from Adrian.

    - a fix for an mmc use-after-free bug regression, also from Adrian.

    - potential NULL pointer dereference in bdev freezing, from Andrey.

    - a race fix for blk_set_queue_dying() from Bart.

    - a set of xen blkfront fixes from Bob Liu.

    - three small fixes for bcache, from Eric and Kent.

    - a fix for a potential invalid NVMe state transition, from Gabriel.

    - blk-mq CPU offline fix, preventing us from issuing and completing a
    request on the wrong queue. From me.

    - revert two previous floppy changes, since they caused a user
    visible regression. A better fix is in the works.

    - ensure that we don't send down bios that have more than 256
    elements in them. Fixes a crash with bcache, for example. From
    Ming.

    - a fix for dereferencing an error pointer with cgroup writeback.
    Fixes a regression. From Vegard"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mmc: fix use-after-free of struct request
    Revert "floppy: refactor open() flags handling"
    Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
    fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
    blk-mq: improve warning for running a queue on the wrong CPU
    blk-mq: don't overwrite rq->mq_ctx
    block: make sure a big bio is split into at most 256 bvecs
    nvme: Fix nvme_get/set_features() with a NULL result pointer
    bdev: fix NULL pointer dereference
    xen-blkfront: free resources if xlvbd_alloc_gendisk fails
    xen-blkfront: introduce blkif_set_queue_limits()
    xen-blkfront: fix places not updated after introducing 64KB page granularity
    bcache: pr_err: more meaningful error message when nr_stripes is invalid
    bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
    bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
    block: Fix race triggered by blk_set_queue_dying()
    block: Fix secure erase
    nvme: Prevent controller state invalid transition

    Linus Torvalds
     
  • Commit bbeddf52adc1 ("printk: move braille console support into separate
    braille.[ch] files") moved the parsing of braille-related options into
    _braille_console_setup(), changing the type of variable str from char*
    to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
    to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).

    Update the code to make "brl=" option work again and replace memcmp()
    with strncmp() to make the compiler able to detect such an issue.

    Fixes: bbeddf52adc1 ("printk: move braille console support into separate braille.[ch] files")
    Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
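
The difference between the two calls can be sketched in userspace C. `parse_brl_prefix()` and its return codes are hypothetical, not the kernel's _braille_console_setup(). The relevant point is that memcmp() takes `const void *` and therefore accepts a `char **` silently, while strncmp() expects `const char *`, so passing `str` instead of `*str` draws a compiler diagnostic.

```c
#include <string.h>

/*
 * Returns 1 for a "brl," prefix, 2 for a "brl=" prefix and 0 otherwise.
 * On a match, *str is advanced past the four-character prefix.
 */
static int parse_brl_prefix(char **str)
{
    /* strncmp(str, ...) would be flagged: char ** is not const char *. */
    if (!strncmp(*str, "brl,", 4)) {
        *str += 4;
        return 1;
    }
    if (!strncmp(*str, "brl=", 4)) {
        *str += 4;
        return 2;
    }
    return 0;
}
```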
     
  • We have scripts which write to certain fields on 3.18 kernels but this
    seems to be failing on 4.4 kernels. An entry which we write to here is
    xfrm_aevent_rseqth which is u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
    {
    i = (unsigned long *) data; // this cast causes a read beyond the size of data (u32)
    vleft = table->maxlen / sizeof(unsigned long); // vleft is 0 because maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
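
As a rough userspace illustration of what a handler for unsigned 32-bit fields has to do, the sketch below parses a decimal string and accepts the full range up to UINT_MAX. `parse_u32()` is a hypothetical helper, not the kernel's proc_douintvec, and returning -EINVAL for every failure is an assumption made for brevity.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal string into a u32. Values up to 4294967295 are
 * accepted; anything larger, negative or malformed is rejected.
 */
static int parse_u32(const char *s, uint32_t *out)
{
    char *end;
    unsigned long long v;

    /* Require a leading digit so signs and whitespace are rejected. */
    if (!isdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno || *end != '\0')
        return -EINVAL;
    if (v > UINT32_MAX)
        return -EINVAL;

    *out = (uint32_t)v;
    return 0;
}
```

Parsing into a wider type first and range-checking afterwards keeps the stored field at its real size, which avoids the problem of reading the u32 through an `unsigned long *`.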
     

24 Aug, 2016

3 commits

  • When tearing down an AUX buf for an event via perf_mmap_close(),
    __perf_event_output_stop() is called on the event's CPU to ensure that
    trace generation is halted before the process of unmapping and
    freeing the buffer pages begins.

    The callback is performed via cpu_function_call(), which ensures that it
    runs with interrupts disabled and is therefore not preemptible.
    Unfortunately, the current code grabs the per-cpu context pointer using
    get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
    the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
    a BUG when freeing the AUX buffer later on:

    WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
    Modules linked in:
    [...]
    Call Trace:
    dump_stack+0x4f/0x72
    __warn+0xc6/0xe0
    warn_slowpath_null+0x18/0x20
    __rb_free_aux+0x10c/0x120
    rb_free_aux+0x13/0x20
    perf_mmap_close+0x29e/0x2f0
    ? perf_iterate_ctx+0xe0/0xe0
    remove_vma+0x25/0x60
    exit_mmap+0x106/0x140
    mmput+0x1c/0xd0
    do_exit+0x253/0xbf0
    do_group_exit+0x3e/0xb0
    get_signal+0x249/0x640
    do_signal+0x23/0x640
    ? _raw_write_unlock_irq+0x12/0x30
    ? _raw_spin_unlock_irq+0x9/0x10
    ? __schedule+0x2c6/0x710
    exit_to_usermode_loop+0x74/0x90
    prepare_exit_to_usermode+0x26/0x30
    retint_user+0x8/0x10

    This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
    already disabled by the caller.

    Signed-off-by: Will Deacon
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 95ff4ca26c49 ("perf/core: Free AUX pages in unmap path")
    Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
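
The imbalance can be modelled in userspace with a plain counter. Everything here is a stand-in: `preempt_count` is an ordinary int, and `get_ctx_ptr()`, `put_ctx_ptr()` and `this_ctx_ptr()` only mimic the documented pairing rules of get_cpu_ptr(), put_cpu_ptr() and this_cpu_ptr(); none of it is kernel code.

```c
/* Stand-ins for the preemption counter and a per-CPU context. */
static int preempt_count;
static int percpu_ctx;

/* Like get_cpu_ptr(): disables preemption, must be paired with a put. */
static int *get_ctx_ptr(void)
{
    preempt_count++;
    return &percpu_ctx;
}

/* Like put_cpu_ptr(): re-enables preemption. */
static void put_ctx_ptr(void)
{
    preempt_count--;
}

/* Like this_cpu_ptr(): no side effect on the preemption counter. */
static int *this_ctx_ptr(void)
{
    return &percpu_ctx;
}

/* Buggy variant: takes the pointer with get and never puts it back. */
static void stop_buggy(void)
{
    int *ctx = get_ctx_ptr();

    *ctx = 0;
}

/* Fixed variant: the caller already runs non-preemptibly. */
static void stop_fixed(void)
{
    int *ctx = this_ctx_ptr();

    *ctx = 0;
}
```

After `stop_buggy()` the counter stays elevated until something calls `put_ctx_ptr()`, which is the kind of imbalance the warning above reports; `stop_fixed()` leaves it untouched.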
     
  • It was reported that hibernation could fail on the 2nd attempt, where the
    system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
    claim_dma_lock(), because the lock has already been taken.

    However, no other process actually tries to grab this lock on
    that problematic platform.

    Further investigation showed that the problem is triggered by setting
    /sys/power/pm_trace to 1 before the 1st hibernation.

    Once pm_trace is enabled, the RTC becomes meaningless after suspend.
    Meanwhile, some BIOSes adjust an 'invalid' RTC (e.g., one earlier than
    1970) to the release date of the motherboard during the POST stage, so
    after resume the system may appear to have slept for a very long time,
    which is a completely meaningless value.

    Then in timekeeping_resume -> tk_debug_account_sleep_time, if bit 31 of the
    sleep time happens to be set to 1, fls() returns 32 and we add 1 to
    sleep_time_bin[32], which causes an out of bounds array access and therefore
    memory being overwritten.

    As depicted by System.map:
    0xffffffff81c9d080 b sleep_time_bin
    0xffffffff81c9d100 B dma_spin_lock
    the dma_spin_lock.val is set to 1, which caused this problem.

    This patch adds a sanity check in tk_debug_account_sleep_time()
    to ensure we don't index past the sleep_time_bin array.

    [jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
    issue slightly differently, but borrowed his excellent explanation of the
    issue here.]

    Fixes: 5c83545f24ab "power: Add option to log time spent in suspend"
    Reported-by: Janek Kozicki
    Reported-by: Chen Yu
    Signed-off-by: John Stultz
    Cc: linux-pm@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Xunlei Pang
    Cc: "Rafael J. Wysocki"
    Cc: stable
    Cc: Zhang Rui
    Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
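
The out-of-bounds index and the sanity check can be sketched in userspace. The bin count of 32 is inferred from the System.map excerpt above (0x80 bytes between the two symbols, assuming 4-byte counters); `fls32()`, `sleep_bins` and `account_sleep()` are hypothetical names, and the clamp shown is one possible form of the check rather than a copy of the patch.

```c
#include <stdint.h>

#define NUM_BINS 32

static unsigned int sleep_bins[NUM_BINS];

/* Position of the highest set bit, 1-based; 0 when no bit is set. */
static int fls32(uint32_t v)
{
    int n = 0;

    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Account one sleep of the given length and return the bin used. */
static int account_sleep(uint32_t seconds)
{
    int bin = fls32(seconds);

    /* With bit 31 set fls32() returns 32, one past the last valid index. */
    if (bin > NUM_BINS - 1)
        bin = NUM_BINS - 1;

    sleep_bins[bin]++;
    return bin;
}
```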
     
  • When I added some extra sanity checking in timekeeping_get_ns() under
    CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
    method was using timekeeping_get_ns().

    Thus the locking added to the debug checks broke the NMI-safety of
    __ktime_get_fast_ns().

    This patch open-codes the timekeeping_get_ns() logic for
    __ktime_get_fast_ns(), so it can avoid any deadlocks in NMI.

    Fixes: 4ca22c2648f9 "timekeeping: Add warnings when overflows or underflows are observed"
    Reported-by: Steven Rostedt
    Reported-by: Peter Zijlstra
    Signed-off-by: John Stultz
    Cc: stable
    Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     

22 Aug, 2016

2 commits

  • Without locking out CPU mask operations we might end up with an inconsistent
    view of the cpumask in the function.

    Fixes: 5e385a6ef31f: "genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors"
    Signed-off-by: Christoph Hellwig
    Link: http://lkml.kernel.org/r/1470924405-25728-1-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • Obviously we should free action here if irq_chip_pm_get failed.

    Fixes: be45beb2df69: "genirq: Add runtime power management support for IRQ chips"
    Signed-off-by: Shawn Lin
    Cc: Jon Hunter
    Cc: Marc Zyngier
    Link: http://lkml.kernel.org/r/1471854112-13006-1-git-send-email-shawn.lin@rock-chips.com
    Signed-off-by: Thomas Gleixner

    Shawn Lin
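
The error path described above follows a common pattern: whatever was allocated before the failing call has to be released before returning. A userspace sketch, with hypothetical names (`struct action`, `chip_pm_get()`, `request_irq_sketch()`) standing in for the kernel functions:

```c
#include <stdlib.h>

struct action {
    int irq;
};

/* Number of live allocations, so leaks are observable. */
static int live_actions;

static struct action *action_alloc(int irq)
{
    struct action *a = malloc(sizeof(*a));

    if (a) {
        a->irq = irq;
        live_actions++;
    }
    return a;
}

static void action_free(struct action *a)
{
    free(a);
    live_actions--;
}

/* Stand-in for the power management call: fails for negative irqs. */
static int chip_pm_get(int irq)
{
    return irq < 0 ? -1 : 0;
}

static int request_irq_sketch(int irq, struct action **out)
{
    struct action *a = action_alloc(irq);
    int ret;

    if (!a)
        return -1;

    ret = chip_pm_get(irq);
    if (ret < 0) {
        /* Without this free the action would leak on failure. */
        action_free(a);
        return ret;
    }

    *out = a;
    return 0;
}
```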
     

19 Aug, 2016

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Two cputime fixes - hopefully the last ones"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Resync steal time when guest & host lose sync
    sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also start/stop filter related fixes, a perf
    event read() fix, a fix uncovered by fuzzing, and an uprobes leak fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Check return value of the perf_event_read() IPI
    perf/core: Enable mapping of the stop filters
    perf/core: Update filters only on executable mmap
    perf/core: Fix file name handling for start/stop filters
    perf/core: Fix event_function_local()
    uprobes: Fix the memcg accounting
    perf intel-pt: Fix occasional decoding errors when tracing system-wide
    tools: Sync kvm related header files for arm64 and s390
    perf probe: Release resources on error when handling exit paths
    perf probe: Check for dup and fdopen failures
    perf symbols: Fix annotation of objects with debuginfo files
    perf script: Don't disable use_callchain if input is pipe
    perf script: Show proper message when failed list scripts
    perf jitdump: Add the right header to get the major()/minor() definitions
    perf ppc64le: Fix build failure when libelf is not present
    perf tools mem: Fix -t store option for record command
    perf intel-pt: Fix ip compression

    Linus Torvalds
     
  • Pull power management fixes from Rafael Wysocki:
    "More hibernation-related material: one fix for a recent regression in
    the core, one small cleanup of the x86-64 resume code and a
    documentation update.

    Specifics:

    - Fix a hibernate core regression resulting from uncovering a latent
    bug in its implementation of memory bitmaps by a recent commit
    (James Morse).

    - Use __pa() to compute a physical address in the x86-64 code
    finalizing resume from hibernation (Rafael Wysocki).

    - Update power management documentation related to system sleep
    states to remove outdated information from it and to add a
    description of a recently introduced hibernation debug feature to
    it (Rafael Wysocki)"

    * tag 'pm-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
    x86/power/64: Use __pa() for physical address computation
    PM / sleep: Update some system sleep documentation

    Linus Torvalds
     

18 Aug, 2016

10 commits

  • Commit:

    57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

    ... fixed a bug but also triggered a regression:

    On an i5 laptop with 4 pCPUs and 4 vCPUs for one full dynticks guest, with four
    CPU hog processes (for loops) running in the guest, I hot-unplug the pCPUs
    on the host one by one until only one is left, then observe CPU utilization
    via 'top' in the guest. It shows:

    100% st for cpu0(housekeeping)
    75% st for other CPUs (nohz full mode)

    However, w/o this commit it shows the correct 75% for all four CPUs.

    When a guest is interrupted for a longer amount of time, missed clock ticks
    are not redelivered later. Because of that, we should not limit the amount
    of steal time accounted to the amount of time that the calling functions
    think have passed.

    However, the interval returned by account_other_time() is NOT rounded down
    to the nearest jiffy, while the base interval in get_vtime_delta() it is
    subtracted from is, so the max cputime limit is required to avoid underflow.

    This patch fixes the regression by limiting the account_other_time() from
    get_vtime_delta() to avoid underflow, and lets the other three call sites
    (in account_other_time() and steal_account_process_time()) account however
    much steal time the host told us elapsed.

    Suggested-by: Rik van Riel
    Suggested-by: Paolo Bonzini
    Signed-off-by: Wanpeng Li
    Reviewed-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: kvm@vger.kernel.org
    Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
    [ Improved the changelog. ]
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Mike reports:

    Roughly 10% of the time, ltp testcase getrusage04 fails:
    getrusage04 0 TINFO : Expected timers granularity is 4000 us
    getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
    getrusage04 0 TINFO : utime: 0us; stime: 179us
    getrusage04 0 TINFO : utime: 3751us; stime: 0us
    getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:

    And tracked it down to the case where the task simply doesn't get
    _any_ [us]time ticks.

    Update the code to assume all rtime is utime when we lack information,
    thus ensuring a task that elides the tick gets time accounted.

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Fredrik Markstrom
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wanpeng Li
    Cc: stable@vger.kernel.org # 4.3+
    Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
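
The rule described above can be sketched as a small splitting function. This is a simplified model, not the kernel's cputime adjustment: `struct split` and `adjust()` are hypothetical, and the sketch leaves out the monotonicity handling the real code also performs.

```c
#include <stdint.h>

struct split {
    uint64_t utime;
    uint64_t stime;
};

/*
 * Split the precisely measured runtime (rtime) between user and system
 * time using tick-based samples as the ratio. Simplified: the product
 * below can overflow for very large inputs.
 */
static struct split adjust(uint64_t utime_ticks, uint64_t stime_ticks,
                           uint64_t rtime)
{
    struct split r;

    if (stime_ticks == 0) {
        /* No system ticks, or no ticks at all: account everything as utime. */
        r.utime = rtime;
        r.stime = 0;
    } else if (utime_ticks == 0) {
        r.utime = 0;
        r.stime = rtime;
    } else {
        r.stime = rtime * stime_ticks / (utime_ticks + stime_ticks);
        r.utime = rtime - r.stime;
    }
    return r;
}
```

In every branch `utime + stime` equals `rtime`, and a task that never saw a tick has its whole runtime accounted as user time.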
     
  • The call to smp_call_function_single() in perf_event_read() may fail if
    an invalid or offline CPU index is passed. Warn the user if such a bug is
    present and return an error.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • At this time the perf_addr_filter_needs_mmap() function will _not_
    return true on a user space 'stop' filter. But stop filters need
    exactly the same kind of mapping that range and start filters get.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Function perf_event_mmap() is called by the MM subsystem each time
    part of a binary is loaded in memory. There can be several mappings
    for a binary, many times unrelated to the code section.

    Each time a section of a binary is mapped address filters are
    updated, even when the map doesn't pertain to the code section.
    The end result is that filters are configured based on the last map
    event that was received rather than the last mapping of the code
    segment.

    For example, suppose we have an executable 'main' that calls library
    'libcstest.so.1.0', and we want to collect traces on code
    that is in that library. The perf command line for this scenario
    would be:

    perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main

    Resulting in binaries being mapped this way:

    root@linaro-nano:~# cat /proc/1950/maps
    00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
    00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
    00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
    7fa2464000-7fa2474000 rw-p 00000000 00:00 0
    7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
    7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
    7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
    7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
    7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
    7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
    root@linaro-nano:~#

    Before 'main()' can execute, 'libcstest.so.1.0' has to be loaded in
    memory. Once that has been done perf_event_mmap() has been called
    4 times, with the last map starting at address 0x7fa25ce000 and
    the address filter configured to start filtering when the
    IP has passed over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c).

    But that is wrong since the code segment for library 'libcstest.so.1.0'
    has been mapped at 0x7fa25bd000, resulting in traces not being
    collected.

    This patch corrects the situation by requesting that address
    filters be updated only if the mapped event is for a code
    segment.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
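
The address arithmetic from the example above can be replayed in a short userspace sketch. `struct mapping` and `filter_on_mmap()` are hypothetical; the sketch only assumes that an update is skipped unless the new mapping is executable, which is the behaviour the commit describes.

```c
#include <stdbool.h>
#include <stdint.h>

struct mapping {
    uint64_t start;   /* start address of the mapping */
    bool exec;        /* true for an executable (code) mapping */
};

/*
 * Return the filter start address after seeing mapping m. Mappings that
 * are not executable leave the current filter address unchanged.
 */
static uint64_t filter_on_mmap(uint64_t current, const struct mapping *m,
                               uint64_t offset)
{
    if (!m->exec)
        return current;
    return m->start + offset;
}
```

Feeding the four libcstest.so.1.0 mappings from the listing through this function in order leaves the filter at 0x7fa25bd000 + 0x72c = 0x7fa25bd72c, rather than at an address derived from the last, non-executable mapping.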
     
  • Binary file names have to be supplied for both range and start/stop
    filters but the current code only processes the filename if an
    address range filter is specified. This code adds processing of
    the filename for start/stop filters.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Vincent reported triggering the WARN_ON_ONCE() in event_function_local().

    While thinking through cases I noticed that by using event_function()
    directly, we miss the inactive case usually handled by
    event_function_call().

    Therefore construct a blend of event_function_call() and
    event_function() that handles the cases relevant to
    event_function_local().

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 4.5+
    Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • __replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success" path;
    it should only do this if page_check_address() fails.

    This means that every enable/disable leads to an unbalanced mem_cgroup_uncharge()
    from put_page(old_page), so it is trivial to underflow the page_counter->count
    and trigger OOM.

    Reported-and-tested-by: Brenden Blanco
    Signed-off-by: Oleg Nesterov
    Reviewed-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: stable@vger.kernel.org # 3.17+
    Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
    Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
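
The accounting imbalance described in the commit above can be illustrated with a toy counter. This is only a sketch: `struct counter`, `charge()`, `uncharge()`, `replace_page_fixed()` and `replace_page_buggy()` are hypothetical stand-ins, not the kernel's memcg API, and the `lookup_ok` flag merely models whether page_check_address() succeeds.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page counter; not the kernel's struct. */
struct counter {
    long count;
};

static void charge(struct counter *c)   { c->count++; }
static void uncharge(struct counter *c) { c->count--; }

/*
 * Balanced version: the provisional charge for the new page is cancelled
 * only when the lookup fails. On success the new page keeps its charge and
 * releasing the old page drops exactly one charge.
 */
static int replace_page_fixed(struct counter *c, bool lookup_ok)
{
    charge(c);                  /* provisional charge for the new page */
    if (!lookup_ok) {
        uncharge(c);            /* cancel: the new page was never installed */
        return -1;
    }
    uncharge(c);                /* models put_page(old_page) uncharging */
    return 0;
}

/*
 * Unbalanced version: the cancel sits on a shared exit path, so it also
 * runs after a successful replacement and one charge too many is dropped.
 */
static int replace_page_buggy(struct counter *c, bool lookup_ok)
{
    int err = -1;

    charge(c);                  /* provisional charge for the new page */
    if (lookup_ok) {
        uncharge(c);            /* models put_page(old_page) uncharging */
        err = 0;
    }
    uncharge(c);                /* cancel runs on success as well */
    return err;
}
```

Starting from a count of 1 (the old page's charge), the balanced version leaves the count at 1 on both paths, while each successful call to the unbalanced version lowers it by one, so two calls are enough to take it below zero.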
     
  • * pm-sleep:
    PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
    x86/power/64: Use __pa() for physical address computation
    PM / sleep: Update some system sleep documentation

    Rafael J. Wysocki
     
  • Pull networking fixes from David Miller:

    1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output that can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memory leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     

17 Aug, 2016

1 commit

  • Commit 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ")
    moved the trigger configuration call from the irqdomain mapping to
    the interrupt being actually requested.

    This patch failed to handle the case where we configure a chained
    interrupt, which doesn't get requested through the usual path.

    In order to solve this, let's call __irq_set_trigger just before
    starting the cascade interrupt. Special care must be taken to
    make the flow handler stick, as the .irq_set_type method could
    have reset it (it doesn't know we're dealing with a chained
    interrupt).

    Based on an initial patch by Jon Hunter.

    Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ")
    Reported-by: John Stultz
    Reported-by: Linus Walleij
    Tested-by: John Stultz
    Acked-by: Jon Hunter
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

16 Aug, 2016

2 commits

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
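
The kind of fix-up the commit above describes can be sketched as follows: once one operation type has been split in two, any check that used to mean "either" has to test both values. The enum and the helper name `op_is_discard_like()` are hypothetical, chosen for illustration only; they are not the block layer's definitions.

```c
#include <stdbool.h>

/* Hypothetical operation types; the real REQ_OP_* values live in the kernel. */
enum req_op {
    OP_READ,
    OP_WRITE,
    OP_DISCARD,
    OP_SECURE_ERASE,
};

/*
 * Before the split, a test for OP_DISCARD alone covered secure erase too.
 * After the split, call sites that mean "either" must check both values.
 */
static bool op_is_discard_like(enum req_op op)
{
    return op == OP_DISCARD || op == OP_SECURE_ERASE;
}
```

Centralising the check in one helper is a design choice of this sketch; it keeps call sites from drifting apart if another related operation type is added later.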
     
  • rtree_next_node() walks the linked list of leaf nodes to find the next
    block of pages in the struct memory_bitmap. If it walks off the end of
    the list of nodes, it walks the list of memory zones to find the next
    region of memory. If it walks off the end of the list of zones, it
    returns false.

    This leaves the struct bm_position's node and zone pointers pointing
    at their respective struct list_heads in struct mem_zone_bm_rtree.

    memory_bm_find_bit() uses struct bm_position's node and zone pointers
    to avoid walking lists and trees if the next bit appears in the same
    node/zone. It handles these values being stale.

    Swap rtree_next_node()'s 'step then test' to 'test-next then step';
    this means that if we reach the end of memory we return false and leave
    the node and zone pointers as they were.

    This fixes a panic on resume using AMD Seattle with 64K pages:
    [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
    [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
    [ 6.896453] PM: Using 3 thread(s) for decompression.
    [ 6.896453] PM: Loading and decompressing image data (5339 pages)...
    [ 7.318890] PM: Image loading progress: 0%
    [ 7.323395] Unable to handle kernel paging request at virtual address 00800040
    [ 7.330611] pgd = ffff000008df0000
    [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
    [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
    [ 7.349825] Modules linked in:
    [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
    [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
    [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
    [ 7.375020] PC is at set_bit+0x18/0x30
    [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
    [ 7.383362] pc : [] lr : [] pstate: 60000045
    [ 7.390743] sp : ffff8003c0283b00
    [ 7.473551]
    [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
    [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
    [ 7.800075] Call trace:
    [ 7.887097] [] set_bit+0x18/0x30
    [ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70
    [ 7.899172] [] snapshot_write_next+0x22c/0x47c
    [ 7.905166] [] load_image_lzo+0x754/0xa88
    [ 7.910725] [] swsusp_read+0x144/0x230
    [ 7.916025] [] load_image_and_restore+0x58/0x90
    [ 7.922105] [] software_resume+0x2f0/0x338
    [ 7.927752] [] do_one_initcall+0x38/0x11c
    [ 7.933314] [] kernel_init_freeable+0x14c/0x1ec
    [ 7.939395] [] kernel_init+0x10/0xfc
    [ 7.944520] [] ret_from_fork+0x10/0x40
    [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
    [ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
    [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
    struct rtree_node's addr as the node/zone pointers are corrupt after
    we walked off the end of the lists during mark_unsafe_pages().

    This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
    Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
    duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
    off the end of the memory bitmap.

    Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
    Signed-off-by: James Morse
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki

    James Morse
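
The difference between 'step then test' and 'test-next then step' in the commit above can be shown on a small circular list with a sentinel head. This is a simplified sketch, not the hibernation code: `struct node`, `struct cursor` and both iterator functions are hypothetical, and the sentinel stands in for the struct list_head embedded in the containing structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical circular list; the sentinel head carries no real payload. */
struct node {
    struct node *next;
    int value;
};

struct cursor {
    struct node *head;  /* sentinel, not a real element */
    struct node *cur;   /* callers assume this points at a real element */
};

/*
 * 'Step then test': the cursor is advanced first, so when the end is
 * reached it is left pointing at the sentinel, which later users would
 * misinterpret as a real element.
 */
static bool next_step_then_test(struct cursor *c)
{
    c->cur = c->cur->next;
    return c->cur != c->head;
}

/*
 * 'Test-next then step': the cursor only advances when the next entry is
 * a real element, so on failure it still points at the last valid one.
 */
static bool next_test_then_step(struct cursor *c)
{
    if (c->cur->next == c->head)
        return false;
    c->cur = c->cur->next;
    return true;
}
```

With a two-element list, both variants return true once and then false, but only the second leaves `cur` on a real element after the failing call.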
     

13 Aug, 2016

1 commit

  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to change this as long as v4.8
    is not yet officially released. Thus, change the helper enum and rename
    related bits.

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
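
The naming distinction Tejun draws in the commit above can be sketched on a plain parent-pointer tree. The type `struct cgroup_node` and both functions are hypothetical illustrations, not the BPF helpers or the cgroup API. The sketch also assumes that a cgroup counts as being "under" itself, which the quoted discussion does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hierarchy node; the root has a NULL parent. */
struct cgroup_node {
    struct cgroup_node *parent;
};

/* "in": the object's cgroup is exactly the given cgroup. */
static bool in_cgroup(const struct cgroup_node *obj_cg,
                      const struct cgroup_node *cg)
{
    return obj_cg == cg;
}

/*
 * "under": the object's cgroup lies in the subhierarchy rooted at the
 * given cgroup. Assumption of this sketch: the root of the subhierarchy
 * itself is included.
 */
static bool under_cgroup(const struct cgroup_node *obj_cg,
                         const struct cgroup_node *cg)
{
    for (; obj_cg != NULL; obj_cg = obj_cg->parent) {
        if (obj_cg == cg)
            return true;
    }
    return false;
}
```

For a chain root, a, b (each the parent of the next), a task in b is under a but not in a, which is the case the rename is meant to make unambiguous.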