02 Sep, 2016

1 commit


01 Sep, 2016

2 commits

  • Prior to the change the function would blindly dereference mm, exe_file
    and exe_file->f_inode, each of which could have been NULL or freed.

    Use get_task_exe_file to safely obtain a stable exe_file.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     
  • For more convenient access if one has a pointer to the task.

    As a minor nit take advantage of the fact that only task lock + rcu are
    needed to safely grab ->exe_file. This saves mm refcount dance.

    Use the helper in proc_exe_link.

    Signed-off-by: Mateusz Guzik
    Acked-by: Konstantin Khlebnikov
    Acked-by: Richard Guy Briggs
    Cc: # 4.3.x
    Signed-off-by: Paul Moore

    Mateusz Guzik
     

31 Aug, 2016

3 commits

  • Pull seccomp fix from Kees Cook:
    "Fix fatal signal delivery after ptrace reordering"

    * tag 'seccomp-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    seccomp: Fix tracer exit notifications during fatal signals

    Linus Torvalds
     
  • This fixes a ptrace vs fatal pending signals bug as manifested in
    seccomp now that seccomp was reordered to happen after ptrace. The
    short version is that seccomp should not attempt to call do_exit()
    while fatal signals are pending under a tracer. The existing code was
    trying to be as defensively paranoid as possible, but it now ends up
    confusing ptrace. Instead, the syscall can just be skipped (which solves
    the original concern that the do_exit() was addressing) and normal signal
    handling, tracer notification, and process death can happen.

    Paraphrasing from the original bug report:

    If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed
    after such a trap but not yet been scheduled, and another task in the
    thread-group calls exit_group(), then the tracee task exits without the
    ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here:
    https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

    The bug happens because when __seccomp_filter() detects
    fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
    signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
    that task is descheduled, __schedule() notices that there is a fatal
    signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
    That prevents the ptracer's waitpid() from returning the ptrace event.
    A more detailed analysis is here:
    https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

    Reported-by: Robert O'Callahan
    Reported-by: Kyle Huey
    Tested-by: Kyle Huey
    Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
    Signed-off-by: Kees Cook
    Acked-by: Oleg Nesterov
    Acked-by: James Morris

    Kees Cook
     
  • Pull cgroup fixes from Tejun Heo:
    "Two fixes for cgroup.

    - There still was a hole in enforcing cpuset rules, fixed by Li.

    - The recent switch to global percpu_rwsem for threadgroup locking
    revealed a couple issues in how percpu_rwsem is implemented and
    used by cgroup. Balbir found that the read locking section was too
    wide unnecessarily including operations which can often depend on
    IOs. With percpu_rwsem updates (coming through a different tree)
    and reduction of read locking section, all the reported locking
    latency issues, including the android one, are resolved.

    It looks like we can keep global percpu_rwsem locking for now. If
    there actually are cases which can't be resolved, we can go back to
    more complex per-signal_struct locking"

    * 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
    cpuset: make sure new tasks conform to the current config of the cpuset

    Linus Torvalds
     

29 Aug, 2016

3 commits

  • Pull perf fixes from Thomas Gleixner:
    "A few fixes from the perf department

    - prevent an imbalanced preemption disable in the events teardown code
    - prevent out of bound access in perf userspace
    - make perf tools compile with UCLIBC again
    - a fix for the userspace unwinder utility"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Use this_cpu_ptr() when stopping AUX events
    perf evsel: Do not access outside hw cache name arrays
    tools lib: Reinstate strlcpy() header guard with __UCLIBC__
    perf unwind: Use addr_location::addr instead of ip for entries

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "This lot provides:

    - plug a hotplug race in the new affinity infrastructure
    - a fix for the trigger type of chained interrupts
    - plug a potential memory leak in the core code
    - a few fixes for ARM and MIPS GICs"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mips-gic: Implement activate op for device domain
    irqchip/mips-gic: Cleanup chip and handler setup
    genirq/affinity: Use get/put_online_cpus around cpumask operations
    genirq: Fix potential memleak when failing to get irq pm
    irqchip/gicv3-its: Disable the ITS before initializing it
    irqchip/gicv3: Remove disabling redistributor and group1 non-secure interrupts
    irqchip/gic: Allow self-SGIs for SMP on UP configurations
    genirq: Correctly configure the trigger on chained interrupts

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "A few updates for timers & co:

    - prevent a livelock in the timekeeping code when debugging is
    enabled

    - prevent out of bounds access in the timekeeping debug code

    - various fixes in clocksource drivers

    - a new maintainers entry"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function
    drivers/clocksource/pistachio: Fix memory corruption in init
    clocksource/drivers/timer-atmel-pit: Enable mck clock
    clocksource/drivers/pxa: Fix include files for compilation
    MAINTAINERS: Add ARM ARCHITECTED TIMER entry
    timekeeping: Cap array access in timekeeping_debug
    timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING

    Linus Torvalds
     

27 Aug, 2016

4 commits

  • Merge fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton:
    mm: silently skip readahead for DAX inodes
    dax: fix device-dax region base
    fs/seq_file: fix out-of-bounds read
    mm: memcontrol: avoid unused function warning
    mm: clarify COMPACTION Kconfig text
    treewide: replace config_enabled() with IS_ENABLED() (2nd round)
    printk: fix parsing of "brl=" option
    soft_dirty: fix soft_dirty during THP split
    sysctl: handle error writing UINT_MAX to u32 fields
    get_maintainer: quiet noisy implicit -f vcs_file_exists checking
    byteswap: don't use __builtin_bswap*() with sparse

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Here's a set of block fixes for the current 4.8-rc release. This
    contains:

    - a fix for a secure erase regression, from Adrian.

    - a fix for an mmc use-after-free bug regression, also from Adrian.

    - potential zero pointer dereference in bdev freezing, from Andrey.

    - a race fix for blk_set_queue_dying() from Bart.

    - a set of xen blkfront fixes from Bob Liu.

    - three small fixes for bcache, from Eric and Kent.

    - a fix for a potential invalid NVMe state transition, from Gabriel.

    - blk-mq CPU offline fix, preventing us from issuing and completing a
    request on the wrong queue. From me.

    - revert two previous floppy changes, since they caused a user
    visible regression. A better fix is in the works.

    - ensure that we don't send down bios that have more than 256
    elements in them. Fixes a crash with bcache, for example. From
    Ming.

    - a fix for dereferencing an error pointer with cgroup writeback.
    Fixes a regression. From Vegard"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mmc: fix use-after-free of struct request
    Revert "floppy: refactor open() flags handling"
    Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
    fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
    blk-mq: improve warning for running a queue on the wrong CPU
    blk-mq: don't overwrite rq->mq_ctx
    block: make sure a big bio is split into at most 256 bvecs
    nvme: Fix nvme_get/set_features() with a NULL result pointer
    bdev: fix NULL pointer dereference
    xen-blkfront: free resources if xlvbd_alloc_gendisk fails
    xen-blkfront: introduce blkif_set_queue_limits()
    xen-blkfront: fix places not updated after introducing 64KB page granularity
    bcache: pr_err: more meaningful error message when nr_stripes is invalid
    bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
    bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
    block: Fix race triggered by blk_set_queue_dying()
    block: Fix secure erase
    nvme: Prevent controller state invalid transition

    Linus Torvalds
     
  • Commit bbeddf52adc1 ("printk: move braille console support into separate
    braille.[ch] files") moved the parsing of braille-related options into
    _braille_console_setup(), changing the type of variable str from char*
    to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated
    to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).

    Update the code to make "brl=" option work again and replace memcmp()
    with strncmp() to make the compiler able to detect such an issue.

    Fixes: bbeddf52adc1 ("printk: move braille console support into separate braille.[ch] files")
    Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
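
The type-checking argument in the commit above can be illustrated with a small userspace sketch. This is not the kernel source: parse_brl() and its return codes are hypothetical names chosen for this example. It assumes only the standard C prototypes, where memcmp() takes const void * (so a char ** is accepted silently) while strncmp() takes const char * (so passing a char ** draws an incompatible-pointer diagnostic).

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for the option check: the caller passes a pointer
 * to the option string pointer, so the string itself is *str, not str.
 * Returns 1 for the "brl," form, 2 for the "brl=" form, 0 otherwise, and
 * advances *str past the prefix when one matches.
 */
int parse_brl(char **str)
{
	assert(str != NULL && *str != NULL);

	/*
	 * Writing strncmp(str, ...) here by mistake would not compile cleanly,
	 * because char ** does not convert to const char *. The equivalent
	 * memcmp(str, ...) mistake compiles and compares pointer bytes.
	 */
	if (strncmp(*str, "brl,", 4) == 0) {
		*str += 4;
		return 1;
	}
	if (strncmp(*str, "brl=", 4) == 0) {
		*str += 4;
		return 2;
	}
	return 0;
}
```

After a successful match the caller sees *str pointing at the text that follows the prefix, so "brl=ttyS0" leaves "ttyS0" to be parsed as the console description.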
     
  • We have scripts which write to certain fields on 3.18 kernels but this
    seems to be failing on 4.4 kernels. An entry which we write to here is
    xfrm_aevent_rseqth which is u32.

    echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

    Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
    to int") prevented writing to sysctl entries when integer overflow
    occurs. However, this does not apply to unsigned integers.

    Heinrich suggested that we introduce a new option to handle 64 bit
    limits and set min as 0 and max as UINT_MAX. This might not work as it
    leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
    we would need to change the datatype of the entry to 64 bit.

    static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
    {
    i = (unsigned long *) data; //This cast causes a read beyond the size of data (u32)
    vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64.

    Introduce a new proc handler proc_douintvec. Individual proc entries
    will need to be updated to use the new handler.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
    Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Subash Abhinov Kasiviswanathan
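
The difference between a signed and an unsigned range check, as described in the commit above, can be sketched in userspace C. This is an illustration only and not the kernel's proc_douintvec implementation: parse_uint_field() and parse_int_field() are hypothetical helpers, and the sketch assumes a 32-bit unsigned int as on common Linux targets.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical unsigned parser: accepts 0..UINT_MAX, rejects larger values. */
int parse_uint_field(const char *buf, unsigned int *out)
{
	char *end;
	unsigned long long val;

	assert(buf != NULL && out != NULL);
	if (*buf < '0' || *buf > '9')
		return -EINVAL;		/* no sign, no blanks, not empty */

	errno = 0;
	val = strtoull(buf, &end, 10);
	if (*end == '\n')
		end++;			/* tolerate the newline that echo appends */
	if (errno == ERANGE || *end != '\0' || val > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)val;
	return 0;
}

/* Hypothetical signed parser: anything outside INT_MIN..INT_MAX is an error. */
int parse_int_field(const char *buf, int *out)
{
	char *end;
	long long val;

	assert(buf != NULL && out != NULL);
	if (*buf != '-' && (*buf < '0' || *buf > '9'))
		return -EINVAL;

	errno = 0;
	val = strtoll(buf, &end, 10);
	if (*end == '\n')
		end++;
	if (errno == ERANGE || end == buf || *end != '\0' ||
	    val > INT_MAX || val < INT_MIN)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

With these helpers the string "4294967295" is accepted by the unsigned parser and refused by the signed one, which mirrors the failing echo shown in the commit message.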
     

24 Aug, 2016

3 commits

  • When tearing down an AUX buf for an event via perf_mmap_close(),
    __perf_event_output_stop() is called on the event's CPU to ensure that
    trace generation is halted before the process of unmapping and
    freeing the buffer pages begins.

    The callback is performed via cpu_function_call(), which ensures that it
    runs with interrupts disabled and is therefore not preemptible.
    Unfortunately, the current code grabs the per-cpu context pointer using
    get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
    the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
    a BUG when freeing the AUX buffer later on:

    WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
    Modules linked in:
    [...]
    Call Trace:
    dump_stack+0x4f/0x72
    __warn+0xc6/0xe0
    warn_slowpath_null+0x18/0x20
    __rb_free_aux+0x10c/0x120
    rb_free_aux+0x13/0x20
    perf_mmap_close+0x29e/0x2f0
    ? perf_iterate_ctx+0xe0/0xe0
    remove_vma+0x25/0x60
    exit_mmap+0x106/0x140
    mmput+0x1c/0xd0
    do_exit+0x253/0xbf0
    do_group_exit+0x3e/0xb0
    get_signal+0x249/0x640
    do_signal+0x23/0x640
    ? _raw_write_unlock_irq+0x12/0x30
    ? _raw_spin_unlock_irq+0x9/0x10
    ? __schedule+0x2c6/0x710
    exit_to_usermode_loop+0x74/0x90
    prepare_exit_to_usermode+0x26/0x30
    retint_user+0x8/0x10

    This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
    already disabled by the caller.

    Signed-off-by: Will Deacon
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 95ff4ca26c49 ("perf/core: Free AUX pages in unmap path")
    Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
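
The preempt_count() imbalance described in the commit above can be pictured with a toy userspace model. Nothing here is kernel code: the *_model names are hypothetical, and a plain integer stands in for the preemption counter that get_cpu_ptr() raises and put_cpu_ptr() lowers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: one counter stands in for the kernel's preempt_count(). */
static int preempt_count_model;
static int cpu_ctx_model;	/* stand-in for a per-CPU context */

/* Models get_cpu_ptr(): disables preemption, then hands out the pointer. */
static int *get_cpu_ptr_model(void)
{
	preempt_count_model++;
	return &cpu_ctx_model;
}

/* Models put_cpu_ptr(): re-enables preemption. */
static void put_cpu_ptr_model(void)
{
	assert(preempt_count_model > 0);
	preempt_count_model--;
}

/* Models this_cpu_ptr(): no change to the preemption counter. */
static int *this_cpu_ptr_model(void)
{
	return &cpu_ctx_model;
}

/* Each function returns how much it changed the counter. */
int stop_with_get_only(void)
{
	int before = preempt_count_model;
	int *ctx = get_cpu_ptr_model();	/* never paired with a put */

	*ctx = 0;
	return preempt_count_model - before;
}

int stop_with_get_put(void)
{
	int before = preempt_count_model;
	int *ctx = get_cpu_ptr_model();

	*ctx = 0;
	put_cpu_ptr_model();
	return preempt_count_model - before;
}

int stop_with_this_cpu(void)
{
	int before = preempt_count_model;
	int *ctx = this_cpu_ptr_model();

	*ctx = 0;
	return preempt_count_model - before;
}
```

In this model only the unpaired variant leaves the counter raised. The other two variants both leave it unchanged, and the commit picks the this_cpu_ptr() form because its caller already runs with preemption disabled.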
     
  • It was reported that hibernation could fail on the 2nd attempt, where the
    system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
    claim_dma_lock(), because the lock has already been taken.

    However, no other process actually tries to grab this lock on that
    problematic platform.

    Further investigation showed that the problem is triggered by setting
    /sys/power/pm_trace to 1 before the 1st hibernation.

    Once pm_trace is enabled, the RTC becomes meaningless after suspend.
    Meanwhile, some BIOSes adjust an 'invalid' RTC (e.g., earlier than 1970)
    to the release date of the motherboard during the POST stage. After
    resume, the system therefore appears to have slept for a very long time,
    which is a completely meaningless value.

    Then in timekeeping_resume -> tk_debug_account_sleep_time, if bit 31 of the
    sleep time happens to be set, fls() returns 32 and we add 1 to
    sleep_time_bin[32], which causes an out-of-bounds array access and therefore
    memory being overwritten.

    As depicted by System.map:
    0xffffffff81c9d080 b sleep_time_bin
    0xffffffff81c9d100 B dma_spin_lock
    the dma_spin_lock.val is set to 1, which caused this problem.

    This patch adds a sanity check in tk_debug_account_sleep_time()
    to ensure we don't index past the sleep_time_bin array.

    [jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
    issue slightly differently, but borrowed his excellent explanation of the
    issue here.]

    Fixes: 5c83545f24ab "power: Add option to log time spent in suspend"
    Reported-by: Janek Kozicki
    Reported-by: Chen Yu
    Signed-off-by: John Stultz
    Cc: linux-pm@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Xunlei Pang
    Cc: "Rafael J. Wysocki"
    Cc: stable
    Cc: Zhang Rui
    Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
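
The bounds problem in the commit above comes down to fls() being 1-based: a 32-bit value with bit 31 set yields 32, one past the last valid index of a 32-entry array. The following is a userspace sketch under stated assumptions, not the kernel patch itself. The array length of 32 is inferred from the System.map addresses quoted above (0x80 bytes before dma_spin_lock), fls_u32() is a portable stand-in for the kernel's fls(), and account_sleep_time() is a hypothetical name.

```c
#include <assert.h>

#define NUM_BINS 32

static unsigned int sleep_time_bin[NUM_BINS];

/* 1-based position of the highest set bit; 0 when x is 0. */
int fls_u32(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Record one sleep interval in its log2 bin and return the bin used. */
int account_sleep_time(unsigned int seconds)
{
	int bin = fls_u32(seconds);

	/* Cap the index so a value with bit 31 set cannot select bin 32. */
	if (bin > NUM_BINS - 1)
		bin = NUM_BINS - 1;

	assert(bin >= 0 && bin < NUM_BINS);
	sleep_time_bin[bin]++;
	return bin;
}

/* Read back a bin counter; out-of-range requests report 0. */
unsigned int sleep_bin_count(int bin)
{
	if (bin < 0 || bin >= NUM_BINS)
		return 0;
	return sleep_time_bin[bin];
}
```

Capping rather than rejecting keeps a bogus, very large sleep time visible in the last bin instead of corrupting whatever object follows the array.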
     
  • When I added some extra sanity checking in timekeeping_get_ns() under
    CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
    method was using timekeeping_get_ns().

    Thus the locking added to the debug checks broke the NMI-safety of
    __ktime_get_fast_ns().

    This patch open-codes the timekeeping_get_ns() logic for
    __ktime_get_fast_ns(), so we can avoid any deadlocks in NMI.

    Fixes: 4ca22c2648f9 "timekeeping: Add warnings when overflows or underflows are observed"
    Reported-by: Steven Rostedt
    Reported-by: Peter Zijlstra
    Signed-off-by: John Stultz
    Cc: stable
    Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     

22 Aug, 2016

2 commits

  • Without locking out CPU mask operations we might end up with an inconsistent
    view of the cpumask in the function.

    Fixes: 5e385a6ef31f: "genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors"
    Signed-off-by: Christoph Hellwig
    Link: http://lkml.kernel.org/r/1470924405-25728-1-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • Obviously we should free action here if irq_chip_pm_get failed.

    Fixes: be45beb2df69: "genirq: Add runtime power management support for IRQ chips"
    Signed-off-by: Shawn Lin
    Cc: Jon Hunter
    Cc: Marc Zyngier
    Link: http://lkml.kernel.org/r/1471854112-13006-1-git-send-email-shawn.lin@rock-chips.com
    Signed-off-by: Thomas Gleixner

    Shawn Lin
     

19 Aug, 2016

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Two cputime fixes - hopefully the last ones"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Resync steal time when guest & host lose sync
    sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, but also start/stop filter related fixes, a perf
    event read() fix, a fix uncovered by fuzzing, and an uprobes leak fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Check return value of the perf_event_read() IPI
    perf/core: Enable mapping of the stop filters
    perf/core: Update filters only on executable mmap
    perf/core: Fix file name handling for start/stop filters
    perf/core: Fix event_function_local()
    uprobes: Fix the memcg accounting
    perf intel-pt: Fix occasional decoding errors when tracing system-wide
    tools: Sync kvm related header files for arm64 and s390
    perf probe: Release resources on error when handling exit paths
    perf probe: Check for dup and fdopen failures
    perf symbols: Fix annotation of objects with debuginfo files
    perf script: Don't disable use_callchain if input is pipe
    perf script: Show proper message when failed list scripts
    perf jitdump: Add the right header to get the major()/minor() definitions
    perf ppc64le: Fix build failure when libelf is not present
    perf tools mem: Fix -t store option for record command
    perf intel-pt: Fix ip compression

    Linus Torvalds
     
  • Pull power management fixes from Rafael Wysocki:
    "More hibernation-related material: one fix for a recent regression in
    the core, one small cleanup of the x86-64 resume code and a
    documentation update.

    Specifics:

    - Fix a hibernate core regression resulting from uncovering a latent
    bug in its implementation of memory bitmaps by a recent commit
    (James Morse).

    - Use __pa() to compute a physical address in the x86-64 code
    finalizing resume from hibernation (Rafael Wysocki).

    - Update power management documentation related to system sleep
    states to remove outdated information from it and to add a
    description of a recently introduced hibernation debug feature to
    it (Rafael Wysocki)"

    * tag 'pm-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
    x86/power/64: Use __pa() for physical address computation
    PM / sleep: Update some system sleep documentation

    Linus Torvalds
     

18 Aug, 2016

10 commits

  • Commit:

    57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

    ... fixed a bug but also triggered a regression:

    On an i5 laptop with 4 pCPUs and 4 vCPUs for one full dynticks guest, four
    CPU hog processes (for loops) run in the guest. I hot-unplug the pCPUs
    on the host one by one until only one is left, then observe CPU utilization
    via 'top' in the guest. It shows:

    100% st for cpu0(housekeeping)
    75% st for other CPUs (nohz full mode)

    However, w/o this commit it shows the correct 75% for all four CPUs.

    When a guest is interrupted for a longer amount of time, missed clock ticks
    are not redelivered later. Because of that, we should not limit the amount
    of steal time accounted to the amount of time that the calling functions
    think have passed.

    However, the interval returned by account_other_time() is NOT rounded down
    to the nearest jiffy, while the base interval in get_vtime_delta() it is
    subtracted from is, so the max cputime limit is required to avoid underflow.

    This patch fixes the regression by limiting the account_other_time() from
    get_vtime_delta() to avoid underflow, and lets the other three call sites
    (in account_other_time() and steal_account_process_time()) account however
    much steal time the host told us elapsed.

    Suggested-by: Rik van Riel
    Suggested-by: Paolo Bonzini
    Signed-off-by: Wanpeng Li
    Reviewed-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: kvm@vger.kernel.org
    Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
    [ Improved the changelog. ]
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Mike reports:

    Roughly 10% of the time, ltp testcase getrusage04 fails:
    getrusage04 0 TINFO : Expected timers granularity is 4000 us
    getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
    getrusage04 0 TINFO : utime: 0us; stime: 179us
    getrusage04 0 TINFO : utime: 3751us; stime: 0us
    getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:

    And tracked it down to the case where the task simply doesn't get
    _any_ [us]time ticks.

    Update the code to assume all rtime is utime when we lack information,
    thus ensuring a task that elides the tick gets time accounted.

    Reported-by: Mike Galbraith
    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Fredrik Markstrom
    Cc: Linus Torvalds
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim
    Cc: Rik van Riel
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wanpeng Li
    Cc: stable@vger.kernel.org # 4.3+
    Fixes: 9d7fb0427648 ("sched/cputime: Guarantee stime + utime == rtime")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
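
The rule stated in the commit above, that all rtime is treated as utime when no tick information exists, can be sketched as follows. This is a simplified userspace illustration, not the kernel's cputime code: struct cputime_split and split_rtime() are hypothetical names, and the proportional step uses a plain multiply that a production version would need to make overflow-safe.

```c
#include <assert.h>
#include <stdint.h>

struct cputime_split {
	uint64_t utime;
	uint64_t stime;
};

/*
 * rtime is the precisely measured total runtime; utime_ticks and
 * stime_ticks are tick-sampled estimates that may both be zero when the
 * task never ran during a tick. The result always sums to rtime.
 */
struct cputime_split split_rtime(uint64_t rtime, uint64_t utime_ticks,
				 uint64_t stime_ticks)
{
	struct cputime_split out;

	if (stime_ticks == 0) {
		/* Covers the "no ticks at all" case: assume userspace time. */
		out.utime = rtime;
		out.stime = 0;
		return out;
	}
	if (utime_ticks == 0) {
		out.utime = 0;
		out.stime = rtime;
		return out;
	}

	/* Split rtime in proportion to the sampled ticks. */
	out.stime = rtime * stime_ticks / (stime_ticks + utime_ticks);
	out.utime = rtime - out.stime;
	assert(out.utime + out.stime == rtime);
	return out;
}
```

Handling the zero-tick case first means a task that always avoids the tick still has its runtime reported, attributed to utime.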
     
  • The call to smp_call_function_single in perf_event_read() may fail if
    an invalid or not online CPU index is passed. Warn the user if such a bug
    is present and return an error.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • At this time the perf_addr_filter_needs_mmap() function will _not_
    return true on a user space 'stop' filter. But stop filters need
    exactly the same kind of mapping that range and start filters get.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Function perf_event_mmap() is called by the MM subsystem each time
    part of a binary is loaded in memory. There can be several mappings
    for a binary, many times unrelated to the code section.

    Each time a section of a binary is mapped address filters are
    updated, even when the map doesn't pertain to the code section.
    The end result is that filters are configured based on the last map
    event that was received rather than the last mapping of the code
    segment.

    For example if we have an executable 'main' that calls library
    'libcstest.so.1.0', and that we want to collect traces on code
    that is in that library. The perf cmd line for this scenario
    would be:

    perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main

    Resulting in binaries being mapped this way:

    root@linaro-nano:~# cat /proc/1950/maps
    00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
    00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
    00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
    7fa2464000-7fa2474000 rw-p 00000000 00:00 0
    7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
    7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
    7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
    7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
    7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
    7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
    root@linaro-nano:~#

    Before 'main()' can execute 'libcstest.so.1.0' has to be loaded in
    memory. Once that has been done perf_event_mmap() has been called
    4 times, with the last map starting at address 0x7fa25ce000 and
    the address filter configured to start filtering when the
    IP has passed over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c).

    But that is wrong since the code segment for library 'libcstest.so.1.0'
    has been mapped at 0x7fa25bd000, resulting in traces not being
    collected.

    This patch corrects the situation by requesting that address
    filters be updated only if the mapped event is for a code
    segment.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
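
The effect of updating filters only on executable mappings, as described in the commit above, can be reproduced with the four 'libcstest.so.1.0' mappings from the /proc listing. This is a userspace sketch, not the perf core code: struct mapping, MAP_EXEC_MODEL and both resolve functions are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_EXEC_MODEL 0x4u	/* hypothetical "x" permission bit */

struct mapping {
	uint64_t start;		/* start address of the mapping */
	unsigned int prot;	/* permission bits of the mapping */
};

/* Old behaviour: every map event overwrites the filter base. */
uint64_t resolve_filter_any(const struct mapping *maps, size_t n,
			    uint64_t offset)
{
	uint64_t addr = 0;
	size_t i;

	assert(maps != NULL || n == 0);
	for (i = 0; i < n; i++)
		addr = maps[i].start + offset;
	return addr;
}

/* Fixed behaviour: only executable mappings update the filter base. */
uint64_t resolve_filter_exec(const struct mapping *maps, size_t n,
			     uint64_t offset)
{
	uint64_t addr = 0;
	size_t i;

	assert(maps != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!(maps[i].prot & MAP_EXEC_MODEL))
			continue;	/* data or guard mapping: ignore */
		addr = maps[i].start + offset;
	}
	return addr;
}
```

Fed the library's mappings in order (r-xp at 0x7fa25bd000, then the ---p, r--p and rw-p regions) with offset 0x72c, the first function ends at 0x7fa25ce72c as in the bug report, while the second stays at 0x7fa25bd72c inside the code segment.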
     
  • Binary file names have to be supplied for both range and start/stop
    filters but the current code only processes the filename if an
    address range filter is specified. This code adds processing of
    the filename for start/stop filters.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Vincent reported triggering the WARN_ON_ONCE() in event_function_local().

    While thinking through cases I noticed that by using event_function()
    directly, we miss the inactive case usually handled by
    event_function_call().

    Therefore construct a blend of event_function_call() and
    event_function() that handles the cases relevant to
    event_function_local().

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 4.5+
    Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • __replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success"
    path; it should only do this if page_check_address() fails.

    This means that every enable/disable leads to an unbalanced
    mem_cgroup_uncharge() from put_page(old_page), so it is trivial to
    underflow the page_counter->count and trigger OOM.

    Reported-and-tested-by: Brenden Blanco
    Signed-off-by: Oleg Nesterov
    Reviewed-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: stable@vger.kernel.org # 3.17+
    Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
    Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • * pm-sleep:
    PM / hibernate: Fix rtree_next_node() to avoid walking off list ends
    x86/power/64: Use __pa() for physical address computation
    PM / sleep: Update some system sleep documentation

    Rafael J. Wysocki
     
  • Pull networking fixes from David Miller:

    1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

    2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

    3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0. From
    Satish Baddipadige and Siva Reddy Kallam.

    4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

    5) Fix fib_trie logic when walking the trie during /proc/net/route
    output that can access a stale node pointer. From David Forster.

    6) Several sctp_diag fixes from Phil Sutter.

    7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

    8) Checksum fixup fixes in bpf from Daniel Borkmann.

    9) Memory leaks in nfnetlink, from Liping Zhang.

    10) Use after free in rxrpc, from David Howells.

    11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

    12) Calipso resource leak, from Colin Ian King.

    13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

    14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

    15) Fix lockdep splats in macsec, from Sabrina Dubroca.

    16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

    17) Various tc-action bug fixes, from CONG Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
    net_sched: allow flushing tc police actions
    net_sched: unify the init logic for act_police
    net_sched: convert tcf_exts from list to pointer array
    net_sched: move tc offload macros to pkt_cls.h
    net_sched: fix a typo in tc_for_each_action()
    net_sched: remove an unnecessary list_del()
    net_sched: remove the leftover cleanup_a()
    mlxsw: spectrum: Allow packets to be trapped from any PG
    mlxsw: spectrum: Unmap 802.1Q FID before destroying it
    mlxsw: spectrum: Add missing rollbacks in error path
    mlxsw: reg: Fix missing op field fill-up
    mlxsw: spectrum: Trap loop-backed packets
    mlxsw: spectrum: Add missing packet traps
    mlxsw: spectrum: Mark port as active before registering it
    mlxsw: spectrum: Create PVID vPort before registering netdevice
    mlxsw: spectrum: Remove redundant errors from the code
    mlxsw: spectrum: Don't return upon error in removal path
    i40e: check for and deal with non-contiguous TCs
    ixgbe: Re-enable ability to toggle VLAN filtering
    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
    ...

    Linus Torvalds
     

17 Aug, 2016

2 commits

  • cgroup_threadgroup_rwsem is acquired in read mode during process exit
    and fork. It is also grabbed in write mode during
    __cgroup_procs_write(). I've recently run into a scenario with lots
    of memory pressure and OOM and I am beginning to see

    systemd

    __switch_to+0x1f8/0x350
    __schedule+0x30c/0x990
    schedule+0x48/0xc0
    percpu_down_write+0x114/0x170
    __cgroup_procs_write.isra.12+0xb8/0x3c0
    cgroup_file_write+0x74/0x1a0
    kernfs_fop_write+0x188/0x200
    __vfs_write+0x6c/0xe0
    vfs_write+0xc0/0x230
    SyS_write+0x6c/0x110
    system_call+0x38/0xb4

    This thread is waiting on the reader of cgroup_threadgroup_rwsem to
    exit. The reader itself is under memory pressure and has gone into
    reclaim after fork. There are times the reader also ends up waiting on
    oom_lock as well.

    __switch_to+0x1f8/0x350
    __schedule+0x30c/0x990
    schedule+0x48/0xc0
    jbd2_log_wait_commit+0xd4/0x180
    ext4_evict_inode+0x88/0x5c0
    evict+0xf8/0x2a0
    dispose_list+0x50/0x80
    prune_icache_sb+0x6c/0x90
    super_cache_scan+0x190/0x210
    shrink_slab.part.15+0x22c/0x4c0
    shrink_zone+0x288/0x3c0
    do_try_to_free_pages+0x1dc/0x590
    try_to_free_pages+0xdc/0x260
    __alloc_pages_nodemask+0x72c/0xc90
    alloc_pages_current+0xb4/0x1a0
    page_table_alloc+0xc0/0x170
    __pte_alloc+0x58/0x1f0
    copy_page_range+0x4ec/0x950
    copy_process.isra.5+0x15a0/0x1870
    _do_fork+0xa8/0x4b0
    ppc_clone+0x8/0xc

    In the meantime, all processes exiting/forking are blocked, almost
    stalling the system.

    This patch moves the threadgroup_change_begin from before
    cgroup_fork() to just before cgroup_canfork(). There is no need to
    worry about threadgroup changes till the task is actually added to the
    threadgroup. This avoids having to call reclaim with
    cgroup_threadgroup_rwsem held.

    tj: Subject and description edits.

    Signed-off-by: Balbir Singh
    Acked-by: Zefan Li
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Tejun Heo

    Balbir Singh
     
  • Commit 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ")
    moved the trigger configuration call from the irqdomain mapping to
    the interrupt being actually requested.

    This patch failed to handle the case where we configure a chained
    interrupt, which doesn't get requested through the usual path.

    In order to solve this, let's call __irq_set_trigger just before
    starting the cascade interrupt. Special care must be taken to
    make the flow handler stick, as the .irq_set_type method could
    have reset it (it doesn't know we're dealing with a chained
    interrupt).

    Based on an initial patch by Jon Hunter.

    Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ")
    Reported-by: John Stultz
    Reported-by: Linus Walleij
    Tested-by: John Stultz
    Acked-by: Jon Hunter
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

16 Aug, 2016

2 commits

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.
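
    As a rough illustration of the kind of check that goes wrong after
    such a split, consider the sketch below. The enum and helper names are
    hypothetical stand-ins invented for this example; they are not the
    kernel's definitions and the patch itself may look quite different.

```c
#include <stdbool.h>

/* Hypothetical op codes for illustration; not the kernel's actual values. */
enum demo_op {
    DEMO_OP_READ,
    DEMO_OP_WRITE,
    DEMO_OP_DISCARD,
    DEMO_OP_SECURE_ERASE
};

/* A check written when "discard" covered both cases: after the split it
 * silently stops matching secure erase requests. */
static bool is_discard_only(enum demo_op op)
{
    return op == DEMO_OP_DISCARD;
}

/* The corrected form for call sites that mean "either kind of discard". */
static bool is_discard_or_secure_erase(enum demo_op op)
{
    return op == DEMO_OP_DISCARD || op == DEMO_OP_SECURE_ERASE;
}
```

    Call sites that really do mean plain discard only can keep the
    narrower test; the point is that each site has to be reviewed.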

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
     
  • rtree_next_node() walks the linked list of leaf nodes to find the next
    block of pages in the struct memory_bitmap. If it walks off the end of
    the list of nodes, it walks the list of memory zones to find the next
    region of memory. If it walks off the end of the list of zones, it
    returns false.

    This leaves the struct bm_position's node and zone pointers pointing
    at their respective struct list_heads in struct mem_zone_bm_rtree.

    memory_bm_find_bit() uses struct bm_position's node and zone pointers
    to avoid walking lists and trees if the next bit appears in the same
    node/zone. It handles these values being stale.

    Swap rtree_next_node()'s 'step then test' to 'test-next then step',
    so that if we reach the end of memory we return false and leave
    the node and zone pointers as they were.
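
    The difference between the two orderings can be sketched on a minimal
    circular list with a sentinel head. The struct and function names
    below are hypothetical simplifications, not the kernel's list_head,
    bm_position or rtree_next_node().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct node {
    struct node *next;
    int addr;           /* payload; meaningless on the list head */
};

struct position {
    struct node *head;  /* sentinel that closes the circular list */
    struct node *cur;   /* cached cursor, reused by later lookups */
};

/* 'Step then test': advances first, so at the end of the list the
 * cursor is left pointing at the sentinel rather than a real node. */
static bool next_step_then_test(struct position *pos)
{
    pos->cur = pos->cur->next;
    return pos->cur != pos->head;
}

/* 'Test-next then step': only advances when a real node follows, so
 * the cursor stays on the last valid node when the walk ends. */
static bool next_test_then_step(struct position *pos)
{
    if (pos->cur->next == pos->head)
        return false;
    pos->cur = pos->cur->next;
    return true;
}

/* Walk a two-node list to its end with the given iterator and report
 * whether the cursor still points at a real node afterwards. */
static bool cursor_valid_after_walk(bool (*next)(struct position *))
{
    struct node head, a, b;
    struct position pos;

    head.next = &a; head.addr = -1;
    a.next = &b;    a.addr = 10;
    b.next = &head; b.addr = 20;

    pos.head = &head;
    pos.cur = &a;
    while (next(&pos))
        ;
    return pos.cur != &head;
}
```

    With the first iterator the cursor ends on the sentinel, whose
    payload is not a real node; with the second it ends on the last node.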

    This fixes a panic on resume using AMD Seattle with 64K pages:
    [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
    [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
    [ 6.896453] PM: Using 3 thread(s) for decompression.
    [ 6.896453] PM: Loading and decompressing image data (5339 pages)...
    [ 7.318890] PM: Image loading progress: 0%
    [ 7.323395] Unable to handle kernel paging request at virtual address 00800040
    [ 7.330611] pgd = ffff000008df0000
    [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
    [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
    [ 7.349825] Modules linked in:
    [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
    [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
    [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
    [ 7.375020] PC is at set_bit+0x18/0x30
    [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
    [ 7.383362] pc : [] lr : [] pstate: 60000045
    [ 7.390743] sp : ffff8003c0283b00
    [ 7.473551]
    [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
    [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
    [ 7.800075] Call trace:
    [ 7.887097] [] set_bit+0x18/0x30
    [ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70
    [ 7.899172] [] snapshot_write_next+0x22c/0x47c
    [ 7.905166] [] load_image_lzo+0x754/0xa88
    [ 7.910725] [] swsusp_read+0x144/0x230
    [ 7.916025] [] load_image_and_restore+0x58/0x90
    [ 7.922105] [] software_resume+0x2f0/0x338
    [ 7.927752] [] do_one_initcall+0x38/0x11c
    [ 7.933314] [] kernel_init_freeable+0x14c/0x1ec
    [ 7.939395] [] kernel_init+0x10/0xfc
    [ 7.944520] [] ret_from_fork+0x10/0x40
    [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
    [ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
    [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    Here struct mem_zone_bm_rtree's start_pfn has been returned instead of
    struct rtree_node's addr as the node/zone pointers are corrupt after
    we walked off the end of the lists during mark_unsafe_pages().

    This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate:
    Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call
    duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking
    off the end of the memory bitmap.

    Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree)
    Signed-off-by: James Morse
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki

    James Morse
     

13 Aug, 2016

5 commits

  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to change this as long as v4.8
    is not yet officially released. Thus, change the helper enum and rename
    related bits.
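
    The naming distinction can be sketched as follows. The struct and
    both helpers are hypothetical illustrations, not the BPF helper or
    the kernel's cgroup API, and treating "under" as inclusive of the
    cgroup itself is an assumption made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal cgroup node: only the parent link matters here. */
struct demo_cgroup {
    struct demo_cgroup *parent;
};

/* "in": the object belongs to exactly this cgroup. */
static bool demo_in_cgroup(const struct demo_cgroup *obj_cg,
                           const struct demo_cgroup *cg)
{
    return obj_cg == cg;
}

/* "under": the object's cgroup lies in the subhierarchy rooted at cg
 * (assumed here to include cg itself). */
static bool demo_under_cgroup(const struct demo_cgroup *obj_cg,
                              const struct demo_cgroup *cg)
{
    for (; obj_cg; obj_cg = obj_cg->parent)
        if (obj_cg == cg)
            return true;
    return false;
}
```

    For a hierarchy root -> child -> leaf, a task in leaf is "under" root
    but not "in" root, which is the behaviour the renamed helper tests.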

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Pull power management fixes from Rafael Wysocki:
    "Two hibernation fixes allowing it to work with the recently added
    randomization of the kernel identity mapping base on x86-64 and one
    cpufreq driver regression fix.

    Specifics:

    - Fix the x86 identity mapping creation helpers to avoid the
    assumption that the base address of the mapping will always be
    aligned at the PGD level, as it may be aligned at the PUD level if
    address space randomization is enabled (Rafael Wysocki).

    - Fix the hibernation core to avoid executing tracing functions
    before restoring the processor state completely during resume
    (Thomas Garnier).

    - Fix a recently introduced regression in the powernv cpufreq driver
    that causes it to crash due to an out-of-bounds array access
    (Akshay Adiga)"

    * tag 'pm-4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Linus Torvalds
     
  • Pull timer fixes from Ingo Molnar:
    "Misc fixes: a /dev/rtc regression fix, two APIC timer period
    calibration fixes, an ARM clocksource driver fix and a NOHZ
    power use regression fix"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
    x86/timers/apic: Inform TSC deadline clockevent device about recalibration
    x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
    timers: Fix get_next_timer_interrupt() computation
    clocksource/arm_arch_timer: Force per-CPU interrupt to be level-triggered

    Linus Torvalds
     
  • * pm-sleep:
    PM / hibernate: Restore processor state before using per-CPU variables
    x86/power/64: Always create temporary identity mapping correctly

    * pm-cpufreq:
    cpufreq: powernv: Fix crash in gpstate_timer_handler()

    Rafael J. Wysocki
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: cputime fixes, two deadline scheduler fixes and a cgroups
    scheduling fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Fix omitted ticks passed in parameter
    sched/cputime: Fix steal time accounting
    sched/deadline: Fix lock pinning warning during CPU hotplug
    sched/cputime: Mitigate performance regression in times()/clock_gettime()
    sched/fair: Fix typo in sync_throttle()
    sched/deadline: Fix wrap-around in DL heap

    Linus Torvalds