19 Feb, 2016

1 commit

  • The pmem driver calls devm_memremap() to map a persistent memory range.
    When the pmem driver is unloaded, this memremap'd range is not released,
    so the kernel will leak a vma.

    Fix devm_memremap_release() to handle a given memremap'd address
    properly.

    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Cc: Christoph Hellwig
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

15 Feb, 2016

1 commit


12 Feb, 2016

3 commits

  • The pfn_t type uses an unsigned long to store a pfn + flags value. On a
    64-bit platform the upper 12 bits of an unsigned long are never used for
    storing the value of a pfn. However, this is not true on highmem
    platforms: all 32 bits of a pfn value are used to address a 44-bit
    physical address space, so a pfn_t needs to store a 64-bit value.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=112211
    Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    Signed-off-by: Dan Williams
    Reported-by: Stuart Foster
    Reported-by: Julian Margetson
    Tested-by: Julian Margetson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Mike said:

    : CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
    : kernel with CONFIG_UBSAN_ALIGNMENT fails to load without even any error
    : message.
    :
    : The problem is that ubsan callbacks use spinlocks and might be called
    : before lockdep is initialized. Particularly this line in the
    : reserve_ebda_region function causes problem:
    :
    : lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    :
    : If I put lockdep_init() before reserve_ebda_region call in
    : x86_64_start_reservations kernel loads well.

    Fix this ordering issue permanently: change lockdep so that it uses
    hlists for the hash tables. Unlike a list_head, an hlist_head is in its
    initialized state when it is all-zeroes, so lockdep is ready for
    operation immediately upon boot - lockdep_init() need not have run.

    The patch will also save some memory.

    lockdep_init() and lockdep_initialized can be done away with now - a 4.6
    patch has been prepared to do this.

    Reported-by: Mike Krinkin
    Suggested-by: Mike Krinkin
    Cc: Andrey Ryabinin
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Pull networking fixes from David Miller:

    1) Fix BPF handling of branch offset adjustments on backjumps, from
    Daniel Borkmann.

    2) Make sure selinux knows about SOCK_DESTROY netlink messages, from
    Lorenzo Colitti.

    3) Fix openvswitch tunnel mtu regression, from David Wragg.

    4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric
    Dumazet.

    5) Fix SCTP user hmacid byte ordering bug, from Xin Long.

    6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov
    Kasiviswanathan.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    bpf: fix branch offset adjustment on backjumps after patching ctx expansion
    vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
    geneve: Relax MTU constraints
    vxlan: Relax MTU constraints
    flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
    of: of_mdio: Add marvell,88e1145 to whitelist of PHY compatibilities.
    selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
    sctp: translate network order to host order when users get a hmacid
    enic: increment devcmd2 result ring in case of timeout
    tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
    net:Add sysctl_max_skb_frags
    tcp: do not drop syn_recv on all icmp reports
    ipv6: fix a lockdep splat
    unix: correctly track in-flight fds in sending process user_struct
    update be2net maintainers' email addresses
    dwc_eth_qos: Reset hardware before PHY start
    ipv6: addrconf: Fix recursive spin lock call

    Linus Torvalds
     

11 Feb, 2016

4 commits

  • When ctx access is used, the kernel often needs to expand/rewrite
    instructions, so after that patching, branch offsets have to be
    adjusted for both forward and backward jumps in the new eBPF program,
    but for backward jumps it fails to account for the delta. Meaning, for
    example, if the expansion happens exactly on the insn that sits at
    the jump target, it doesn't fix up the back jump offset.

    Analysis on what the check in adjust_branches() is currently doing:

    /* adjust offset of jmps if necessary */
    if (i < pos && i + insn->off + 1 > pos)
            insn->off += delta;
    else if (i > pos && i + insn->off + 1 < pos)
            insn->off -= delta;

    First condition (forward jumps):

    [before/after insns[] layout diagram lost in extraction]

    If i + insn->off + 1 == pos, we jump to that newly patched
    instruction, so no offset adjustment is needed. That part is correct.

    Second condition (backward jumps):

    [before/after insns[] layout diagram lost in extraction]

    The i > pos part of the test is okay only by itself. However, i +
    insn->off + 1 < pos does not always work as intended to trigger the
    adjustment. It works when jump targets would be far off where the
    delta wouldn't matter. But, for example, where the fixed insn->off
    before pointed to pos (target_Y), it now points to pos + delta, so
    that additional room needs to be taken into account for the check.
    This means that i) both tests here need to be adjusted into pos + delta,
    and ii) for the second condition, the test needs to be <= as opposed
    to <.
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull cgroup fixes from Tejun Heo:

    - The destruction path of cgroup objects is asynchronous and
    multi-staged, and some of them ended up destroying parents before
    children, leading to failures in the cpu and memory controllers.
    Ensure that parents are always destroyed after children.

    - cpuset mm node migration was performed synchronously while holding
    threadgroup and cgroup mutexes and the recent threadgroup locking
    update resulted in a possible deadlock. The migration is best effort
    and shouldn't have been performed under those locks to begin with.
    Made asynchronous.

    - Minor documentation fix.

    * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Documentation: cgroup: Fix 'cgroup-legacy' -> 'cgroup-v1'
    cgroup: make sure a parent css isn't freed before its children
    cgroup: make sure a parent css isn't offlined before its children
    cpuset: make mm migration asynchronous

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "Workqueue fixes for v4.5-rc3.

    - Remove a spurious triggering of flush dependency warning.

    - Officially break local execution guarantee of unbound work items
    and add a debug feature to flush out usages which depend on it.

    - Work around CPU -> NODE mapping becoming invalid on CPU offline.

    The branch is young but pushing out early as stable kernels are being
    affected"

    * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
    workqueue: implement "workqueue.debug_force_rr_cpu" debug feature
    workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
    Revert "workqueue: make sure delayed work run in local cpu"
    workqueue: skip flush dependency checks for legacy workqueues

    Linus Torvalds
     
  • When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node. However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.

    This has always been broken but hasn't triggered often enough before
    874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
    After the commit, workqueue forcefully assigns the local CPU for
    delayed work items without an explicit target CPU to fix a different
    issue. This widens the window in which a CPU can go offline while a
    delayed work item is pending, causing delayed work items to be
    dispatched with their target CPU set to an already offlined CPU. The
    resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
    item on a NULL pool_workqueue and thus crash.

    While 874bbfe600a6 has been reverted for a different reason making the
    bug less visible again, it can still happen. Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
    This is a temporary workaround. The long term solution is keeping CPU
    -> NODE mapping stable across CPU off/online cycles which is being
    worked on.

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Cc: Tang Chen
    Cc: Rafael J. Wysocki
    Cc: Len Brown
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

    Tejun Heo
     

10 Feb, 2016

4 commits

  • Pull module fixes from Rusty Russell:
    "Fix for async_probe module param added in 4.3 (clearly not widely used
    yet), and a much more interesting kallsyms race which has been around
    approximately forever. This fix is more invasive, and will require
    some care in backporting, but I hated all the bandaids I could think
    of, so...

    There are some more coming, which are only for breakages introduced
    this cycle (livepatch), but wanted these in now"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    modules: fix longstanding /proc/kallsyms vs module insertion race.
    module: wrapper for symbol name.
    modules: fix modparam async_probe request

    Linus Torvalds
     
  • Workqueue used to guarantee local execution for work items queued
    without explicit target CPU. The guarantee is gone now which can
    break some usages in subtle ways. To flush out those cases, this
    patch implements a debug feature which forces round-robin CPU
    selection for all such work items.

    The debug feature defaults to off and can be enabled with a kernel
    parameter. The default can be flipped with a debug config option.

    If you hit this commit during bisection, please refer to 041bd12e272c
    ("Revert "workqueue: make sure delayed work run in local cpu"") for
    more information and ping me.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • WORK_CPU_UNBOUND work items queued to a bound workqueue always run
    locally. This is a good thing normally, but not when the user has
    asked us to keep unbound work away from certain CPUs. Round robin
    these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
    trumps performance.

    tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty
    (wq_unbound_cpumask AND cpu_online_mask). If we want that, it
    should be done when config changes.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Tejun Heo

    Mike Galbraith
     
  • This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

    Workqueue used to implicitly guarantee that work items queued without
    explicit CPU specified are put on the local CPU. Recent changes in
    timer broke the guarantee and led to vmstat breakage which was fixed
    by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU
    we need it to run on").

    vmstat is the most likely to expose the issue and it's quite possible
    that there are other similar problems which are a lot more difficult
    to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
    sure delayed work run in local cpu") was applied to restore the local
    CPU guarantee. Unfortunately, the change exposed a bug in timer code
    which got fixed by 22b886dd1018 ("timers: Use proper base migration in
    add_timer_on()"). Due to code restructuring, the commit couldn't be
    backported beyond certain point and stable kernels which only had
    874bbfe600a6 started crashing.

    The local CPU guarantee was accidental more than anything else and we
    want to get rid of it anyway. As, with the vmstat case fixed,
    874bbfe600a6 is causing more problems than it's fixing, it has been
    decided to take the chance and officially break the guarantee by
    reverting the commit. A debug feature will be added to force foreign
    CPU assignment to expose cases relying on the guarantee and fixes for
    the individual cases will be backported to stable as necessary.

    Signed-off-by: Tejun Heo
    Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
    Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
    Cc: stable@vger.kernel.org
    Cc: Mike Galbraith
    Cc: Henrique de Moraes Holschuh
    Cc: Daniel Bilik
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Sasha Levin
    Cc: Ben Hutchings
    Cc: Thomas Gleixner
    Cc: Daniel Bilik
    Cc: Jiri Slaby
    Cc: Michal Hocko

    Tejun Heo
     

09 Feb, 2016

1 commit

  • check_prev_add() caches the saved stack trace in a static trace variable
    to avoid duplicate save_trace() calls in dependencies involving trylocks.
    But that caching logic contains a bug: we may not save the trace on the
    first iteration due to an early return from check_prev_add(). Then, on
    the second iteration, when we actually need the trace, we don't save it
    because we think that we've already saved it.

    Let check_prev_add() itself control when the stack is saved.

    There is another bug: the trace variable is protected by the graph lock,
    but we can temporarily release the graph lock during printing.

    Fix this by invalidating the cached stack trace when we release the
    graph lock.

    Signed-off-by: Dmitry Vyukov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: glider@google.com
    Cc: kcc@google.com
    Cc: peter@hurleysoftware.com
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1454593240-121647-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     

06 Feb, 2016

1 commit

  • A random wakeup can get us out of sigsuspend() without TIF_SIGPENDING
    being set.

    Avoid that by making sure we were signaled, like sys_pause() does.

    Signed-off-by: Sasha Levin
    Acked-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Dmitry Vyukov
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

04 Feb, 2016

1 commit


03 Feb, 2016

3 commits

  • For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
    There's one full copy, marked SHF_ALLOC and laid out at the end of the
    module's init section. There's also a cut-down version that only
    contains core symbols and strings, and lives in the module's core
    section.

    After module init (and before we free the module memory), we switch
    the mod->symtab, mod->num_symtab and mod->strtab to point to the core
    versions. We do this under the module_mutex.

    However, kallsyms doesn't take the module_mutex: it uses
    preempt_disable() and rcu tricks to walk through the modules, because
    it's used in the oops path. It's also used in /proc/kallsyms.
    There's nothing atomic about the change of these variables, so we can
    get the old (larger!) num_symtab and the new symtab pointer; in fact
    this is what I saw when trying to reproduce.

    By grouping these variables together, we can use a
    carefully-dereferenced pointer to ensure we always get one or the
    other (the free of the module init section is already done in an RCU
    callback, so that's safe). We allocate the init one at the end of the
    module init section, and keep the core one inside the struct module
    itself (it could also have been allocated at the end of the module
    core, but that's probably overkill).

    Reported-by: Weilong Chen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
    Cc: stable@kernel.org
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • This trivial wrapper adds clarity and makes the following patch
    smaller.

    Cc: stable@kernel.org
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Commit f2411da746985 ("driver-core: add driver module
    asynchronous probe support") added async probe support,
    in two forms:

    * in-kernel driver specification annotation
    * generic async_probe module parameter (modprobe foo async_probe)

    To support the generic kernel parameter, parse_args() was
    extended via commit ecc8617053e0 ("module: add extra
    argument for parse_params() callback"); however, commit
    f2411da746985 failed to add the required argument.

    This then causes a crash whenever the generic async_probe
    module parameter is used. This was overlooked when the in-kernel
    form of async probe support was reworked. Fix this as originally
    intended.

    Cc: Hannes Reinecke
    Cc: Dmitry Torokhov
    Cc: stable@vger.kernel.org (4.2+)
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Rusty Russell [minimized]

    Luis R. Rodriguez
     

02 Feb, 2016

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3) A fix for pfn_t usage in vm_insert_mixed that led to a NULL
    pointer dereference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     

01 Feb, 2016

6 commits

  • Pull timer fixes from Thomas Gleixner:
    "The timer department delivers:

    - a regression fix for the NTP code along with a proper selftest
    - prevent a spurious timer interrupt in the NOHZ lowres code
    - a fix for user space interfaces returning the remaining time on
    architectures with CONFIG_TIME_LOW_RES=y
    - a few patches to fix COMPILE_TEST fallout"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick/nohz: Set the correct expiry when switching to nohz/lowres mode
    clocksource: Fix dependencies for archs w/o HAS_IOMEM
    clocksource: Select CLKSRC_MMIO where needed
    tick/sched: Hide unused oneshot timer code
    kselftests: timers: Add adjtimex SETOFFSET validity tests
    ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO
    itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
    hrtimer: Handle remaining time proper for TIME_LOW_RES
    clockevents/tcb_clksrc: Prevent disabling an already disabled clock

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Three small fixes in the scheduler/core:

    - use after free in the numa code
    - crash in the numa init code
    - a simple spelling fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    pid: Fix spelling in comments
    sched/numa: Fix use-after-free bug in the task_numa_compare
    sched: Fix crash in sched_init_numa()

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "This is much bigger than typical fixes, but Peter found a category of
    races that spurred more fixes and more debugging enhancements. Work
    started before the merge window, but got finished only now.

    Aside of that this contains the usual small fixes to perf and tools.
    Nothing particularly exciting"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
    perf: Remove/simplify lockdep annotation
    perf: Synchronously clean up child events
    perf: Untangle 'owner' confusion
    perf: Add flags argument to perf_remove_from_context()
    perf: Clean up sync_child_event()
    perf: Robustify event->owner usage and SMP ordering
    perf: Fix STATE_EXIT usage
    perf: Update locking order
    perf: Remove __free_event()
    perf/bpf: Convert perf_event_array to use struct file
    perf: Fix NULL deref
    perf/x86: De-obfuscate code
    perf/x86: Fix uninitialized value usage
    perf: Fix race in perf_event_exit_task_context()
    perf: Fix orphan hole
    perf stat: Do not clean event's private stats
    perf hists: Fix HISTC_MEM_DCACHELINE width setting
    perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
    perf tests: Remove wrong semicolon in while loop in CQM test
    perf: Synchronously free aux pages in case of allocation failure
    ...

    Linus Torvalds
     
  • Pull locking fix from Thomas Gleixner:
    "A single commit, which makes the rtmutex.wait_lock an irq safe lock.

    This prevents a potential deadlock which can be triggered by the rcu
    boosting code from rcu_read_unlock()"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rtmutex: Make wait_lock irq safe

    Linus Torvalds
     
  • Pull IRQ fixes from Ingo Molnar:
    "Mostly irqchip driver fixes, but also an irq core crash fix and a
    build fix"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mxs: Add missing set_handle_irq()
    irqchip/atmel-aic: Fix wrong bit operation for IRQ priority
    irqchip/gic-v3-its: Recompute the number of pages on page size change
    base: Export platform_msi_domain_[alloc,free]_irqs
    of: MSI: Simplify irqdomain lookup
    irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token
    irqchip: Fix dependencies for archs w/o HAS_IOMEM
    irqchip/s3c24xx: Mark init_eint as __maybe_unused
    genirq: Validate action before dereferencing it in handle_irq_event_percpu()

    Linus Torvalds
     
  • A dma_addr_t is potentially smaller than a phys_addr_t on some archs.
    Don't truncate the address when doing the pfn conversion.

    Cc: Ross Zwisler
    Reported-by: Matthew Wilcox
    [willy: fix pfn_t_to_phys as well]
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Jan, 2016

7 commits

  • Accidentally discovered this typo when I studied this module.

    Signed-off-by: Zhen Lei
    Cc: Hanjun Guo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tianhong Ding
    Cc: Xinwei Hu
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1454119457-11272-1-git-send-email-thunder.leizhen@huawei.com
    Signed-off-by: Ingo Molnar

    Zhen Lei
     
  • to_vmem_altmap() needs to return valid results until
    arch_remove_memory() completes. It also needs to be valid for any pfn
    in a section regardless of whether that pfn maps to data. This escape
    was a result of a bug in the unit test.

    The signature of this bug is that free_pagetable() fails to retrieve a
    vmem_altmap and goes off into the weeds:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] get_pfnblock_flags_mask+0x49/0x60
    [..]
    Call Trace:
    [] free_hot_cold_page+0x97/0x1d0
    [] __free_pages+0x2a/0x40
    [] free_pagetable+0x8c/0xd4
    [] remove_pagetable+0x37a/0x808
    [] vmemmap_free+0x10/0x20

    Fixes: 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
    Cc: Andrew Morton
    Reported-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Pull power management and ACPI fixes from Rafael Wysocki:
    "These are: cpuidle fixes (including one fix for a recent regression),
    cpufreq fixes (including fixes for two issues introduced during the
    4.2 cycle), generic power domains framework fixes (two locking fixes
    and one cleanup), one locking fix in the ACPI-based PCI hotplug
    framework (ACPIPHP), removal of one ACPI backlight blacklist entry
    that isn't necessary any more and a PM Kconfig cleanup.

    Specifics:

    - Fix a recent cpuidle core regression that broke suspend-to-idle on
    all systems where cpuidle drivers don't provide ->enter_freeze
    callbacks for any states (Sudeep Holla).

    - Drop an unnecessary symbol definition from the cpuidle core code
    handling coupled CPU cores (Anders Roxell).

    - Fix a race condition related to governor initialization and removal
    in the cpufreq core (Viresh Kumar).

    - Clean up the cpufreq core to use list_is_last() for checking if the
    given policy object is the last element of a list instead of open
    coding that in a clumsy way (Gautham R Shenoy).

    - Fix compiler warnings in the pxa2xx and cpufreq-dt cpufreq drivers
    (Arnd Bergmann).

    - Fix two locking issues and clean up a comment in the generic power
    domains framework (Ulf Hansson, Marek Szyprowski, Moritz Fischer).

    - Fix the error code path of one function in the ACPI-based PCI
    hotplug framework (ACPIPHP) that forgets to release a lock acquired
    previously (Insu Yun).

    - Drop the ACPI backlight blacklist entry for Dell Inspiron 5737 that
    is not necessary any more (Hans de Goede).

    - Clean up the top-level PM Kconfig to stop requiring APM emulation
    to depend on PM which in fact isn't necessary (Arnd Bergmann)"

    * tag 'pm+acpi-4.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
    cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
    PM: APM_EMULATION does not depend on PM
    cpufreq: Use list_is_last() to check last entry of the policy list
    cpufreq: Fix NULL reference crash while accessing policy->governor_data
    cpuidle: coupled: remove unused define cpuidle_coupled_lock
    PM / Domains: Fix typo in comment
    PM / Domains: Fix potential deadlock while adding/removing subdomains
    ACPI / PCI / hotplug: unlock in error path in acpiphp_enable_slot()
    ACPI: Revert "ACPI / video: Add Dell Inspiron 5737 to the blacklist"
    cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze
    PM / domains: fix lockdep issue for all subdomains

    Linus Torvalds
     
  • * pm-cpuidle:
    cpuidle: coupled: remove unused define cpuidle_coupled_lock
    cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze

    * pm-cpufreq:
    cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
    cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
    cpufreq: Use list_is_last() to check last entry of the policy list
    cpufreq: Fix NULL reference crash while accessing policy->governor_data

    * pm-domains:
    PM / Domains: Fix typo in comment
    PM / Domains: Fix potential deadlock while adding/removing subdomains
    PM / domains: fix lockdep issue for all subdomains

    * pm-sleep:
    PM: APM_EMULATION does not depend on PM

    Rafael J. Wysocki
     
  • Pull security layer fixes from James Morris:
    "The keys patch fixes a bug which is breaking kerberos, and the seccomp
    fix addresses a no_new_privs bypass"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    KEYS: Only apply KEY_FLAG_KEEP to a key if a parent keyring has it set
    seccomp: always propagate NO_NEW_PRIVS on tsync

    Linus Torvalds
     
  • fca839c00a12 ("workqueue: warn if memory reclaim tries to flush
    !WQ_MEM_RECLAIM workqueue") implemented a flush dependency warning which
    triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
    flush a !WQ_MEM_RECLAIM workqueue.

    This assumes that workqueues marked with WQ_MEM_RECLAIM sit in the
    memory reclaim path, and making them depend on something which may
    need more memory to make forward progress can lead to deadlocks.
    Unfortunately, workqueues created with the legacy create*_workqueue()
    interface always have WQ_MEM_RECLAIM regardless of whether they are
    depended upon for memory reclaim or not. These spurious WQ_MEM_RECLAIM
    markings cause spurious triggering of the flush dependency checks.

    WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
    workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ...
    Workqueue: deferwq deferred_probe_work_func
    [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [] (show_stack) from [] (dump_stack+0x94/0xd4)
    [] (dump_stack) from [] (warn_slowpath_common+0x80/0xb0)
    [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x30/0x40)
    [] (warn_slowpath_fmt) from [] (check_flush_dependency+0x138/0x144)
    [] (check_flush_dependency) from [] (flush_work+0x50/0x15c)
    [] (flush_work) from [] (lru_add_drain_all+0x130/0x180)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd8/0x338)
    [] (alloc_contig_range) from [] (cma_alloc+0xe0/0x1ac)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (__dma_alloc+0x240/0x278)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x54/0x5c)
    [] (arm_dma_alloc) from [] (dmam_alloc_coherent+0xc0/0xec)
    [] (dmam_alloc_coherent) from [] (ahci_port_start+0x150/0x1dc)
    [] (ahci_port_start) from [] (ata_host_start.part.3+0xc8/0x1c8)
    [] (ata_host_start.part.3) from [] (ata_host_activate+0x50/0x148)
    [] (ata_host_activate) from [] (ahci_host_activate+0x44/0x114)
    [] (ahci_host_activate) from [] (ahci_platform_init_host+0x1d8/0x3c8)
    [] (ahci_platform_init_host) from [] (tegra_ahci_probe+0x448/0x4e8)
    [] (tegra_ahci_probe) from [] (platform_drv_probe+0x50/0xac)
    [] (platform_drv_probe) from [] (driver_probe_device+0x214/0x2c0)
    [] (driver_probe_device) from [] (bus_for_each_drv+0x60/0x94)
    [] (bus_for_each_drv) from [] (__device_attach+0xb0/0x114)
    [] (__device_attach) from [] (bus_probe_device+0x84/0x8c)
    [] (bus_probe_device) from [] (deferred_probe_work_func+0x68/0x98)
    [] (deferred_probe_work_func) from [] (process_one_work+0x120/0x3f8)
    [] (process_one_work) from [] (worker_thread+0x38/0x55c)
    [] (worker_thread) from [] (kthread+0xdc/0xf4)
    [] (kthread) from [] (ret_from_fork+0x14/0x3c)

    Fix it by marking workqueues created via create*_workqueue() with
    __WQ_LEGACY and disabling flush dependency checks on them.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Thierry Reding
    Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
    Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")

    Tejun Heo
     
  • When a max stack trace is discovered, the stack dump is saved. In order to
    not record the overhead of the stack tracer, the ip of the traced function
    is looked for within the dump. The trace is started from the location of
    that function. But if for some reason the ip is not found, the entire stack
    trace is then truncated. That's not very useful. Instead, print everything
    if the ip of the traced function is not found within the trace.

    This issue showed up on s390.

    Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home

    Fixes: 72ac426a5bb0 ("tracing: Clean up stack tracing and fix fentry updates")
    Cc: stable@vger.kernel.org # v4.3+
    Reported-by: Heiko Carstens
    Tested-by: Heiko Carstens
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

29 Jan, 2016

7 commits

  • Now that the perf_event_ctx_lock_nested() call has moved from
    put_event() into perf_event_release_kernel() the first reason is no
    longer valid as that can no longer happen.

    The second reason seems to have been invalidated when Al Viro made fput()
    unconditionally async in the following commit:

    4a9d4b024a31 ("switch fput to task_work_add")

    such that munmap()->fput()->release()->perf_release() would no longer happen.

    Therefore, remove the annotation. This should increase the efficiency
    of lockdep coverage of perf locking.

    Suggested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The orphan cleanup workqueue doesn't always catch orphans, for example,
    if they never schedule after they are orphaned. IOW, the event leak is
    still very real. It also wouldn't work for kernel counters.

    Doing it synchronously is a little hairy due to lock-inversion issues,
    but it is made to work.

    Patch based on work by Alexander Shishkin.

    Suggested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There are two concepts of owner wrt an event and they are conflated:

    - event::owner / event::owner_list,
    used by prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE).

    - the 'owner' of the event object, typically the file descriptor.

    Currently these two concepts are conflated, which causes trouble with
    scm_rights passing of file descriptors. Passing the event fd and then
    exiting the creating task would render the event an 'orphan' and have
    it cleared out, unlike what is expected.

    This patch untangles these two concepts by using PERF_EVENT_STATE_EXIT
    to denote the second type.

    Reported-by: Alexei Starovoitov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation to adding more options, convert the boolean argument
    into a flags word.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sync_child_event() has outgrown its purpose, it does far too much.
    Bring it back to its named purpose.

    Rename __perf_event_exit_task() to perf_event_exit_event() to better
    reflect what it does and move the event->state assignment under the
    ctx->lock, like state changes ought to be.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Use smp_store_release() to clear event->owner and
    lockless_dereference() to observe it. Further use READ_ONCE() for all
    lockless reads.

    This changes perf_remove_from_owner() to leave event->owner cleared.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We should never attempt to enable a STATE_EXIT event.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra