08 Oct, 2016

9 commits

  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.
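
    As a rough illustration of the address check described above, the
    following user-space sketch models a text section as a half-open
    address range and decides whether to emit the one-line summary. The
    section bounds and the names TextSection and nmi_backtrace_line are
    hypothetical; the kernel obtains the real bounds from the linker.

```python
from typing import NamedTuple, Optional


class TextSection(NamedTuple):
    """Half-open address range [start, end) of a linked text section."""
    start: int
    end: int

    def contains(self, pc: int) -> bool:
        return self.start <= pc < self.end


# Hypothetical bounds standing in for the linker-provided section limits.
CPUIDLE_TEXT = TextSection(start=0xFFFF_0000_1000, end=0xFFFF_0000_2000)


def nmi_backtrace_line(cpu: int, pc: int,
                       section: TextSection = CPUIDLE_TEXT) -> Optional[str]:
    """Return the one-line summary for an idle CPU, or None for a full dump."""
    if section.contains(pc):
        return f"NMI backtrace for {cpu} skipped: idling at pc {pc:#x}"
    return None
```

    A caller would print the returned line when it is not None and fall
    back to the full backtrace otherwise.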

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP (Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1 The first time it uses the global huge zero page;
    2 The time when mm_users of its mm_struct reaches zero.
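
    The two cases above can be sketched in user space as follows. This is
    a simplified model, not the kernel code: HugeZeroPage and Mm are
    hypothetical stand-ins, and the boolean used_huge_zero_page plays the
    role of MMF_USED_HUGE_ZERO_PAGE.

```python
class HugeZeroPage:
    """Global huge zero page with a reference counter (simulated)."""

    def __init__(self) -> None:
        self.refcount = 0
        self.counter_accesses = 0  # how often the shared counter is touched

    def get(self) -> None:
        self.refcount += 1
        self.counter_accesses += 1

    def put(self) -> None:
        self.refcount -= 1
        self.counter_accesses += 1


class Mm:
    """Stand-in for mm_struct; only the flag relevant here is modeled."""

    def __init__(self, page: HugeZeroPage) -> None:
        self.page = page
        self.used_huge_zero_page = False  # models MMF_USED_HUGE_ZERO_PAGE

    def read_fault(self) -> None:
        # Only the first fault in this mm takes a global reference.
        if not self.used_huge_zero_page:
            self.page.get()
            self.used_huge_zero_page = True

    def teardown(self) -> None:
        # Called once, when the last user of the mm goes away.
        if self.used_huge_zero_page:
            self.page.put()
            self.used_huge_zero_page = False
```

    However many read faults an mm takes, the shared counter is touched
    at most twice per mm in this model.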

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, a kthread is not eligible to use the huge
    zero page either. Since no kthread is using the huge zero page today,
    there is no difference after applying this patch. But if that is not
    desired, I can change it to when mm_count reaches zero.

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    57% drop.
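
    The relative changes can be recomputed from the throughput and runtime
    figures quoted above. The helper name percent_change is illustrative
    only.

```python
def percent_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent (negative means a drop)."""
    return (new - base) / base * 100.0


# Figures taken from the measurements quoted in the commit message.
throughput_change = percent_change(1784430792, 4726928591)
runtime_change = percent_change(707592, 303970)
```

    With these inputs the throughput change comes out just under +165%
    and the runtime change at about -57%.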

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • There are no users of exit_oom_victim on !current task anymore so enforce
    the API to always work on the current.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") has worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have a stable mm
    available and hide the oom_reaped mm with the MMF_OOM_SKIP flag. So let's
    remove exit_oom_victim; the race described in the above commit then
    no longer exists.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on its way to exit for an
    unbounded amount of time). The OOM killer can cope with that by checking
    mm flags and moving on to another victim, but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time the oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of timeout for back off so we can reuse the same value
    there.
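
    The idea of capping the wait can be sketched in user space with a
    condition variable and a timeout. This is only a model of the pattern;
    the class OomVictims and its method names are hypothetical and do not
    mirror the kernel's interfaces.

```python
import threading


class OomVictims:
    """Tracks outstanding victims and lets a caller wait for them, bounded."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._count = 0

    def mark_victim(self) -> None:
        with self._cond:
            self._count += 1

    def exit_victim(self) -> None:
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_victims(self, timeout: float) -> bool:
        """Return True if all victims exited within `timeout` seconds."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0,
                                       timeout=timeout)
```

    A caller that gets False back knows a victim is still stuck and can
    fail the operation instead of waiting for an unbounded time.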

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm, so we do not need the
    signal_struct counter anymore. Let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with an MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit about pgd_lock not being taken from the irq context. This
    means that __mmdrop called from free_signal_struct has to be postponed
    to a user context. We already have a similar mechanism for mmput_async
    so we can use it here as well. This is safe because mm_count is pinned
    by mm_users.
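
    The deferral pattern described above, where teardown is handed to a
    context that may safely take the lock, can be modeled in user space
    with a work queue and a worker thread. DeferredDropper and its methods
    are hypothetical names for this sketch and are not the kernel's
    workqueue API.

```python
import queue
import threading


class DeferredDropper:
    """Queues teardown work so it runs on a worker thread, not in the caller."""

    def __init__(self) -> None:
        self._work: "queue.Queue" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def drop_async(self, teardown) -> None:
        # Safe to call from a restricted context: it only enqueues the
        # callable and never runs the teardown itself.
        self._work.put(teardown)

    def flush(self) -> None:
        # Block until every queued teardown has run.
        self._work.join()

    def _run(self) -> None:
        while True:
            teardown = self._work.get()
            try:
                teardown()
            finally:
                self._work.task_done()
```

    The caller's only cost is the enqueue; the teardown itself always
    executes on the worker.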

    This fixes a bug introduced by "oom: keep mm of the killed task available".

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reap_task has to call exit_oom_victim in order to make sure that the
    oom victim will not block the oom killer forever. This is, however,
    opening new problems (e.g. oom_killer_disable exclusion - see commit
    74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
    race")). exit_oom_victim should ideally be called only from the victim's
    context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
    exit_mm
    tsk->mm = NULL;
    mmput
    __mmput
    exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal but this approach will make the code much more
    reasonable and long term we even might want to move task->mm into the
    signal_struct anyway. In the next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent,
    which would also be nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull trivial updates from Jiri Kosina:
    "The usual rocket science from the trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    tracing/syscalls: fix multiline in error message text
    lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
    doc: vfs: fix fadvise() sycall name
    x86/entry: spell EBX register correctly in documentation
    securityfs: fix securityfs_create_dir comment
    irq: Fix typo in tracepoint.xml

    Linus Torvalds
     
  • Pull livepatching updates from Jiri Kosina:

    - fix for patching modules that contain .altinstructions or
    .parainstructions sections, from Jessica Yu

    - make TAINT_LIVEPATCH a per-module flag (so that it's immediately
    clear which module caused the taint), from Josh Poimboeuf

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch/module: make TAINT_LIVEPATCH module-specific
    Documentation: livepatch: add section about arch-specific code
    livepatch/x86: apply alternatives and paravirt patches after relocations
    livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks

    Linus Torvalds
     

07 Oct, 2016

3 commits

  • Pull tracing updates from Steven Rostedt:
    "This release cycle is rather small. Just a few fixes to tracing.

    The big change is the addition of the hwlat tracer. It not only
    detects SMIs, but also other latency that's caused by the hardware. I
    have detected some latency from large boxes having bus contention"

    * tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Call traceoff trigger after event is recorded
    ftrace/scripts: Add helper script to bisect function tracing problem functions
    tracing: Have max_latency be defined for HWLAT_TRACER as well
    tracing: Add NMI tracing in hwlat detector
    tracing: Have hwlat trace migrate across tracing_cpumask CPUs
    tracing: Add documentation for hwlat_detector tracer
    tracing: Added hardware latency tracer
    ftrace: Access ret_stack->subtime only in the function profiler
    function_graph: Handle TRACE_BPUTS in print_graph_comment
    tracing/uprobe: Drop isdigit() check in create_trace_uprobe

    Linus Torvalds
     
  • Pull KVM updates from Radim Krčmář:
    "All architectures:
    - move `make kvmconfig` stubs from x86
    - use 64 bits for debugfs stats

    ARM:
    - Important fixes for not using an in-kernel irqchip
    - handle SError exceptions and present them to guests if appropriate
    - proxying of GICV access at EL2 if guest mappings are unsafe
    - GICv3 on AArch32 on ARMv8
    - preparations for GICv3 save/restore, including ABI docs
    - cleanups and a bit of optimizations

    MIPS:
    - A couple of fixes in preparation for supporting MIPS EVA host
    kernels
    - MIPS SMP host & TLB invalidation fixes

    PPC:
    - Fix the bug which caused guests to falsely report lockups
    - other minor fixes
    - a small optimization

    s390:
    - Lazy enablement of runtime instrumentation
    - up to 255 CPUs for nested guests
    - rework of machine check deliver
    - cleanups and fixes

    x86:
    - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
    - Hyper-V TSC page
    - per-vcpu tsc_offset in debugfs
    - accelerated INS/OUTS in nVMX
    - cleanups and fixes"

    * tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
    KVM: MIPS: Drop dubious EntryHi optimisation
    KVM: MIPS: Invalidate TLB by regenerating ASIDs
    KVM: MIPS: Split kernel/user ASID regeneration
    KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
    KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
    KVM: arm64: Require in-kernel irqchip for PMU support
    KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
    KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
    KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
    KVM: PPC: BookE: Fix a sanity check
    KVM: PPC: Book3S HV: Take out virtual core piggybacking code
    KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
    ARM: gic-v3: Work around definition of gic_write_bpr1
    KVM: nVMX: Fix the NMI IDT-vectoring handling
    KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
    KVM: nVMX: Fix reload apic access page warning
    kvmconfig: add virtio-gpu to config fragment
    config: move x86 kvm_guest.config to a common location
    arm64: KVM: Remove duplicating init code for setting VMID
    ARM: KVM: Support vgic-v3
    ...

    Linus Torvalds
     
  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per user per user
    namespace limits, and use them to limit the number of namespaces. The
    limits being per user mean that one user can not exhaust the limits of
    another user. The limits being per user namespace allow contexts where
    the limit is 0 and security conscious folks can remove from their
    threat analysis the code used to manage namespaces (as they have
    historically done as it is root only). At the same time the limits being
    per user namespace allow other parts of the system to use namespaces.

    Namespaces are increasingly being used in application sandboxing
    scenarios so an all or nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    A limit has also been added on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"
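
    A minimal user-space model of per user per user namespace limits
    might look like the sketch below. It assumes a single namespace level
    and leaves out any hierarchy between namespaces; UserNamespace, charge
    and uncharge are hypothetical names. The ENOSPC error follows the
    commit title listed below ("When the per user per user namespace limit
    is reached return ENOSPC").

```python
import errno
from collections import defaultdict
from typing import Dict


class UserNamespace:
    """Hypothetical model: one namespace with a per-user object limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Dict[int, int] = defaultdict(int)  # uid -> live objects


def charge(ns: UserNamespace, uid: int) -> None:
    """Charge one object to `uid` in `ns`; raise ENOSPC when the limit is hit."""
    if ns.counts[uid] >= ns.limit:
        raise OSError(errno.ENOSPC, "per user per user namespace limit reached")
    ns.counts[uid] += 1


def uncharge(ns: UserNamespace, uid: int) -> None:
    """Release one previously charged object."""
    if ns.counts[uid] <= 0:
        raise ValueError("uncharge without a matching charge")
    ns.counts[uid] -= 1
```

    Because counts are kept per uid, one user reaching the limit does not
    affect another, and a limit of 0 disables creation entirely in that
    namespace.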

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

06 Oct, 2016

1 commit

  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mvm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rather than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     

05 Oct, 2016

1 commit

  • Pull audit updates from Paul Moore:
    "Another relatively small pull request for v4.9 with just two patches.

    The patch from Richard updates the list of features we support and
    report back to userspace; this should have been sent earlier with the
    rest of the v4.8 patches but it got lost in my inbox.

    The second patch fixes a problem reported by our Android friends where
    we weren't very consistent in recording PIDs"

    * 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit:
    audit: add exclude filter extension to feature bitmap
    audit: consistently record PIDs with task_tgid_nr()

    Linus Torvalds
     

04 Oct, 2016

11 commits

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     
  • Pull irq updates from Thomas Gleixner:
    "The irq department proudly presents:

    - A rework of the core infrastructure to optimally spread interrupts
    for multiqueue devices. The first version was a bit naive and
    failed to take thread siblings and other details into account.
    Developed in cooperation with Christoph and Keith.

    - Proper delegation of softirqs to ksoftirqd, so if ksoftirqd is
    active then no further softirq processing on interrupt return
    happens. Otherwise we try to delegate and still run another batch
    of network packets in the irq return path, which then tries to
    delegate to ksoftirqd ...

    - A proper machine parseable sysfs based alternative for
    /proc/interrupts.

    - ACPI support for the GICV3-ITS and ARM interrupt remapping

    - Two new irq chips from the ARM SoC zoo: STM32-EXTI and MVEBU-PIC

    - A new irq chip for the JCore (SuperH)

    - The usual pile of small fixlets in core and irqchip drivers"
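
    The softirq delegation rule from the list above can be illustrated
    with a small state model: pending work is handled on interrupt return
    only when the per-CPU daemon is not already active. CpuSoftirqState
    and on_irq_exit are hypothetical names for this sketch, not kernel
    functions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CpuSoftirqState:
    pending: List[str] = field(default_factory=list)  # queued softirq work
    ksoftirqd_running: bool = False                   # daemon already active
    handled_inline: List[str] = field(default_factory=list)


def on_irq_exit(cpu: CpuSoftirqState) -> None:
    """Process pending softirqs inline only if ksoftirqd is not active."""
    if not cpu.pending:
        return
    if cpu.ksoftirqd_running:
        # Leave the work queued; the daemon will pick it up.
        return
    cpu.handled_inline.extend(cpu.pending)
    cpu.pending.clear()
```

    In this model a CPU whose daemon is already running never processes
    a second batch in the interrupt return path.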

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
    softirq: Let ksoftirqd do its job
    genirq: Make function __irq_do_set_handler() static
    ARM/dts: Add EXTI controller node to stm32f429
    ARM/STM32: Select external interrupts controller
    drivers/irqchip: Add STM32 external interrupts support
    Documentation/dt-bindings: Document STM32 EXTI controller bindings
    irqchip/mips-gic: Use for_each_set_bit to iterate over local IRQs
    pci/msi: Retrieve affinity for a vector
    genirq/affinity: Remove old irq spread infrastructure
    genirq/msi: Switch to new irq spreading infrastructure
    genirq/affinity: Provide smarter irq spreading infrastructure
    genirq/msi: Add cpumask allocation to alloc_msi_entry
    genirq: Expose interrupt information through sysfs
    irqchip/gicv3-its: Use MADT ITS subtable to do PCI/MSI domain initialization
    irqchip/gicv3-its: Factor out PCI-MSI part that might be reused for ACPI
    irqchip/gicv3-its: Probe ITS in the ACPI way
    irqchip/gicv3-its: Refactor ITS DT init code to prepare for ACPI
    irqchip/gicv3-its: Cleanup for ITS domain initialization
    PCI/MSI: Setup MSI domain on a per-device basis using IORT ACPI table
    ACPI: Add new IORT functions to support MSI domain handling
    ...

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "A rather smallish set of updates for timers and timekeeping:

    - Two core fixes to prevent potential undefined behaviour about
    which gcc is complaining rightfully.

    - A fix to prevent stopping the tick on a (soon) offline CPU so it
    can complete the shutdown procedure.

    - Wait for clocks to stabilize before making decisions, so a not yet
    validated clock is not rejected.

    - The usual pile of fixes to the various clocksource drivers.

    - Core code typo and include fixlets"
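
    Signed integer overflow is undefined behaviour in C, which is the kind
    of problem the two core fixes above address. One common way to avoid
    it is to add with well-defined wraparound and clamp the result. The
    sketch below models that idea with Python integers; saturating_add is
    an illustrative name and this is not the kernel's implementation.

```python
KTIME_MAX = (1 << 63) - 1  # largest signed 64-bit value


def saturating_add(lhs: int, rhs: int) -> int:
    """Add two non-negative 64-bit times, clamping to KTIME_MAX on overflow."""
    if not (0 <= lhs <= KTIME_MAX and 0 <= rhs <= KTIME_MAX):
        raise ValueError("operands must be non-negative signed 64-bit values")
    # Add modulo 2**64, as unsigned arithmetic would, so wraparound is
    # well defined; then detect results outside the signed range.
    res = (lhs + rhs) & ((1 << 64) - 1)
    if res > KTIME_MAX:
        return KTIME_MAX
    return res
```

    Sums that fit are returned unchanged, and anything that would exceed
    the signed 64-bit range is clamped to the maximum instead of wrapping.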

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timekeeping: Include the correct header for errno definitions
    clocksource/drivers/ti-32k: Prevent ftrace recursion
    clocksource/mips-gic-timer: Stop checking cpu_has_counter
    clocksource/mips-gic-timer: Print an error if IRQ setup fails
    tick/nohz: Prevent stopping the tick on an offline CPU
    clocksource/drivers/oxnas: Add OX820 compatible
    clocksource/drivers/timer-atmel-pit: Simplify IRQ handler
    clocksource/drivers/timer-atmel-pit: Remove uselesss WARN_ON_ONCE
    clocksource/drivers/timer-atmel-pit: Drop at91sam926x_pit_common_init
    clocksource/drivers/moxart: Replace panic by pr_err
    clocksource/drivers/moxart: Replace setup_irq by request_irq
    clocksource/drivers/moxart: Add Aspeed support
    clocksource/drivers/moxart: Use struct to hold state
    clocksource/drivers/moxart: Refactor enable/disable
    time: Avoid undefined behaviour in ktime_add_safe()
    time: Avoid undefined behaviour in timespec64_add_safe()
    timekeeping: Prints the amounts of time spent during suspend
    clocksource: Defer override invalidation unless clock is unstable
    hrtimer: Spelling fixes

    Linus Torvalds
     
  • Pull x86 vdso updates from Ingo Molnar:
    "The main changes in this cycle centered around adding support for
    32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
    Safonov"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
    x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
    x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
    x86/signal: Add SA_{X32,IA32}_ABI sa_flags
    x86/ptrace: Down with test_thread_flag(TIF_IA32)
    x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
    x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
    x86/vdso: Replace calculate_addr in map_vdso() with addr
    x86/vdso: Unmap vdso blob on vvar mapping failure

    Linus Torvalds
     
  • Pull low-level x86 updates from Ingo Molnar:
    "In this cycle this topic tree has become one of those 'super topics'
    that accumulated a lot of changes:

    - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
    x86 - preceded by an array of changes. v4.8 saw preparatory changes
    in this area already - this is the rest of the work. Includes the
    thread stack caching performance optimization. (Andy Lutomirski)

    - switch_to() cleanups and all around enhancements. (Brian Gerst)

    - A large number of dumpstack infrastructure enhancements and an
    unwinder abstraction. The secret long term plan is safe(r) live
    patching plus maybe another attempt at debuginfo based unwinding -
    but all these current bits are standalone enhancements in a frame
    pointer based debug environment as well. (Josh Poimboeuf)

    - More __ro_after_init and const annotations. (Kees Cook)

    - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

    [ The virtually mapped stack changes are pretty fundamental, and not
    x86-specific per se, even if they are only used on x86 right now. ]

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    x86/asm: Get rid of __read_cr4_safe()
    thread_info: Use unsigned long for flags
    x86/alternatives: Add stack frame dependency to alternative_call_2()
    x86/dumpstack: Fix show_stack() task pointer regression
    x86/dumpstack: Remove dump_trace() and related callbacks
    x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
    oprofile/x86: Convert x86_backtrace() to use the new unwinder
    x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
    perf/x86: Convert perf_callchain_kernel() to use the new unwinder
    x86/unwind: Add new unwind interface and implementations
    x86/dumpstack: Remove NULL task pointer convention
    fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
    sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
    lib/syscall: Pin the task stack in collect_syscall()
    x86/process: Pin the target stack in get_wchan()
    x86/dumpstack: Pin the target stack when dumping it
    kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
    sched/core: Add try_get_task_stack() and put_task_stack()
    x86/entry/64: Fix a minor comment rebase error
    iommu/amd: Don't put completion-wait semaphore on stack
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "The main changes are:

    - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

    - schedstat debugging enhancements, make it more broadly runtime
    available. (Josh Poimboeuf)

    - More work on asymmetric topology/capacity scheduling. (Morten
    Rasmussen)

    - sched/wait fixes and cleanups. (Oleg Nesterov)

    - PELT (per entity load tracking) improvements. (Peter Zijlstra)

    - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

    - sched/numa enhancements/fixes (Rik van Riel)

    - sched/cputime scalability improvements (Stanislaw Gruszka)

    - Load calculation arithmetics fixes. (Dietmar Eggemann)

    - sched/deadline enhancements (Tommaso Cucinotta)

    - Fix utilization accounting when switching to the SCHED_NORMAL
    policy. (Vincent Guittot)

    - ... plus misc cleanups and enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    sched/irqtime: Consolidate irqtime flushing code
    sched/irqtime: Consolidate accounting synchronization with u64_stats API
    u64_stats: Introduce IRQs disabled helpers
    sched/irqtime: Remove needless IRQs disablement on kcpustat update
    sched/irqtime: No need for preempt-safe accessors
    sched/fair: Fix min_vruntime tracking
    sched/debug: Add SCHED_WARN_ON()
    sched/core: Fix set_user_nice()
    sched/fair: Introduce set_curr_task() helper
    sched/core, ia64: Rename set_curr_task()
    sched/core: Fix incorrect utilization accounting when switching to fair class
    sched/core: Optimize SCHED_SMT
    sched/core: Rewrite and improve select_idle_siblings()
    sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
    sched/core: Introduce 'struct sched_domain_shared'
    sched/core: Restructure destroy_sched_domain()
    sched/core: Remove unused @cpu argument from destroy_sched_domain*()
    sched/wait: Introduce init_wait_entry()
    sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
    sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
    ...

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "The main kernel side changes were:

    - uprobes enhancements (Masami Hiramatsu)

    - Uncore group events enhancements (David Carrillo-Cisneros)

    - x86 Intel: Add support for Skylake server uncore PMUs (Kan Liang)

    - x86 Intel: LBR cleanups and enhancements, for better branch
    annotation tracking (Peter Zijlstra)

    - x86 Intel: Add support for PTWRITE and power event tracing
    (Alexander Shishkin)

    - ... various fixes, cleanups and smaller enhancements.

    Lots of tooling changes - a couple of highlights:

    - Support event group view with hierarchy mode in 'perf top' and
    'perf report' (Namhyung Kim)

    e.g.:

    $ perf record -e '{cycles,instructions}' make
    $ perf report --hierarchy --stdio
    ...
    # Overhead Command / Shared Object / Symbol
    # ...................... ..................................
    ...
    25.74% 27.18%  sh
    19.96% 24.14%    libc-2.24.so
    9.55% 14.64%      [.] __strcmp_sse2
    1.54% 0.00%      [.] __tfind
    1.07% 1.13%      [.] _int_malloc
    0.95% 0.00%      [.] __strchr_sse2
    0.89% 1.39%      [.] __tsearch
    0.76% 0.00%      [.] strlen

    - Add branch stack / basic block info to 'perf annotate --stdio',
    where for each branch, we add an asm comment after the instruction
    with information on how often it was taken and predicted. See
    example with color output at:

    http://vger.kernel.org/~acme/perf/annotate_basic_blocks.png

    (Peter Zijlstra)

    - Add support for using symbols in address filters with Intel PT and
    ARM CoreSight (hardware assisted tracing facilities) (Adrian
    Hunter, Mathieu Poirier)

    - Add support for interacting with Coresight PMU ETMs/PTMs, that are
    IP blocks to perform hardware assisted tracing on an ARM CPU core
    (Mathieu Poirier)

    - Support generating cross arch probes, i.e. if you specify a vmlinux
    file for different arch than the one in the host machine,

    $ perf probe --definition function_name args

    will generate the probe definition string needed to append to the
    target machine /sys/kernel/debug/tracing/kprobes_events file, using
    scripting (Masami Hiramatsu).

    - Allow configuring the default 'perf report -s' sort order in
    ~/.perfconfig, for instance, "sym,dso" may be more fitting for
    kernel developers. (Arnaldo Carvalho de Melo)

    - ... plus lots of other changes, refactorings, features and fixes"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (149 commits)
    perf tests: Add dwarf unwind test for powerpc
    perf probe: Match linkage name with mangled name
    perf probe: Fix to cut off incompatible chars from group name
    perf probe: Skip if the function address is 0
    perf probe: Ignore the error of finding inline instance
    perf intel-pt: Fix decoding when there are address filters
    perf intel-pt: Enable decoder to handle TIP.PGD with missing IP
    perf intel-pt: Read address filter from AUXTRACE_INFO event
    perf intel-pt: Record address filter in AUXTRACE_INFO event
    perf intel-pt: Add a helper function for processing AUXTRACE_INFO
    perf intel-pt: Fix missing error codes processing auxtrace_info
    perf intel-pt: Add support for recording the max non-turbo ratio
    perf intel-pt: Fix snapshot overlap detection decoder errors
    perf probe: Increase debug level of SDT debug messages
    perf record: Add support for using symbols in address filters
    perf symbols: Add dso__last_symbol()
    perf record: Fix error paths
    perf record: Rename label 'out_symbol_exit'
    perf script: Fix vanished idle symbols
    perf evsel: Add support for address filters
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - rwsem micro-optimizations (Davidlohr Bueso)

    - Improve the implementation and optimize the performance of
    percpu-rwsems. (Peter Zijlstra)

    - Convert all lglock users to better facilities such as percpu-rwsems
    or percpu-spinlocks and remove lglocks. (Peter Zijlstra)

    - Remove the ticket (spin)lock implementation. (Peter Zijlstra)

    - Korean translation of memory-barriers.txt and related fixes to the
    English document. (SeongJae Park)

    - misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    x86/cmpxchg, locking/atomics: Remove superfluous definitions
    x86, locking/spinlocks: Remove ticket (spin)lock implementation
    locking/lglock: Remove lglock implementation
    stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock()
    fs/locks: Use percpu_down_read_preempt_disable()
    locking/percpu-rwsem: Add down_read_preempt_disable()
    fs/locks: Replace lg_local with a per-cpu spinlock
    fs/locks: Replace lg_global with a percpu-rwsem
    locking/percpu-rwsem: Add DEFINE_STATIC_PERCPU_RWSEM and percpu_rwsem_assert_held()
    locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock()
    locking/rwsem, x86: Drop a bogus cc clobber
    futex: Add some more function commentry
    locking/hung_task: Show all locks
    locking/rwsem: Scan the wait_list for readers only once
    locking/rwsem: Remove a few useless comments
    locking/rwsem: Return void in __rwsem_mark_wake()
    locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()
    locking/Documentation: Add Korean translation
    locking/Documentation: Fix a typo of example result
    locking/Documentation: Fix wrong section reference
    ...

    Linus Torvalds
     
  • Pull core SMP updates from Ingo Molnar:
    "The two main changes are generic vCPU pinning and physical CPU
    SMP-call support, for Xen to be able to perform certain calls on
    specific physical CPUs - by Juergen Gross"

    * 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    smp: Allocate smp_call_on_cpu() workqueue on stack too
    hwmon: Use smp_call_on_cpu() for dell-smm i8k
    dcdbas: Make use of smp_call_on_cpu()
    xen: Add xen_pin_vcpu() to support calling functions on a dedicated pCPU
    smp: Add function to execute a function synchronously on a CPU
    virt, sched: Add generic vCPU pinning support
    xen: Sync xen header

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Expedited grace-period changes, most notably avoiding having user
    threads drive expedited grace periods, using a workqueue instead.

    - Miscellaneous fixes, including a performance fix for lists that was
    sent with the lists modifications.

    - CPU hotplug updates, most notably providing exact CPU-online
    tracking for RCU. This will in turn allow removal of the checks
    supporting RCU's prior heuristic that was based on the assumption
    that CPUs would take no longer than one jiffy to come online.

    - Torture-test updates.

    - Documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    list: Expand list_first_entry_or_null()
    torture: TOROUT_STRING(): Insert a space between flag and message
    rcuperf: Consistently insert space between flag and message
    rcutorture: Print out barrier error as document says
    torture: Add task state to writer-task stall printk()s
    torture: Convert torture_shutdown() to hrtimer
    rcutorture: Convert to hotplug state machine
    cpu/hotplug: Get rid of CPU_STARTING reference
    rcu: Provide exact CPU-online tracking for RCU
    rcu: Avoid redundant quiescent-state chasing
    rcu: Don't use modular infrastructure in non-modular code
    sched: Make wake_up_nohz_cpu() handle CPUs going offline
    rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
    rcu: Use RCU's online-CPU state for expedited IPI retry
    rcu: Exclude RCU-offline CPUs from expedited grace periods
    rcu: Make expedited RCU CPU stall warnings respond to controls
    rcu: Stop disabling expedited RCU CPU stall warnings
    rcu: Drive expedited grace periods from workqueue
    rcu: Consolidate expedited grace period machinery
    documentation: Record reason for rcu_head two-byte alignment
    ...

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "Traditionally, cpufreq is the area with the greatest number of
    changes, but there are fewer of them than last time. There also is
    some activity in the generic power domains and the devfreq frameworks,
    a couple of system suspend and hibernation fixes and some assorted
    changes in other places.

    One new feature is the cpufreq change to allow the scheduler to pass
    hints to the governors' utilization update callbacks and some code
    rework based on that. Another one is the support for domain removal in
    the generic power domains framework. Also it is now possible to use
    hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
    RockChip DFI controller and the rk3399 DMC.

    The rest of the changes is mostly fixes and cleanups in a number of
    places.

    Specifics:

    - Add a mechanism for passing hints from the scheduler to cpufreq
    governors via their utilization update callbacks and use it to
    introduce "IOwait boosting" into the schedutil governor and
    intel_pstate that will make them boost performance if the enqueued
    task was previously waiting on I/O (Rafael Wysocki).

    - Fix a schedutil governor problem that causes it to overestimate
    utilization if SMT is in use (Steve Muckle).

    - Update defconfigs trying to use the schedutil governor as a module
    which is not possible any more (Javier Martinez Canillas).

    - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
    boosting" into account (Srinivas Pandruvada).

    - Fix a problem in the cpufreq core causing it to mishandle the
    initialization of CPUs registered after the cpufreq driver (Viresh
    Kumar, Rafael Wysocki).

    - Make the cpufreq-dt driver support per-policy governor tunables,
    clean it up and update its Kconfig description (Viresh Kumar).

    - Add support for more ARM platforms to the cpufreq-dt driver
    (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

    - Make the cpufreq CPPC driver report frequencies in KHz to avoid
    user space compatibility issues (Al Stone, Hoan Tran).

    - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
    Ian King, Markus Elfring).

    - Constify some local structures in the intel_pstate driver (Julia
    Lawall).

    - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

    - Add support for PM domain removal to the generic power domains
    (genpd) framework, add new DT helper functions to it and make it
    always enable debugfs support if available (Jon Hunter, Tomeu
    Vizoso).

    - Clean up the generic power domains (genpd) framework and make it
    avoid measuring power-on and power-off latencies during system-wide
    PM transitions (Ulf Hansson).

    - Add support for the RockChip DFI controller and the rk3399 DMC to
    the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

    - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
    Stephen Rothwell).

    - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
    devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

    - Fix the system suspend interface to make suspend-to-idle work if
    platform suspend operations have not been registered (Sudeep
    Holla).

    - Make it possible to use hibernation with PAGE_POISONING_ZERO
    enabled (Anisse Astier).

    - Increase the default timeout of the system suspend/resume watchdog
    and make it depend on EXPERT (Chen Yu).

    - Make the operating performance points (OPP) framework avoid using
    OPPs that aren't supported by the platform and fix a build warning
    in it (Dave Gerlach, Arnd Bergmann).

    - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

    - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
    common logging style (Joe Perches)"

    * tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
    PM / OPP: Don't support OPP if it provides supported-hw but platform does not
    cpufreq: st: add missing \n to end of dev_err message
    cpufreq: kirkwood: add missing \n to end of dev_err messages
    PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
    PM / Domains: Don't measure latency of ->power_on|off() during system PM
    PM / Domains: Remove redundant system PM callbacks
    PM / Domains: Simplify detaching a device from its genpd
    PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
    PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
    PM / OPP: avoid maybe-uninitialized warning
    PM / Domains: Allow holes in genpd_data.domains array
    cpufreq: CPPC: Avoid overflow when calculating desired_perf
    cpufreq: ti: Use generic platdev driver
    cpufreq: intel_pstate: Add io_boost trace
    partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
    cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
    cpufreq: schedutil: Add iowait boosting
    cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
    PM / Domains: Add support for removing nested PM domains by provider
    PM / Domains: Add support for removing PM domains
    ...

    Linus Torvalds
     

03 Oct, 2016

2 commits

  • Pull arm64 updates from Will Deacon:
    "It's a bit all over the place this time with no "killer feature" to
    speak of. Support for mismatched cache line sizes should help people
    seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
    updates have been a long time coming, but a lot of the changes here
    are cleanups.

    We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
    workaround is acked by Russell, the DT/OF bits are acked by Rob, the
    arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
    jump_label by Peter (all CC'd).

    Summary:

    - Support for execute-only page permissions
    - Support for hibernate and DEBUG_PAGEALLOC
    - Support for heterogeneous systems with mismatched cache line sizes
    - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
    - arm64 PMU perf updates, including cpumasks for heterogeneous systems
    - Set UTS_MACHINE for building rpm packages
    - Yet another head.S tidy-up
    - Some cleanups and refactoring, particularly in the NUMA code
    - Lots of random, non-critical fixes across the board"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
    arm64: tlbflush.h: add __tlbi() macro
    arm64: Kconfig: remove SMP dependence for NUMA
    arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
    arm64: fix dump_backtrace/unwind_frame with NULL tsk
    arm/arm64: arch_timer: Use archdata to indicate vdso suitability
    arm64: arch_timer: Work around QorIQ Erratum A-008585
    arm64: arch_timer: Add device tree binding for A-008585 erratum
    arm64: Correctly bounds check virt_addr_valid
    arm64: migrate exception table users off module.h and onto extable.h
    arm64: pmu: Hoist pmu platform device name
    arm64: pmu: Probe default hw/cache counters
    arm64: pmu: add fallback probe table
    MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
    arm64: Improve kprobes test for atomic sequence
    arm64/kvm: use alternative auto-nop
    arm64: use alternative auto-nop
    arm64: alternative: add auto-nop infrastructure
    arm64: lse: convert lse alternatives NOP padding to use __nops
    arm64: barriers: introduce nops and __nops macros for NOP sequences
    arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
    ...

    Linus Torvalds
     
  • Three sets of overlapping changes. Nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Oct, 2016

3 commits

  • * pm-devfreq:
    PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
    PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
    partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
    PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
    Documentation: bindings: add dt documentation for rk3399 dmc
    PM / devfreq: event: support rockchip dfi controller
    Documentation: bindings: add dt documentation for dfi controller
    PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
    PM / devfreq: fix Kconfig indent style
    PM / devfreq: Add COMPILE_TEST for build coverage
    PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

    * pm-sleep:
    PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
    PM / sleep: enable suspend-to-idle even without registered suspend_ops
    PM / sleep: Increase default DPM watchdog timeout to 120

    Rafael J. Wysocki
     
  • * pm-cpufreq: (24 commits)
    cpufreq: st: add missing \n to end of dev_err message
    cpufreq: kirkwood: add missing \n to end of dev_err messages
    cpufreq: CPPC: Avoid overflow when calculating desired_perf
    cpufreq: ti: Use generic platdev driver
    cpufreq: intel_pstate: Add io_boost trace
    cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
    cpufreq: schedutil: Add iowait boosting
    cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
    cpufreq: CPPC: Force reporting values in KHz to fix user space interface
    cpufreq: create link to policy only for registered CPUs
    intel_pstate: constify local structures
    cpufreq: dt: Support governor tunables per policy
    cpufreq: dt: Update kconfig description
    cpufreq: dt: Remove unused code
    MAINTAINERS: Add Documentation/cpu-freq/
    cpufreq: dt: Add support for r8a7792
    cpufreq / sched: ignore SMT when determining max cpu capacity
    cpufreq: Drop unnecessary check from cpufreq_policy_alloc()
    ARM: multi_v7_defconfig: Don't attempt to enable schedutil governor as module
    ARM: exynos_defconfig: Don't attempt to enable schedutil governor as module
    ...

    Rafael J. Wysocki
     
  • Rafael J. Wysocki
     

01 Oct, 2016

1 commit

  • CAI Qian pointed out that the semantics
    of shared subtrees make it possible to create an exponentially
    increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

    This will create 2^20 or 1048576 mounts, which is a practical problem
    as some people have managed to hit this by accident.

    As such CVE-2016-6213 was assigned.

    Ian Kent described the situation for autofs users
    as follows:

    > The number of mounts for direct mount maps is usually not very large because of
    > the way they are implemented, large direct mount maps can have performance
    > problems. There can be anywhere from a few (likely case a few hundred) to less
    > than 10000, plus mounts that have been triggered and not yet expired.
    >
    > Indirect mounts have one autofs mount at the root plus the number of mounts that
    > have been triggered and not yet expired.
    >
    > The number of autofs indirect map entries can range from a few to the common
    > case of several thousand and in rare cases up to between 30000 and 50000. I've
    > not heard of people with maps larger than 50000 entries.
    >
    > The larger the number of map entries the greater the possibility for a large
    > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
    > more active mounts.

    So I am setting the default number of mounts allowed per mount
    namespace at 100,000. This is more than enough for any use case I
    know of, but small enough to quickly stop an exponential increase
    in mounts, which should be perfect for catching misconfigurations
    and malfunctioning programs.

    For anyone who needs a higher limit this can be changed by writing
    to the new /proc/sys/fs/mount-max sysctl.
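
    The arithmetic behind the limit can be sketched in a few lines of C.
    This is only a toy model: it assumes the mount count starts at 1 and
    doubles on every bind mount, as in the reproducer above. The helper
    names are hypothetical and none of this is kernel code.

```c
#include <stdint.h>

/* Default per-namespace limit quoted in the commit message. */
#define TOY_DEFAULT_MOUNT_MAX 100000u

/*
 * Mounts present after n bind mounts, ASSUMING one initial mount and a
 * doubling on every bind (simplified model of shared-subtree propagation).
 * Only valid for n < 64.
 */
static uint64_t toy_mounts_after_binds(unsigned int n)
{
	return (uint64_t)1 << n;
}

/*
 * Number of doublings that fit under the limit: the next bind mount
 * after this many would push the count past `limit`.
 */
static unsigned int toy_binds_allowed(uint64_t limit)
{
	unsigned int n = 0;

	while (n < 63 && toy_mounts_after_binds(n + 1) <= limit)
		n++;
	return n;
}
```

    Under this model 20 bind mounts give 1048576 mounts, while a limit of
    100,000 stops the growth after 16 doublings (65536 mounts), because
    the 17th would need 131072.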

    Tested-by: CAI Qian
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

30 Sep, 2016

9 commits

  • Get the cr4 fixes so we can apply the final cleanup

    Thomas Gleixner
     
  • The code performing irqtime nsecs stats flushing to kcpustat is roughly
    the same for hardirq and softirq. So let's consolidate that common code.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The irqtime accounting currently implements its own ad hoc version
    of the u64_stats API. Let's rather consolidate it with the appropriate
    library.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-5-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The callers of the functions performing irqtime kcpustat updates have
    IRQs disabled, so there is no need to disable them again.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We can safely use the preempt-unsafe accessors for irqtime when we
    flush its counters to kcpustat as IRQs are disabled at this time.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Rik van Riel
    Cc: Eric Dumazet
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1474849761-12678-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • While going through enqueue/dequeue to review the movement of
    set_curr_task() I noticed that the (2nd) update_min_vruntime() call in
    dequeue_entity() is suspect.

    It turns out it's actually wrong because it will consider
    cfs_rq->curr, which could be the entry we just normalized. This mixes
    different vruntime forms and leads to failure.

    The purpose of the second update_min_vruntime() is to move
    min_vruntime forward if the entity we just removed is the one that was
    holding it back; _except_ for the DEQUEUE_SAVE case, because then we
    know it's a temporary removal and it will come back.

    However, since we do put_prev_task() _after_ dequeue(), cfs_rq->curr
    will still be set (and per the above, can be transformed into a
    different unit), so update_min_vruntime() should also consider
    curr->on_rq. This also fixes another corner case where the enqueue
    (which also does update_curr()->update_min_vruntime()) happens on the
    rq->lock break in schedule(), between dequeue and put_prev_task.
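
    The rule described above can be illustrated with a small userspace
    sketch. It is a simplified model under stated assumptions, not the
    kernel function: the struct, the helper names and the idea of passing
    curr and the leftmost queued entity as plain pointers are all
    hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a scheduling entity; not the kernel's sched_entity. */
struct toy_entity {
	uint64_t vruntime;
	bool     on_rq;
};

/* Wraparound-tolerant comparisons on unsigned virtual runtimes. */
static uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) > 0 ? b : a;
}

static uint64_t toy_min_vruntime(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0 ? b : a;
}

/*
 * Return the new min_vruntime. `curr` is only taken into account when it
 * is still on the runqueue; a dequeued (possibly normalized) curr is
 * ignored so that its vruntime cannot be mixed in. `leftmost` is the
 * queued entity with the smallest vruntime, or NULL if none is queued.
 */
static uint64_t toy_update_min_vruntime(uint64_t min_vruntime,
					const struct toy_entity *curr,
					const struct toy_entity *leftmost)
{
	uint64_t candidate = min_vruntime;

	if (curr && !curr->on_rq)
		curr = NULL;
	if (curr)
		candidate = curr->vruntime;
	if (leftmost)
		candidate = curr ?
			toy_min_vruntime(candidate, leftmost->vruntime) :
			leftmost->vruntime;

	/* min_vruntime only ever moves forward. */
	return toy_max_vruntime(min_vruntime, candidate);
}
```

    In this model a dequeued curr with a small, normalized vruntime no
    longer holds min_vruntime back: only the leftmost queued entity
    decides how far it may advance.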

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Fixes: 1e876231785d ("sched: Fix ->min_vruntime calculation in dequeue_entity()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide SCHED_WARN_ON as wrapper for WARN_ON_ONCE() to avoid
    CONFIG_SCHED_DEBUG wrappery.
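
    A userspace analogue of such a wrapper might look like the sketch
    below. The TOY_ names, the warning counter and the fprintf() call are
    assumptions made for illustration; the point is only that callers no
    longer need their own #ifdef around the check.

```c
#include <stdio.h>

/* Hypothetical counter so the effect of the macro can be observed. */
static int toy_warnings_printed;

/* Stand-in for the debug config switch; enabled by default here. */
#ifndef TOY_SCHED_DEBUG
#define TOY_SCHED_DEBUG 1
#endif

#if TOY_SCHED_DEBUG
/* Warn at most once per call site when the condition is true. */
#define TOY_SCHED_WARN_ON(cond)                                   \
	do {                                                      \
		static int warned_;                               \
		if ((cond) && !warned_) {                         \
			warned_ = 1;                              \
			toy_warnings_printed++;                   \
			fprintf(stderr, "warning: %s\n", #cond);  \
		}                                                 \
	} while (0)
#else
/* Debug disabled: evaluate the condition and discard the result. */
#define TOY_SCHED_WARN_ON(cond) ((void)(cond))
#endif

/* Example caller: no #ifdef is needed at the call site. */
static void toy_check_not_current(int task_is_current)
{
	TOY_SCHED_WARN_ON(task_is_current);
}
```

    Calling toy_check_not_current(1) repeatedly emits a single warning,
    because the once-only flag belongs to the call site inside that
    function.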

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Almost all scheduler functions update state with the following
    pattern:

    if (queued)
        dequeue_task(rq, p, DEQUEUE_SAVE);
    if (running)
        put_prev_task(rq, p);

    /* update state */

    if (queued)
        enqueue_task(rq, p, ENQUEUE_RESTORE);
    if (running)
        set_curr_task(rq, p);

    set_user_nice(), however, misses the running part; fix this.

    This was found by asserting we never enqueue 'current'.
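
    The pattern can be exercised outside the kernel with a toy model.
    Everything below is a hypothetical stand-in (the struct, the toy_
    helpers and the prio field); it only shows the save, update, restore
    shape, including the running half that set_user_nice() was missing.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a task; NOT the kernel's task_struct or rq. */
struct toy_task {
	int  prio;
	bool queued;	/* sitting on the runqueue */
	bool running;	/* currently the runqueue's curr */
};

static void toy_dequeue_task(struct toy_task *p)  { p->queued = false; }
static void toy_enqueue_task(struct toy_task *p)  { p->queued = true; }
static void toy_put_prev_task(struct toy_task *p) { p->running = false; }
static void toy_set_curr_task(struct toy_task *p) { p->running = true; }

/* Change a scheduling attribute using the full four-step pattern. */
static void toy_set_prio(struct toy_task *p, int prio)
{
	bool queued = p->queued;
	bool running = p->running;

	if (queued)
		toy_dequeue_task(p);
	if (running)
		toy_put_prev_task(p);

	/* State is only updated while the task is neither queued nor curr. */
	assert(!p->queued && !p->running);
	p->prio = prio;

	if (queued)
		toy_enqueue_task(p);
	if (running)
		toy_set_curr_task(p);
}
```

    Dropping the two running branches from toy_set_prio() would trip the
    assertion for a task that is currently running, which loosely mirrors
    how the missing step was noticed.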

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that the ia64-only set_curr_task() symbol is gone, provide a
    helper just like put_prev_task().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra