05 Sep, 2015

2 commits

  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accessed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machines get
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.
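
    As a rough illustration of the batching described above, the following
    user-space C sketch accumulates the set of CPUs that may hold stale
    entries and counts how many would be interrupted by one deferred flush.
    The names tlb_batch, batch_add_pending and batch_flush are hypothetical,
    the 64-CPU bitmask is an assumption made for brevity, and this is not the
    code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
struct tlb_batch {
    uint64_t cpumask;     /* CPUs that may still cache unmapped entries */
    bool flush_required;  /* true once at least one flush was deferred */
};

/* Record an unmap without flushing; no IPI is sent at this point. */
void batch_add_pending(struct tlb_batch *b, uint64_t mm_cpumask)
{
    assert(b != NULL);
    b->cpumask |= mm_cpumask;
    b->flush_required = true;
}

/*
 * Flush once for the whole batch. Returns the number of CPUs that would
 * each receive a single full-TLB-flush IPI, then resets the batch.
 */
unsigned int batch_flush(struct tlb_batch *b)
{
    unsigned int ipis = 0;

    assert(b != NULL);
    if (!b->flush_required)
        return 0;

    /* Count set bits: each one is a CPU to interrupt exactly once. */
    for (uint64_t m = b->cpumask; m != 0; m &= m - 1)
        ipis++;

    b->cpumask = 0;
    b->flush_required = false;
    return ipis;
}
```

    In this model, three unmaps touching CPUs {0,1}, {1,2} and {0} lead to
    one flush that interrupts three CPUs once each, rather than a round of
    IPIs per unmapped page.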

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                              4.2.0-rc1         4.2.0-rc1
                                                vanilla      flushfull-v7
    Ops lru-file-mmap-read-elapsed     159.62 (  0.00%)  120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range   30.59 (  0.00%)    2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv    6.70 (  0.00%)    0.64 ( 90.38%)

                4.2.0-rc1     4.2.0-rc1
                  vanilla  flushfull-v7
    User           581.00        611.43
    System        5804.93       4111.76
    Elapsed        161.03        122.12

    This is showing that the readers completed 24.40% faster with 29% less
    system CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test and the patched kernel was interrupted 180K times per second.

    The impact is lower on a single socket machine.

                                              4.2.0-rc1         4.2.0-rc1
                                                vanilla      flushfull-v7
    Ops lru-file-mmap-read-elapsed      25.33 (  0.00%)   20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range    0.91 (  0.00%)    1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv    0.28 (  0.00%)    0.47 (-65.34%)

                4.2.0-rc1     4.2.0-rc1
                  vanilla  flushfull-v7
    User            58.09         57.64
    System         111.82         76.56
    Elapsed         27.29         22.55

    It's still a noticeable improvement with vmstat showing interrupts went
    from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads with no memory pressure or with
    relatively few mapped pages. It will have an unpredictable impact on the
    workload running on the CPU being flushed as it'll depend on how many TLB
    entries need to be refilled and how long that takes. Worst case, the TLB
    will be completely cleared of active entries when the target PFNs were not
    resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This allows userfaultfd to be selected during configuration so that it is built.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Sep, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - a new PIDs controller is added. It turns out that PIDs are actually
    an independent resource from kmem due to the limited PID space.

    - more core preparations for the v2 interface. Once cpu side interface
    is settled, it should be ready for lifting the devel mask.
    for-4.3-unified-base was temporarily branched so that other trees
    (block) can pull cgroup core changes that blkcg changes depend on.

    - a non-critical idr_preload usage bug fix.

    * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: pids: fix invalid get/put usage
    cgroup: introduce cgroup_subsys->legacy_name
    cgroup: don't print subsystems for the default hierarchy
    cgroup: make cftype->private a unsigned long
    cgroup: export cgrp_dfl_root
    cgroup: define controller file conventions
    cgroup: fix idr_preload usage
    cgroup: add documentation for the PIDs controller
    cgroup: implement the PIDs subsystem
    cgroup: allow a cgroup subsystem to reject a fork

    Linus Torvalds
     

12 Aug, 2015

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - The combination of tree geometry-initialization simplifications
    and OS-jitter-reduction changes to expedited grace periods.
    These two are stacked due to the large number of conflicts
    that would otherwise result.

    [ With one addition, a temporary commit to silence a lockdep false
    positive. Additional changes to the expedited grace-period
    primitives (queued for 4.4) remove the cause of this false
    positive, and therefore include a revert of this temporary commit. ]

    - Documentation updates.

    - Torture-test updates.

    - Miscellaneous fixes.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

07 Aug, 2015

1 commit

  • Dave Hansen reported the following:

    My laptop has been behaving strangely with 4.2-rc2. Once I log
    in to my X session, I start getting all kinds of strange errors
    from applications and see this in my dmesg:

    VFS: file-max limit 8192 reached

    The problem is that the file-max is calculated before memory is fully
    initialised and miscalculates how much memory the kernel is using. This
    patch recalculates file-max after deferred memory initialisation. Note
    that using memory hotplug infrastructure would not have avoided this
    problem as the value is not recalculated after memory hot-add.

    4.1: files_stat.max_files = 6582781
    4.2-rc2: files_stat.max_files = 8192
    4.2-rc2 patched: files_stat.max_files = 6562467

    Small differences with the patch applied and 4.1 but not enough to matter.
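
    The value 8192 in the report is consistent with a computed limit falling
    back to a fixed floor when too little memory is visible. A minimal
    sketch, assuming the limit is about 10% of usable memory at roughly 1 KiB
    per open file, with a floor of 8192. The helper name calc_max_files and
    the exact formula are assumptions for illustration and are not taken
    from the patch.

```c
#include <assert.h>

/* Assumed floor for the limit; matches the 8192 seen in the report. */
#define NR_FILE 8192UL

/*
 * Hypothetical helper: derive a file-max value from the number of usable
 * pages. If this runs before deferred memory initialisation finishes,
 * usable_pages is small and the result collapses to the floor.
 */
unsigned long calc_max_files(unsigned long usable_pages,
                             unsigned long page_size)
{
    unsigned long n;

    assert(page_size >= 1024);
    /* About 1 KiB per file, using at most 10% of memory. */
    n = usable_pages * (page_size / 1024) / 10;
    return n > NR_FILE ? n : NR_FILE;
}
```

    Under these assumptions, a machine that sees only 1000 pages gets the
    floor, whereas the same calculation repeated once all memory is
    initialised yields a value that scales with RAM.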

    Signed-off-by: Mel Gorman
    Reported-by: Dave Hansen
    Cc: Nicolai Stange
    Cc: Dave Hansen
    Cc: Alex Ng
    Cc: Fengguang Wu
    Cc: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Jul, 2015

1 commit


15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.
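
    The fork-time check described above can be pictured as a hierarchical
    counter. The following user-space C sketch is only a model: pids_group,
    pids_try_charge and pids_attach are hypothetical names, and the attach
    helper ignores the source group for brevity. It shows a fork being
    rejected when any ancestor would exceed its limit, while an attach is
    always accounted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one node in a PIDs hierarchy. */
struct pids_group {
    struct pids_group *parent;
    long count;   /* tasks charged to this group and its descendants */
    long limit;   /* maximum allowed by fork; negative means no limit */
};

/* Fork path: charge grp and every ancestor, undoing on failure. */
bool pids_try_charge(struct pids_group *grp)
{
    struct pids_group *p, *q;

    assert(grp != NULL);
    for (p = grp; p != NULL; p = p->parent) {
        p->count++;
        if (p->limit >= 0 && p->count > p->limit)
            goto revert;
    }
    return true;

revert:
    /* Undo the increments made on grp..p, inclusive. */
    for (q = grp; ; q = q->parent) {
        q->count--;
        if (q == p)
            break;
    }
    return false;
}

/* Attach path: always succeeds, even if it pushes count over the limit. */
void pids_attach(struct pids_group *grp)
{
    struct pids_group *p;

    assert(grp != NULL);
    for (p = grp; p != NULL; p = p->parent)
        p->count++;
}
```

    With a child group limited to 2 tasks, a third pids_try_charge() fails
    and leaves the counts unchanged, but pids_attach() still raises the
    count to 3, after which further forks keep failing.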

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

07 Jul, 2015

1 commit


04 Jul, 2015

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Debug info and other statistics fixes and related enhancements"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix numa balancing stats in /proc/pid/sched
    sched/numa: Show numa_group ID in /proc/sched_debug task listings
    sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
    sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
    sched/stat: Simplify the sched_info accounting dependency

    Linus Torvalds
     
  • Pull max log buf size increase from Ingo Molnar:
    "Ran into this limit recently, so increase it by an order of magnitude"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25

    Linus Torvalds
     
  • Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
    sched_info, which results in ugly #if clauses.

    Simplify the code by introducing a synthetic CONFIG_SCHED_INFO
    switch, selected by both.

    Signed-off-by: Naveen N. Rao
    Cc: Balbir Singh
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: a.p.zijlstra@chello.nl
    Cc: ricklind@us.ibm.com
    Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Naveen N. Rao
     

02 Jul, 2015

2 commits

  • Merge third patchbomb from Andrew Morton:

    - the rest of MM

    - scripts/gdb updates

    - ipc/ updates

    - lib/ updates

    - MAINTAINERS updates

    - various other misc things

    * emailed patches from Andrew Morton: (67 commits)
    genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
    genalloc: rename dev_get_gen_pool() to gen_pool_get()
    x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
    MAINTAINERS: add zpool
    MAINTAINERS: BCACHE: Kent Overstreet has changed email address
    MAINTAINERS: move Jens Osterkamp to CREDITS
    MAINTAINERS: remove unused nbd.h pattern
    MAINTAINERS: update brcm gpio filename pattern
    MAINTAINERS: update brcm dts pattern
    MAINTAINERS: update sound soc intel patterns
    MAINTAINERS: remove website for paride
    MAINTAINERS: update Emulex ocrdma email addresses
    bcache: use kvfree() in various places
    libcxgbi: use kvfree() in cxgbi_free_big_mem()
    target: use kvfree() in session alloc and free
    IB/ehca: use kvfree() in ipz_queue_{cd}tor()
    drm/nouveau/gem: use kvfree() in u_free()
    drm: use kvfree() in drm_free_large()
    cxgb4: use kvfree() in t4_free_mem()
    cxgb3: use kvfree() in cxgb_free_mem()
    ...

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "Main excitement here is Peter Zijlstra's lockless rbtree optimization
    to speed module address lookup. He found some abusers of the module
    lock doing that too.

    A little bit of parameter work here too; including Dan Streetman's
    breaking up the big param mutex so writing a parameter can load
    another module (yeah, really). Unfortunately that broke the usual
    suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
    appended too"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
    modules: only use mod->param_lock if CONFIG_MODULES
    param: fix module param locks when !CONFIG_SYSFS.
    rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
    module: add per-module param_lock
    module: make perm const
    params: suppress unused variable error, warn once just in case code changes.
    modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
    kernel/module.c: avoid ifdefs for sig_enforce declaration
    kernel/workqueue.c: remove ifdefs over wq_power_efficient
    kernel/params.c: export param_ops_bool_enable_only
    kernel/params.c: generalize bool_enable_only
    kernel/module.c: use generic module param operaters for sig_enforce
    kernel/params: constify struct kernel_param_ops uses
    sysfs: tightened sysfs permission checks
    module: Rework module_addr_{min,max}
    module: Use __module_address() for module_address_lookup()
    module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
    module: Optimize __module_address() using a latched RB-tree
    rbtree: Implement generic latch_tree
    seqlock: Introduce raw_read_seqcount_latch()
    ...

    Linus Torvalds
     

01 Jul, 2015

2 commits

  • So I tried to do some kernel debugging that produced a ton of kernel messages
    on a big box, and wanted to save them all: but CONFIG_LOG_BUF_SHIFT maxes
    out at 21 (2 MB).

    Increase it to 25 (32 MB).

    This does not affect any existing config or defaults.
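
    The sizes in parentheses follow from the buffer length being two raised
    to the shift value. A small sketch of that arithmetic, where
    log_buf_bytes is a hypothetical helper and not a kernel function:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: buffer size in bytes for a given shift value. */
size_t log_buf_bytes(unsigned int shift)
{
    /* Guard against shifting by the full width of size_t or more. */
    assert(shift < sizeof(size_t) * 8);
    return (size_t)1 << shift;
}
```

    A shift of 21 gives 2097152 bytes (2 MB) and a shift of 25 gives
    33554432 bytes (32 MB), a sixteen-fold increase in the maximum.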

    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Waiman Long reported that 24TB machines hit OOM during basic setup when
    struct page initialisation was deferred. One approach is to initialise
    memory on demand but it interferes with page allocator paths. This patch
    creates dedicated threads to initialise memory before basic setup. It
    then blocks on a rw_semaphore until completion as a wait_queue and counter
    is overkill. This may be slower to boot but it's simpler overall and
    also gets rid of a section mangling which existed so kswapd could do the
    initialisation.

    [akpm@linux-foundation.org: include rwsem.h, use DECLARE_RWSEM, fix comment, remove unneeded cast]
    Signed-off-by: Mel Gorman
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Scott Norton
    Tested-by: Daniel J Blueman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jun, 2015

3 commits

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Here is the driver core / firmware changes for 4.2-rc1.

    A number of small changes all over the place in the driver core, and
    in the firmware subsystem. Nothing really major, full details in the
    shortlog. Some of it is a bit of churn, given that the platform
    driver probing changes was found to not work well, so they were
    reverted.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
    Revert "base/platform: Only insert MEM and IO resources"
    Revert "base/platform: Continue on insert_resource() error"
    Revert "of/platform: Use platform_device interface"
    Revert "base/platform: Remove code duplication"
    firmware: add missing kfree for work on async call
    fs: sysfs: don't pass count == 0 to bin file readers
    base:dd - Fix for typo in comment to function driver_deferred_probe_trigger().
    base/platform: Remove code duplication
    of/platform: Use platform_device interface
    base/platform: Continue on insert_resource() error
    base/platform: Only insert MEM and IO resources
    firmware: use const for remaining firmware names
    firmware: fix possible use after free on name on asynchronous request
    firmware: check for file truncation on direct firmware loading
    firmware: fix __getname() missing failure check
    drivers: of/base: move of_init to driver_init
    drivers/base: cacheinfo: fix annoying typo when DT nodes are absent
    sysfs: disambiguate between "error code" and "failure" in comments
    driver-core: fix build for !CONFIG_MODULES
    driver-core: make __device_attach() static
    ...

    Linus Torvalds
     
  • Merge second patchbomb from Andrew Morton:

    - most of the rest of MM

    - lots of misc things

    - procfs updates

    - printk feature work

    - updates to get_maintainer, MAINTAINERS, checkpatch

    - lib/ updates

    * emailed patches from Andrew Morton: (96 commits)
    exit,stats: /* obey this comment */
    coredump: add __printf attribute to cn_*printf functions
    coredump: use from_kuid/kgid when formatting corename
    fs/reiserfs: remove unneeded cast
    NILFS2: support NFSv2 export
    fs/befs/btree.c: remove unneeded initializations
    fs/minix: remove unneeded cast
    init/do_mounts.c: add create_dev() failure log
    kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
    fs/efs: femove unneeded cast
    checkpatch: emit "NOTE: " message only once after multiple files
    checkpatch: emit an error when there's a diff in a changelog
    checkpatch: validate MODULE_LICENSE content
    checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
    checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
    checkpatch: fix processing of MEMSET issues
    checkpatch: suggest using ether_addr_equal*()
    checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
    checkpatch: remove local from codespell path
    checkpatch: add --showfile to allow input via pipe to show filenames
    ...

    Linus Torvalds
     

26 Jun, 2015

3 commits

  • If the create_dev() function fails to create the root mount device
    (/dev/root), the kernel panics because the root device is not found, but
    nothing is printed in this case. So I have added a log message for when
    it fails to create the root device. It will help in debugging.

    [akpm@linux-foundation.org: simplify printk(), use pr_emerg(), display errno]
    Signed-off-by: Vishnu Pratap Singh
    Acked-by: Pavel Machek
    Cc: Paul Gortmaker
    Cc: Mike Snitzer
    Cc: Dan Ehrenberg
    Cc: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vishnu Pratap Singh
     
  • Commit 818411616baf ("fs, proc: introduce /proc/<pid>/task/<tid>/children
    entry") introduced the children entry for checkpoint restore and the
    file is only available on kernels configured with CONFIG_EXPERT and
    CONFIG_CHECKPOINT_RESTORE.

    This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
    because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
    But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

    However, the children proc file is useful outside of checkpoint restore.
    I would like to use it in rkt. The rkt process exec()s another program
    it does not control, and that other program will fork()+exec() a child
    process. I would like to find the pid of the child process from an
    external tool without iterating in /proc over all processes to find
    which one has a parent pid equal to rkt.

    This commit introduces CONFIG_PROC_CHILDREN and makes
    CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
    /proc/<pid>/task/<tid>/children without needing to enable
    CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

    Alban tested that /proc/<pid>/task/<tid>/children is present when the
    kernel is configured with CONFIG_PROC_CHILDREN=y but without
    CONFIG_CHECKPOINT_RESTORE
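
    For an external tool such as the one described above, consuming the file
    could look roughly like the following C sketch. It assumes the file
    holds whitespace-separated decimal PIDs; children_path and
    parse_children are hypothetical helper names, and reading the file
    itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Build the path of the children file for a given pid and tid. */
int children_path(char *buf, size_t len, int pid, int tid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "/proc/%d/task/%d/children", pid, tid);
}

/*
 * Parse the file contents (assumed: decimal PIDs separated by
 * whitespace) into pids[], storing at most max entries.
 * Returns the number of PIDs stored.
 */
size_t parse_children(const char *text, long *pids, size_t max)
{
    size_t n = 0;
    char *end;

    assert(text != NULL && pids != NULL);
    while (n < max) {
        long pid = strtol(text, &end, 10);

        if (end == text)
            break;          /* no further number could be parsed */
        pids[n++] = pid;
        text = end;
    }
    return n;
}
```

    This avoids walking every entry in /proc to compare parent PIDs: the
    tool builds one path for the known parent, reads it, and parses the
    result.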

    Signed-off-by: Iago López Galeiras
    Tested-by: Alban Crequy
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Djalal Harouni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iago López Galeiras
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

24 Jun, 2015

1 commit

  • Pull power management and ACPI updates from Rafael Wysocki:
    "The rework of backlight interface selection API from Hans de Goede
    stands out from the number of commits and the number of affected
    places perspective. The cpufreq core fixes from Viresh Kumar are
    quite significant too as far as the number of commits goes and because
    they should reduce CPU online/offline overhead quite a bit in the
    majority of cases.

    From the new features point of view, the ACPICA update (to upstream
    revision 20150515) adding support for new ACPI 6 material to ACPICA is
    the one that matters the most as some new significant features will be
    based on it going forward. Also included is an update of the ACPI
    device power management core to follow ACPI 6 (which in turn reflects
    the Windows' device PM implementation), a PM core extension to support
    wakeup interrupts in a more generic way and support for the ACPI _CCA
    device configuration object.

    The rest is mostly fixes and cleanups all over and some documentation
    updates, including new DT bindings for Operating Performance Points.

    There is one fix for a regression introduced in the 4.1 cycle, but it
    adds quite a number of lines of code, it wasn't really ready before
    Thursday and you were on vacation, so I refrained from pushing it at
    the last minute for 4.1.

    Specifics:

    - ACPICA update to upstream revision 20150515 including basic support
    for ACPI 6 features: new ACPI tables introduced by ACPI 6 (STAO,
    XENV, WPBT, NFIT, IORT), changes related to the other tables (DTRM,
    FADT, LPIT, MADT), new predefined names (_BTH, _CR3, _DSD, _LPI,
    _MTL, _PRR, _RDI, _RST, _TFP, _TSN), fixes and cleanups (Bob Moore,
    Lv Zheng).

    - ACPI device power management core code update to follow ACPI 6
    which reflects the ACPI device power management implementation in
    Windows (Rafael J Wysocki).

    - rework of the backlight interface selection logic to reduce the
    number of kernel command line options and improve the handling of
    DMI quirks that may be involved in that and to make the code
    generally more straightforward (Hans de Goede).

    - fixes for the ACPI Embedded Controller (EC) driver related to the
    handling of EC transactions (Lv Zheng).

    - fix for a regression related to the ACPI resources management and
    resulting from a recent change of ACPI initialization code ordering
    (Rafael J Wysocki).

    - fix for a system initialization regression related to ACPI
    introduced during the 3.14 cycle and caused by running the code
    that switches the platform over to the ACPI mode too early in the
    initialization sequence (Rafael J Wysocki).

    - support for the ACPI _CCA device configuration object related to
    DMA cache coherence (Suravee Suthikulpanit).

    - ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).

    - ACPI battery driver cleanups (Luis Henriques, Mathias Krause).

    - ACPI processor driver cleanups (Hanjun Guo).

    - cleanups and documentation update related to the ACPI device
    properties interface based on _DSD (Rafael J Wysocki).

    - ACPI device power management fixes (Rafael J Wysocki).

    - assorted cleanups related to ACPI (Dominik Brodowski, Fabian
    Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).

    - fix for a long-standing issue causing General Protection Faults to
    be generated occasionally on return to user space after resume from
    ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).

    - fix to make the suspend core code return -EBUSY consistently in all
    cases when system suspend is aborted due to wakeup detection (Ruchi
    Kandoi).

    - support for automated device wakeup IRQ handling allowing drivers
    to make their PM support more straightforward (Tony Lindgren).

    - new tracepoints for suspend-to-idle tracing and rework of the
    prepare/complete callbacks tracing in the PM core (Todd E Brandt,
    Rafael J Wysocki).

    - wakeup sources framework enhancements (Jin Qian).

    - new macro for noirq system PM callbacks (Grygorii Strashko).

    - assorted cleanups related to system suspend (Rafael J Wysocki).

    - cpuidle core cleanups to make the code more efficient (Rafael J
    Wysocki).

    - powernv/pseries cpuidle driver update (Shilpasri G Bhat).

    - cpufreq core fixes related to CPU online/offline that should reduce
    the overhead of these operations quite a bit, unless the CPU in
    question is physically going away (Viresh Kumar, Saravana Kannan).

    - serialization of cpufreq governor callbacks to avoid race
    conditions in some cases (Viresh Kumar).

    - intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
    Bhargava, Joe Konno).

    - cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
    Holla, Felipe Balbi, Tang Yuantian).

    - assorted cleanups in cpufreq drivers and core (Shailendra Verma,
    Fabian Frederick, Wang Long).

    - new Device Tree bindings for representing Operating Performance
    Points (Viresh Kumar).

    - updates for the common clock operations support code in the PM core
    (Rajendra Nayak, Geert Uytterhoeven).

    - PM domains core code update (Geert Uytterhoeven).

    - Intel Knights Landing support for the RAPL (Running Average Power
    Limit) power capping driver (Dasaratharaman Chandramouli).

    - fixes related to the floor frequency setting on Atom SoCs in the
    RAPL power capping driver (Ajay Thomas).

    - runtime PM framework documentation update (Ben Dooks).

    - cpupower tool fix (Herton R Krzesinski)"

    * tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (194 commits)
    cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
    x86: Load __USER_DS into DS/ES after resume
    PM / OPP: Add binding for 'opp-suspend'
    PM / OPP: Allow multiple OPP tables to be passed via DT
    PM / OPP: Add new bindings to address shortcomings of existing bindings
    ACPI: Constify ACPI device IDs in documentation
    ACPI / enumeration: Document the rules regarding the PRP0001 device ID
    ACPI / video: Make acpi_video_unregister_backlight() private
    acpi-video-detect: Remove old API
    toshiba-acpi: Port to new backlight interface selection API
    thinkpad-acpi: Port to new backlight interface selection API
    sony-laptop: Port to new backlight interface selection API
    samsung-laptop: Port to new backlight interface selection API
    msi-wmi: Port to new backlight interface selection API
    msi-laptop: Port to new backlight interface selection API
    intel-oaktrail: Port to new backlight interface selection API
    ideapad-laptop: Port to new backlight interface selection API
    fujitsu-laptop: Port to new backlight interface selection API
    eeepc-laptop: Port to new backlight interface selection API
    dell-wmi: Port to new backlight interface selection API
    ...

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Andreas turned this option on, only to find out Debian (and Ubuntu!)
    don't enable support in their kmod builds.

    Shorten the text, and suggest N at the bottom (at least for now).

    Reported-by: Andreas Mohr
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Pull perf updates from Ingo Molnar:
    "Kernel side changes mostly consist of work on x86 PMU drivers:

    - x86 Intel PT (hardware CPU tracer) improvements (Alexander
    Shishkin)

    - x86 Intel CQM (cache quality monitoring) improvements (Thomas
    Gleixner)

    - x86 Intel PEBSv3 support (Peter Zijlstra)

    - x86 Intel PEBS interrupt batching support for lower overhead
    sampling (Zheng Yan, Kan Liang)

    - x86 PMU scheduler fixes and improvements (Peter Zijlstra)

    There are too many tooling improvements to list them all - here are a
    few select highlights:

    'perf bench':

    - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
    measure parallel waker threads generating contention for kernel
    locks (hb->lock). (Davidlohr Bueso)

    'perf top', 'perf report':

    - Allow disabling/enabling events dynamically in 'perf top':
    a 'perf top' session can instantly become a 'perf report'
    one, i.e. going from dynamic analysis to a static one.
    Returning to a dynamic one is possible: to toggle the
    modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

    - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
    perf.data files (Namhyung Kim)

    'perf probe': (Masami Hiramatsu)

    - Support glob wildcards for function name
    - Support $params special probe argument: Collect all function arguments
    - Make --line checks validate C-style function name.
    - Add --no-inlines option to avoid searching inline functions
    - Greatly speed up 'perf probe --list' by caching debuginfo.
    - Improve --filter support for 'perf probe', allowing using its arguments
    on other commands, as --add, --del, etc.

    'perf sched':

    - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

    Plus tons of infrastructure work - in particular preparation for
    upcoming threaded perf report support, but also lots of other work -
    and fixes and other improvements. See (much) more details in the
    shortlog and in the git log"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
    perf tools: Configurable per thread proc map processing time out
    perf tools: Add time out to force stop proc map processing
    perf report: Fix sort__sym_cmp to also compare end of symbol
    perf hists browser: React to unassigned hotkey pressing
    perf top: Tell the user how to unfreeze events after pressing 'f'
    perf hists browser: Honour the help line provided by builtin-{top,report}.c
    perf hists browser: Do not exit when 'f' is pressed in 'report' mode
    perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
    perf annotate: Rename source_line_percent to source_line_samples
    perf annotate: Display total number of samples with --show-total-period
    perf tools: Ensure thread-stack is flushed
    perf top: Allow disabling/enabling events dynamicly
    perf evlist: Add toggle_enable() method
    perf trace: Fix race condition at the end of started workloads
    perf probe: Speed up perf probe --list by caching debuginfo
    perf probe: Show usage even if the last event is skipped
    perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
    perf tools: Fix a problem when opening old perf.data with different byte order
    perf tools: Ignore .config-detected in .gitignore
    perf probe: Fix to return error if no probe is added
    ...

    Linus Torvalds
     

11 Jun, 2015

1 commit

  • Commit 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before
    timekeeping_init()" moved the ACPI subsystem initialization,
    including the ACPI mode enabling, to an earlier point in the
    initialization sequence, to allow the timekeeping subsystem
    to use ACPI early. Unfortunately, that resulted in boot regressions
    on some systems and the early ACPI initialization was moved toward
    its original position in the kernel initialization code by commit
    c4e1acbb35e4 "ACPI / init: Invoke early ACPI initialization later".

    However, that turns out to be insufficient, as boot is still broken
    on the Tyan S8812 mainboard.

    To fix that issue, split the ACPI early initialization code into
    two pieces so the majority of it is still located in acpi_early_init()
    and the part switching over the platform into the ACPI mode goes into
    a new function, acpi_subsystem_init(), executed at the original early
    ACPI initialization spot.

    That fixes the Tyan S8812 boot problem, but still allows ACPI
    tables to be loaded earlier which is useful to the EFI code in
    efi_enter_virtual_mode().

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141
    Fixes: 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()"
    Reported-and-tested-by: Marius Tolzmann
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Toshi Kani
    Reviewed-by: Hanjun Guo
    Reviewed-by: Lee, Chun-Yi

    Rafael J. Wysocki
     

02 Jun, 2015

1 commit

  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled(), which determines whether both the bdi and the fs
    of a given inode support cgroup writeback, is added.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
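
The "both sides must opt in" check described in the commit above can be illustrated with a small userspace model. This is only a sketch: the struct layouts, the flag values and the `_model` suffixes are assumptions made for illustration, and the real inode_cgwb_enabled() may test additional conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag values for illustration; not the kernel's values. */
#define BDI_CAP_CGROUP_WRITEBACK 0x1u
#define FS_CGROUP_WRITEBACK      0x2u

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct bdi_model { unsigned int capabilities; };
struct fs_model  { unsigned int fs_flags; };

struct inode_model {
    const struct bdi_model *bdi; /* backing device of the inode */
    const struct fs_model  *fs;  /* filesystem type of the inode */
};

/* True only when both the bdi and the filesystem advertise support. */
static bool inode_cgwb_enabled_model(const struct inode_model *inode)
{
    assert(inode != NULL && inode->bdi != NULL && inode->fs != NULL);

    return (inode->bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
           (inode->fs->fs_flags & FS_CGROUP_WRITEBACK);
}
```

An inode whose bdi sets the capability but whose filesystem does not set its flag is reported as unsupported, which matches the requirement that both sides participate.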
     

28 May, 2015

10 commits

  • Andrew worried about the overhead on small systems; only use the fancy
    code when either perf or tracing is enabled.

    Cc: Rusty Russell
    Cc: Steven Rostedt
    Requested-by: Andrew Morton
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Rusty Russell

    Peter Zijlstra
     
  • The RCU implementation is chosen based on PREEMPT and SMP config options
    and is not really a user-selectable choice. This commit removes the
    menu entry, given that there is not much point in calling something a
    choice when there is in fact no choice. The TINY_RCU, TREE_RCU, and
    PREEMPT_RCU Kconfig options continue to be selected based solely on the
    values of the PREEMPT and SMP options.

    Signed-off-by: Pranith Kumar
    Signed-off-by: Paul E. McKenney

    Pranith Kumar
     
  • This commit updates the initialization of the kthread_prio boot parameter
    so that RCU will build even when CONFIG_RCU_KTHREAD_PRIO is undefined.
    The kthread_prio boot parameter is set to CONFIG_RCU_KTHREAD_PRIO if
    that is defined, otherwise to 1 if CONFIG_RCU_BOOST is defined and
    to zero otherwise. This commit then makes CONFIG_RCU_KTHREAD_PRIO
    depend on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
    CONFIG_RCU_KTHREAD_PRIO unless they want to be.

    Reported-by: Linus Torvalds
    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
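
The three-way default described in the commit above can be sketched with plain preprocessor conditionals. The KTHREAD_PRIO_DEFAULT helper macro is a name invented for this sketch, and the CONFIG_* symbols are assumed to be supplied by the build (for example with -D) or left undefined.

```c
/*
 * Sketch of the default selection: use CONFIG_RCU_KTHREAD_PRIO when it
 * is defined, otherwise 1 when CONFIG_RCU_BOOST is defined, otherwise 0.
 * KTHREAD_PRIO_DEFAULT is a hypothetical helper, not a kernel macro.
 */
#ifdef CONFIG_RCU_KTHREAD_PRIO
#define KTHREAD_PRIO_DEFAULT CONFIG_RCU_KTHREAD_PRIO
#elif defined(CONFIG_RCU_BOOST)
#define KTHREAD_PRIO_DEFAULT 1
#else
#define KTHREAD_PRIO_DEFAULT 0
#endif

/* Initial value of the boot parameter before any command-line override. */
static int kthread_prio = KTHREAD_PRIO_DEFAULT;
```

Because every branch yields a value, the variable always has an initializer, so the code builds whether or not the Kconfig symbol exists.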
     
  • This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so
    that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined.
    The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF
    when defined, otherwise it is set to 32 for 32-bit systems and 64 for
    64-bit systems. This commit then makes CONFIG_RCU_FANOUT_LEAF depend
    on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
    CONFIG_RCU_FANOUT_LEAF unless they want to be.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     
  • This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will
    build even when CONFIG_RCU_FANOUT is undefined. The RCU_FANOUT macro is
    set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set
    to 32 for 32-bit systems and 64 for 64-bit systems. This commit then
    makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig
    users won't be asked about CONFIG_RCU_FANOUT unless they want to be.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
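
A minimal sketch of the fallback described in the commit above, assuming the word size is detected from UINTPTR_MAX. A real kernel build would test its own configuration symbol for this; MODEL_64BIT is a name invented here.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's own 64-bit configuration test. */
#if UINTPTR_MAX > 0xffffffffu
#define MODEL_64BIT 1
#endif

/*
 * Use the Kconfig value when it exists; otherwise fall back to 64 on
 * 64-bit systems and 32 on 32-bit systems, as the commit describes.
 */
#ifdef CONFIG_RCU_FANOUT
#define RCU_FANOUT CONFIG_RCU_FANOUT
#elif defined(MODEL_64BIT)
#define RCU_FANOUT 64
#else
#define RCU_FANOUT 32
#endif

/* Exposed as a constant so ordinary code can read the chosen value. */
static const int rcu_fanout_value = RCU_FANOUT;
```

The same shape applies to the RCU_FANOUT_LEAF macro from the preceding commit, with CONFIG_RCU_FANOUT_LEAF as the optional input.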
     
  • RCU_FANOUT_LEAF's range and default values depend on the value of
    RCU_FANOUT, which at the time seemed like a cute way to save two lines
    of Kconfig code. However, adding a dependency from both of these
    Kconfig parameters on RCU_EXPERT requires that RCU_FANOUT_LEAF operate
    correctly even if RCU_FANOUT is undefined. This commit therefore
    allows RCU_FANOUT_LEAF to take on the full range of permitted values,
    even in cases where RCU_FANOUT is undefined.

    Signed-off-by: Paul E. McKenney
    [ paulmck: Eliminate redundant "default" as suggested by Pranith Kumar. ]
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     
  • This commit creates an RCU_EXPERT Kconfig and hides the independent
    boolean RCU-related user-visible Kconfig parameters behind it, namely
    RCU_FAST_NO_HZ and RCU_BOOST. This prevents Kconfig from asking about
    these parameters unless the user really wants to be asked.

    Reported-by: Linus Torvalds
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     
  • The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and
    perhaps only) by rcutorture to verify that RCU works correctly in specific
    rcu_node combining-tree configurations. It therefore does not make
    much sense to have this as a question to people attempting to configure
    their kernels. So this commit creates an rcutree.rcu_fanout_exact=
    boot parameter that rcutorture can use, and eliminates the original
    CONFIG_RCU_FANOUT_EXACT Kconfig parameter.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     
  • Currently, Kconfig will ask the user whether RCU_USER_QS should be set.
    This is silly because Kconfig already has all the information that it
    needs to set this parameter. This commit therefore directly drives
    the value of RCU_USER_QS via NO_HZ_FULL's "select" statement.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar
    Acked-by: Frederic Weisbecker

    Paul E. McKenney
     
  • Currently, Kconfig will ask the user whether TASKS_RCU should be set.
    This is silly because Kconfig already has all the information that it
    needs to set this parameter. This commit therefore directly drives
    the value of TASKS_RCU via "select" statements. Which means that
    as subsystems require TASKS_RCU, those subsystems will need to add
    "select" statements of their own.

    Reported-by: Ingo Molnar
    Signed-off-by: Paul E. McKenney
    Cc: Steven Rostedt
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     

27 May, 2015

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The cgroup side of threadgroup locking uses signal_struct->group_rwsem
    to synchronize against threadgroup changes. This per-process rwsem
    adds a small overhead to thread creation, exit and exec paths, forces
    cgroup code paths to do a lock-verify-unlock-retry dance in a couple of
    places and makes it impossible to atomically perform operations across
    multiple processes.

    This patch replaces signal_struct->group_rwsem with a global
    percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
    side and contained in cgroups proper. This patch converts one-to-one.

    This does make the writer side heavier and lower the granularity; however,
    cgroup process migration is a fairly cold path, we do want to optimize
    thread operations over it and cgroup migration operations don't take
    enough time for the lower granularity to matter.

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     

20 May, 2015

1 commit

    This adds an extra argument onto parse_args() to be used
    as a way to make the unused callback a bit more useful and
    generic by allowing the caller to pass on a data structure
    of its choice. An example use case is to allow us to easily
    make module parameters for every module which we will do
    next.

    @ parse @
    identifier name, args, params, num, level_min, level_max;
    identifier unknown, param, val, doing;
    type s16;
    @@
    extern char *parse_args(const char *name,
    char *args,
    const struct kernel_param *params,
    unsigned num,
    s16 level_min,
    s16 level_max,
    + void *arg,
    int (*unknown)(char *param, char *val,
    const char *doing
    + , void *arg
    ));

    @ parse_mod @
    identifier name, args, params, num, level_min, level_max;
    identifier unknown, param, val, doing;
    type s16;
    @@
    char *parse_args(const char *name,
    char *args,
    const struct kernel_param *params,
    unsigned num,
    s16 level_min,
    s16 level_max,
    + void *arg,
    int (*unknown)(char *param, char *val,
    const char *doing
    + , void *arg
    ))
    {
    ...
    }

    @ parse_args_found @
    expression R, E1, E2, E3, E4, E5, E6;
    identifier func;
    @@

    (
    R =
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    func);
    |
    R =
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    &func);
    |
    R =
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    NULL);
    |
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    func);
    |
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    &func);
    |
    parse_args(E1, E2, E3, E4, E5, E6,
    + NULL,
    NULL);
    )

    @ parse_args_unused depends on parse_args_found @
    identifier parse_args_found.func;
    @@

    int func(char *param, char *val, const char *unused
    + , void *arg
    )
    {
    ...
    }

    @ mod_unused depends on parse_args_found @
    identifier parse_args_found.func;
    expression A1, A2, A3;
    @@

    - func(A1, A2, A3);
    + func(A1, A2, A3, NULL);

    Generated-by: Coccinelle SmPL
    Cc: cocci@systeme.lip6.fr
    Cc: Tejun Heo
    Cc: Arjan van de Ven
    Cc: Greg Kroah-Hartman
    Cc: Rusty Russell
    Cc: Christoph Hellwig
    Cc: Felipe Contreras
    Cc: Ewan Milne
    Cc: Jean Delvare
    Cc: Hannes Reinecke
    Cc: Jani Nikula
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Tejun Heo
    Acked-by: Rusty Russell
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Greg Kroah-Hartman

    Luis R. Rodriguez
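
The idea behind the extra argument in the commit above, forwarding a caller-chosen pointer to the "unknown parameter" callback, can be shown with a toy userspace parser. Everything here is an assumption made for illustration: toy_parse_args(), struct seen_counter and count_unknown() are hypothetical names, and the toy treats every parameter as unknown and returns an int rather than mirroring the kernel's signature.

```c
#include <stddef.h>
#include <string.h>

/* Callback type: the last parameter is the caller's opaque pointer. */
typedef int (*unknown_fn)(char *param, char *val, const char *doing,
                          void *arg);

/*
 * Toy parser: splits "a=1 b c=3" in place and hands every parameter to
 * `unknown`, forwarding `arg` untouched. Returns 0 on success or the
 * first non-zero callback result.
 */
static int toy_parse_args(const char *doing, char *args, void *arg,
                          unknown_fn unknown)
{
    char *tok = strtok(args, " ");

    while (tok != NULL) {
        char *val = strchr(tok, '=');
        int ret;

        if (val != NULL)
            *val++ = '\0';  /* split "key=value"; val stays NULL for bare keys */

        ret = unknown(tok, val, doing, arg);
        if (ret != 0)
            return ret;
        tok = strtok(NULL, " ");
    }
    return 0;
}

/* Example caller state reached through the opaque pointer. */
struct seen_counter { int count; };

/* Example callback: counts parameters without needing a global variable. */
static int count_unknown(char *param, char *val, const char *doing,
                         void *arg)
{
    struct seen_counter *counter = arg;

    (void)param;
    (void)val;
    (void)doing;
    counter->count++;
    return 0;
}
```

Passing the state through `arg` lets one callback serve several callers, each with its own data, which is the generality the commit message is after.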
     

08 May, 2015

1 commit

  • On powerpc the perf event interrupt is not masked when interrupts are
    disabled, allowing it to function as an NMI.

    This causes problems if perf is using vmalloc. If we take a page fault
    on the vmalloc region the fault handler will fail the page fault because
    it detects we are coming in from an NMI (see do_hash_page()).

    We don't actually need or want vmalloc backed perf so just disable it on
    powerpc.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Anton Blanchard
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Cc: acme@ghostprotocols.net
    Cc: sukadev@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1430720799-18426-1-git-send-email-mpe@ellerman.id.au
    Signed-off-by: Ingo Molnar

    Michael Ellerman