15 May, 2019

1 commit

  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node they
    would fit and cause zero conflicts. While on separate emulated nodes
    without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe a measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].
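
    The conflict pattern in case 1/ can be modelled with a toy direct-mapped
    cache. This is a minimal userspace sketch, not kernel code: the function
    name distinct_slots() and the "pfn % cache_pages" slot mapping are
    assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Toy model (an assumption, not kernel code): a direct-mapped cache with
 * cache_pages slots, where page frame number pfn lands in slot
 * pfn % cache_pages. Returns how many distinct slots the given pages
 * occupy; pages that share a slot evict each other.
 */
size_t distinct_slots(const unsigned long *pfns, size_t n, size_t cache_pages)
{
	size_t used = 0;
	unsigned char *seen;

	assert(cache_pages > 0);
	seen = calloc(cache_pages, 1);
	if (!seen)
		return 0;

	for (size_t i = 0; i < n; i++) {
		size_t slot = pfns[i] % cache_pages;

		if (!seen[slot]) {
			seen[slot] = 1;
			used++;
		}
	}
	free(seen);
	return used;
}
```

    With two emulated nodes of 8 pages each and an 8-page cache, in-order
    allocation of 4 pages per node (pfns 0-3 and 8-11) touches only 4 of the
    8 slots, so the two workloads evict each other while half the cache sits
    idle.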

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is a potential
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in order.
    However, it should be noted, the concrete security benefits are hard to
    quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10
    (4MB). This trades off randomization granularity for time spent shuffling.
    MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
    while still showing memory-side cache behavior improvements, and with the
    expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.
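
    For readers unfamiliar with the technique, a Fisher-Yates shuffle can be
    sketched in userspace as follows. This is not the kernel's
    shuffle_zone(): it shuffles a plain array of block indices, and
    toy_rand() is a hypothetical xorshift generator standing in for a real
    entropy source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entropy source: xorshift64 PRNG. */
static uint64_t toy_rand(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Fisher-Yates shuffle of n block indices: walk from the last element
 * down, swapping each element with a randomly chosen element at or
 * below its position. The modulo adds a slight bias that a toy example
 * can ignore.
 */
void shuffle_blocks(unsigned long *blocks, size_t n, uint64_t seed)
{
	uint64_t state = seed;

	assert(seed != 0);	/* xorshift must not start at zero */
	for (size_t i = n; i > 1; i--) {
		size_t j = toy_rand(&state) % i;	/* j in [0, i-1] */
		unsigned long tmp = blocks[i - 1];

		blocks[i - 1] = blocks[j];
		blocks[j] = tmp;
	}
}
```

    Each element is swapped at most once per pass, so the cost is linear in
    the number of blocks, which is consistent with shuffling at a coarse
    granularity to keep the time spent low.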

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask whether the page free entropy alone is sufficient; it is not, due
    to the in-order initial freeing of pages. At the start of that process,
    putting page1 in front of or behind page0 still keeps them close together,
    and page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

08 May, 2019

1 commit

  • Pull driver core/kobject updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.2-rc1

    There are a number of ACPI patches in here as well, as Rafael said
    they should go through this tree due to the driver core changes they
    required. They have all been acked by the ACPI developers.

    There are also a number of small subsystem-specific changes in here,
    due to some changes to the kobject core code. Those too have all been
    acked by the various subsystem maintainers.

    As for content, it's pretty boring outside of the ACPI changes:
    - spdx cleanups
    - kobject documentation updates
    - default attribute groups for kobjects
    - other minor kobject/driver core fixes

    All have been in linux-next for a while with no reported issues"

    * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
    kobject: clean up the kobject add documentation a bit more
    kobject: Fix kernel-doc comment first line
    kobject: Remove docstring reference to kset
    firmware_loader: Fix a typo ("syfs" -> "sysfs")
    kobject: fix dereference before null check on kobj
    Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
    init/config: Do not select BUILD_BIN2C for IKCONFIG
    Provide in-kernel headers to make extending kernel easier
    kobject: Improve doc clarity kobject_init_and_add()
    kobject: Improve docs for kobject_add/del
    driver core: platform: Fix the usage of platform device name(pdev->name)
    livepatch: Replace klp_ktype_patch's default_attrs with groups
    cpufreq: schedutil: Replace default_attrs field with groups
    padata: Replace padata_attr_type default_attrs field with groups
    irqdesc: Replace irq_kobj_type's default_attrs field with groups
    net-sysfs: Replace ktype default_attrs field with groups
    block: Replace all ktype default_attrs with groups
    samples/kobject: Replace foo_ktype's default_attrs field with groups
    kobject: Add support for default attribute groups to kobj_type
    driver core: Postpone DMA tear-down until after devres release for probe failure
    ...

    Linus Torvalds
     

29 Apr, 2019

2 commits

  • Since commit 13610aa908dc ("kernel/configs: use .incbin directive to
    embed config_data.gz"), IKCONFIG no longer uses BUILD_BIN2C so prevent
    it from being selected in Kconfig.

    Reviewed-by: Masahiro Yamada
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes (Google)
     
  • Introduce in-kernel headers which are made available as an archive
    through proc (/proc/kheaders.tar.xz file). This archive makes it
    possible to run eBPF and other tracing programs that need to extend the
    kernel for tracing purposes without any dependency on the file system
    having headers.

    A github PR is sent for the corresponding BCC patch at:
    https://github.com/iovisor/bcc/pull/2312

    On Android and embedded systems, it is common to switch kernels but not
    have kernel headers available on the file system. Further, once a
    different kernel is booted, any headers stored on the file system will
    no longer be useful. This issue is well known even to distros.
    By storing the headers as a compressed archive within the kernel, we can
    avoid these issues that have been a hindrance for a long time.

    The best way to use this feature is by building it in. Several users
    need this: when they switch debug kernels, they do not want to
    update the filesystem or worry about where to store the headers on
    it. However, the feature is also buildable as a module in case the user
    does not want it to be part of the kernel image. This makes it possible to
    load and unload the headers from memory on demand. A tracing program can
    load the module, do its operations, and then unload the module to save
    kernel memory. The total memory needed is 3.3MB.

    By having the archive available at a fixed location independent of
    filesystem dependencies and conventions, all debugging tools can
    directly refer to the fixed location for the archive without concerning
    themselves with where the headers live on a typical filesystem, which
    significantly simplifies tooling that needs kernel headers.

    The code to read the headers is based on /proc/config.gz code and uses
    the same technique to embed the headers.

    Other approaches were discussed, such as having an in-memory mountable
    filesystem, but that has drawbacks such as requiring an in-kernel xz
    decompressor which we don't have today, and requiring usage of 42 MB of
    kernel memory to host the decompressed headers at any time. Also, this
    approach is simpler than those alternatives.

    Reviewed-by: Masahiro Yamada
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes (Google)
     

19 Apr, 2019

1 commit

  • Make the anon_inodes facility unconditional so that it can be used by core
    VFS code and pidfd code.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    [christian@brauner.io: adapt commit message to mention pidfds]
    Signed-off-by: Christian Brauner

    David Howells
     

11 Mar, 2019

2 commits

  • Pull Kbuild updates from Masahiro Yamada:

    - do not generate unneeded top-level built-in.a

    - let git ignore O= directory entirely

    - optimize scripts/kallsyms slightly

    - exclude DWARF info from *.s regardless of config options

    - fix GCC toolchain search path for Clang to prepare ld.lld support

    - do not generate modules.order when CONFIG_MODULES is disabled

    - simplify single target rules and remove VPATH for external module
    build

    - allow to add optional flags to dpkg-buildpackage when building
    deb-pkg

    - move some compiler option tests from Makefile to Kconfig

    - various Makefile cleanups

    * tag 'kbuild-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (40 commits)
    kbuild: remove scripts/basic/% build target
    kbuild: use -Werror=implicit-... instead of -Werror-implicit-...
    kbuild: clean up scripts/gcc-version.sh
    kbuild: remove cc-version macro
    kbuild: update comment block of scripts/clang-version.sh
    kbuild: remove commented-out INITRD_COMPRESS
    kbuild: move -gsplit-dwarf, -gdwarf-4 option tests to Kconfig
    kbuild: [bin]deb-pkg: add DPKG_FLAGS variable
    kbuild: move ".config not found!" message from Kconfig to Makefile
    kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
    kbuild: simplify single target rules
    kbuild: remove empty rules for makefiles
    kbuild: make -r/-R effective in top Makefile for old Make versions
    kbuild: move tools_silent to a more relevant place
    kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig
    kbuild: refactor cc-cross-prefix implementation
    kbuild: hardcode genksyms path and remove GENKSYMS variable
    scripts/gdb: refactor rules for symlink creation
    kbuild: create symlink to vmlinux-gdb.py in scripts_gdb target
    scripts/gdb: do not descend into scripts/gdb from scripts
    ...

    Linus Torvalds
     
  • Pull timer fix from Thomas Gleixner:
    "A single fix to prevent an unmet dependencies warning in Kconfig"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Make VIRT_CPU_ACCOUNTING_GEN depend on GENERIC_CLOCKEVENTS

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Moving the CONTEXT_TRACKING Kconfig option into kernel/time/Kconfig added
    an implicit dependency on the surrounding GENERIC_CLOCKEVENTS option, but
    this is not always enabled when it is possible to select
    VIRT_CPU_ACCOUNTING_GEN:

    WARNING: unmet direct dependencies detected for CONTEXT_TRACKING
    Depends on [n]: GENERIC_CLOCKEVENTS [=n]
    Selected by [y]:
    - VIRT_CPU_ACCOUNTING_GEN [=y] && <choice> && HAVE_CONTEXT_TRACKING [=y] && HAVE_VIRT_CPU_ACCOUNTING_GEN [=y]

    Platforms without GENERIC_CLOCKEVENTS are rare enough so that corner case
    can be just ignored. Make it a dependency for VIRT_CPU_ACCOUNTING_GEN to
    simplify the configuration.

    Fixes: a4cffdad7314 ("time: Move CONTEXT_TRACKING to kernel/time/Kconfig")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Thomas Gleixner
    Cc: "Paul E . McKenney"
    Cc: Frederic Weisbecker
    Link: https://lkml.kernel.org/r/20190304200202.1163250-1-arnd@arndb.de

    Arnd Bergmann
     

04 Mar, 2019

1 commit


28 Feb, 2019

1 commit

  • The submission queue (SQ) and completion queue (CQ) rings are shared
    between the application and the kernel. This eliminates the need to
    copy data back and forth to submit and complete IO.

    IO submissions use the io_uring_sqe data structure, and completions
    are generated in the form of io_uring_cqe data structures. The SQ
    ring is an index into the io_uring_sqe array, which makes it possible
    to submit a batch of IOs without them being contiguous in the ring.
    The CQ ring is always contiguous, as completion events are inherently
    unordered, and hence any io_uring_cqe entry can point back to an
    arbitrary submission.
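
    The index indirection described above can be illustrated with a toy,
    single-threaded model. The toy_* names and the reduced sqe layout are
    assumptions for illustration only; this is not the io_uring ABI, and a
    real ring shared with the kernel additionally needs memory barriers
    around head and tail updates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES 8	/* power of two so positions can be masked */

/* Simplified stand-in; the real io_uring_sqe has many more fields. */
struct toy_sqe {
	uint64_t user_data;
};

struct toy_sq {
	uint32_t head;			/* consumer position */
	uint32_t tail;			/* producer position */
	uint32_t array[TOY_ENTRIES];	/* ring of indices into sqes[] */
	struct toy_sqe sqes[TOY_ENTRIES];
};

/* Producer: publish sqes[sqe_index] by appending its index to the ring. */
int toy_sq_push(struct toy_sq *sq, uint32_t sqe_index)
{
	if (sqe_index >= TOY_ENTRIES || sq->tail - sq->head == TOY_ENTRIES)
		return -1;	/* bad index or ring full */
	sq->array[sq->tail & (TOY_ENTRIES - 1)] = sqe_index;
	sq->tail++;
	return 0;
}

/* Consumer: follow the next ring slot to the sqe it refers to. */
const struct toy_sqe *toy_sq_pop(struct toy_sq *sq)
{
	const struct toy_sqe *sqe;

	assert(sq->tail - sq->head <= TOY_ENTRIES);
	if (sq->head == sq->tail)
		return NULL;	/* ring empty */
	sqe = &sq->sqes[sq->array[sq->head & (TOY_ENTRIES - 1)]];
	sq->head++;
	return sqe;
}
```

    Because the ring holds indices rather than the entries themselves, the
    producer can fill sqes[5] and sqes[2] in any order and then submit them
    as one batch by pushing 5 and 2.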

    Two new system calls are added for this:

    io_uring_setup(entries, params)
        Sets up an io_uring instance for doing async IO. On success,
        returns a file descriptor that the application can mmap to
        gain access to the SQ ring, CQ ring, and io_uring_sqes.

    io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
        Initiates IO against the rings mapped to this fd, or waits for
        them to complete, or both. The behavior is controlled by the
        parameters passed in. If 'to_submit' is non-zero, then we'll
        try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
        kernel will wait for 'min_complete' events, if they aren't
        already available. It's valid to set IORING_ENTER_GETEVENTS
        and 'min_complete' == 0 at the same time; this allows the
        kernel to return already completed events without waiting
        for them. This is useful only for polling, as for IRQ
        driven IO, the application can just check the CQ ring
        without entering the kernel.

    With this setup, it's possible to do async IO with a single system
    call. Future developments will enable polled IO with this interface,
    and polled submission as well. The latter will enable an application
    to do IO without doing ANY system calls at all.

    For IRQ driven IO, an application only needs to enter the kernel for
    completions if it wants to wait for them to occur.

    Each io_uring is backed by a workqueue, to support buffered async IO
    as well. We will only punt to an async context if the command would
    need to wait for IO on the device side. Any data that can be accessed
    directly in the page cache is done inline. This avoids the slowness
    issue of usual threadpools, since cached data is accessed as quickly
    as a sync interface.

    Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Feb, 2019

1 commit

  • Since -Wmaybe-uninitialized was introduced by GCC 4.7, we have patched
    various false positives:

    - commit e74fc973b6e5 ("Turn off -Wmaybe-uninitialized when building
    with -Os") turned off this option for -Os.

    - commit 815eb71e7149 ("Kbuild: disable 'maybe-uninitialized' warning
    for CONFIG_PROFILE_ALL_BRANCHES") turned off this option for
    CONFIG_PROFILE_ALL_BRANCHES

    - commit a76bcf557ef4 ("Kbuild: enable -Wmaybe-uninitialized warning
    for "make W=1"") turned off this option for GCC < 4.9
    Arnd provided more explanation in https://lkml.org/lkml/2017/3/14/903

    I think this looks better by shifting the logic from Makefile to Kconfig.

    Link: https://github.com/ClangBuiltLinux/linux/issues/350
    Signed-off-by: Masahiro Yamada
    Reviewed-by: Nathan Chancellor
    Tested-by: Nick Desaulniers

    Masahiro Yamada
     

02 Feb, 2019

2 commits

    The current help text caused some confusion in online forums about
    whether or not to default-enable or default-disable psi in vendor
    kernels. This is because it doesn't communicate the reason why we
    made this setting configurable in the first place: that the overhead is
    non-zero in an artificial scheduler stress test.

    Since this isn't representative of real workloads, and the effect was
    not measurable in scheduler-heavy real world applications such as the
    webservers and memcache installations at Facebook, it's fair to point
    out that this is a pretty cautious option to select.

    Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Link: http://lkml.kernel.org/r/20190129150813.15785-1-j.neuschaefer@gmx.net
    Signed-off-by: Jonathan Neuschäfer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Neuschäfer
     

14 Jan, 2019

1 commit

  • When building using GCC 4.7 or older, -ffunction-sections & the -pg flag
    used by ftrace are incompatible. This causes warnings or build failures
    (where -Werror applies) such as the following:

    arch/mips/generic/init.c:
    error: -ffunction-sections disabled; it makes profiling impossible

    This used to be taken into account by the ordering of calls to cc-option
    from within the top-level Makefile, which was introduced by commit
    90ad4052e85c ("kbuild: avoid conflict between -ffunction-sections and
    -pg on gcc-4.7"). Unfortunately this was broken when the
    CONFIG_LD_DEAD_CODE_DATA_ELIMINATION cc-option check was moved to
    Kconfig in commit e85d1d65cd8a ("kbuild: test dead code/data elimination
    support in Kconfig"), because the flags used by this check no longer
    include -pg.

    Fix this by not allowing CONFIG_LD_DEAD_CODE_DATA_ELIMINATION to be
    enabled at the same time as ftrace/CONFIG_FUNCTION_TRACER when building
    using GCC 4.7 or older.

    Signed-off-by: Paul Burton
    Fixes: e85d1d65cd8a ("kbuild: test dead code/data elimination support in Kconfig")
    Reported-by: Geert Uytterhoeven
    Cc: Nicholas Piggin
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Masahiro Yamada

    Paul Burton
     

06 Jan, 2019

1 commit

  • Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

    The jump label is controlled by HAVE_JUMP_LABEL, which is defined
    like this:

    #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
    # define HAVE_JUMP_LABEL
    #endif

    We can improve this by testing 'asm goto' support in Kconfig, then
    make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

    Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
    match to the real kernel capability.

    Signed-off-by: Masahiro Yamada
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Sedat Dilek

    Masahiro Yamada
     

28 Dec, 2018

1 commit

  • Pull audit updates from Paul Moore:
    "In the finest of holiday traditions, I have a number of gifts to
    share today. While most of them are re-gifts from others, unlike the
    typical re-gift, these are things you will want in and around your
    tree; I promise.

    This pull request is perhaps a bit larger than our typical PR, but
    most of it comes from Jan's rework of audit's fanotify code; a very
    welcome improvement. We ran this through our normal regression tests,
    as well as some newly created stress tests and everything looks good.

    Richard added a few patches, mostly cleaning up a few things and
    shortening some of the audit records that we send to userspace; a
    change the userspace folks are quite happy about.

    Finally YueHaibing and I kick in a few patches to simplify things a
    bit and make the code less prone to errors.

    Lastly, I want to say thanks one more time to everyone who has
    contributed patches, testing, and code reviews for the audit subsystem
    over the past year. The project is what it is due to your help and
    contributions - thank you"

    * tag 'audit-pr-20181224' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: (22 commits)
    audit: remove duplicated include from audit.c
    audit: shorten PATH cap values when zero
    audit: use current whenever possible
    audit: minimize our use of audit_log_format()
    audit: remove WATCH and TREE config options
    audit: use session_info helper
    audit: localize audit_log_session_info prototype
    audit: Use 'mark' name for fsnotify_mark variables
    audit: Replace chunk attached to mark instead of replacing mark
    audit: Simplify locking around untag_chunk()
    audit: Drop all unused chunk nodes during deletion
    audit: Guarantee forward progress of chunk untagging
    audit: Allocate fsnotify mark independently of chunk
    audit: Provide helper for dropping mark's chunk reference
    audit: Remove pointless check in insert_hash()
    audit: Factor out chunk replacement code
    audit: Make hash table insertion safe against concurrent lookups
    audit: Embed key into chunk
    audit: Fix possible tagging failures
    audit: Fix possible spurious -ENOSPC error
    ...

    Linus Torvalds
     

15 Dec, 2018

1 commit

  • The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
    help text contradicts the documentation in kernel-parameters.txt, and
    the code. Fix that.

    Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org
    Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels")
    Signed-off-by: Baruch Siach
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     

01 Dec, 2018

1 commit

  • Mel Gorman reports a hackbench regression with psi that would prohibit
    shipping the suse kernel with it default-enabled, but he'd still like
    users to be able to opt in at little to no cost to others.

    With the current combination of CONFIG_PSI and the psi_disabled bool set
    from the commandline, this is a challenge. Do the following things to
    make it easier:

    1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
    to enable CONFIG_PSI in their kernel but leave the feature disabled
    unless a user requests it at boot-time.

    To avoid double negatives, rename psi_disabled= to psi=.

    2. Make psi_disabled a static branch to eliminate any branch costs
    when the feature is disabled.

    In terms of numbers before and after this patch, Mel says:

    : The following is a comparison using CONFIG_PSI=n as a baseline against
    : your patch and a vanilla kernel
    :
    :                     4.20.0-rc4            4.20.0-rc4            4.20.0-rc4
    :            kconfigdisable-v1r1               vanilla       psidisable-v1r1
    : Amean  1     1.3100 (   0.00%)     1.3923 (  -6.28%)     1.3427 (  -2.49%)
    : Amean  3     3.8860 (   0.00%)     4.1230 *  -6.10%*     3.8860 (  -0.00%)
    : Amean  5     6.8847 (   0.00%)     8.0390 * -16.77%*     6.7727 (   1.63%)
    : Amean  7     9.9310 (   0.00%)    10.8367 *  -9.12%*     9.9910 (  -0.60%)
    : Amean 12    16.6577 (   0.00%)    18.2363 *  -9.48%*    17.1083 (  -2.71%)
    : Amean 18    26.5133 (   0.00%)    27.8833 *  -5.17%*    25.7663 (   2.82%)
    : Amean 24    34.3003 (   0.00%)    34.6830 (  -1.12%)    32.0450 (   6.58%)
    : Amean 30    40.0063 (   0.00%)    40.5800 (  -1.43%)    41.5087 (  -3.76%)
    : Amean 32    40.1407 (   0.00%)    41.2273 (  -2.71%)    39.9417 (   0.50%)
    :
    : It's showing that the vanilla kernel takes a hit (as the bisection
    : indicated it would) and that disabling PSI by default is reasonably
    : close in terms of performance for this particular workload on this
    : particular machine so;
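
    The percentages in Mel's table appear to be computed relative to the
    first (CONFIG_PSI=n) column, with negative values meaning a larger, i.e.
    worse, mean. A small sketch under that assumption; percent_vs_baseline()
    is a hypothetical helper, not something taken from his tooling:

```c
#include <assert.h>

/*
 * Relative change of value against baseline, in percent. Positive means
 * value is smaller than the baseline (an improvement if lower is better).
 */
double percent_vs_baseline(double baseline, double value)
{
	assert(baseline != 0.0);
	return 100.0 * (baseline - value) / baseline;
}
```

    For the first row this gives 100 * (1.3100 - 1.3923) / 1.3100, roughly
    -6.28, which agrees with the vanilla column.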

    Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Tested-by: Mel Gorman
    Reported-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Nov, 2018

1 commit


27 Oct, 2018

2 commits

  • On a system that executes multiple cgrouped jobs and independent
    workloads, we don't just care about the health of the overall system, but
    also that of individual jobs, so that we can ensure individual job health,
    fairness between jobs, or prioritize some jobs over others.

    This patch implements pressure stall tracking for cgroups. In kernels
    with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
    and io.pressure files that track aggregate pressure stall times for only
    the tasks inside the cgroup.
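
    A monitoring tool could read one line of such a pressure file roughly as
    sketched below. The line layout ("some avg10=... avg60=... avg300=...
    total=...") is taken from the kernel's psi documentation rather than from
    this commit message, and struct psi_line and parse_psi_line() are
    hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* percent of walltime */
	unsigned long long total;	/* cumulative stall time in microseconds */
};

/* Returns 0 on success, -1 if the line does not match the expected layout. */
int parse_psi_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}
```

    The same parser would apply to the system-wide files and to the
    per-cgroup ones, assuming they share a format.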

    Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When systems are overcommitted and resources become contended, it's hard
    to tell exactly the impact this has on workload productivity, or how close
    the system is to lockups and OOM kills. In particular, when machines work
    multiple jobs concurrently, the impact of overcommit in terms of latency
    and throughput on the individual job can be enormous.

    In order to maximize hardware utilization without sacrificing individual
    job health or risk complete machine lockups, this patch implements a way
    to quantify resource pressure in the system.

    A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
    expose the percentage of time the system is stalled on CPU, memory, or IO,
    respectively. Stall states are aggregate versions of the per-task delay
    accounting delays:

    cpu: some tasks are runnable but not executing on a CPU
    memory: tasks are reclaiming, or waiting for swapin or thrashing cache
    io: tasks are waiting for io completions

    These percentages of walltime can be thought of as pressure percentages,
    and they give a general sense of system health and productivity loss
    incurred by resource overcommit. They can also indicate when the system
    is approaching lockup scenarios and OOMs.

    To do this, psi keeps track of the task states associated with each CPU
    and samples the time they spend in stall states. Every 2 seconds, the
    samples are averaged across CPUs - weighted by the CPUs' non-idle time to
    eliminate artifacts from unused CPUs - and translated into percentages of
    walltime. A running average of those percentages is maintained over 10s,
    1m, and 5m periods (similar to the loadaverage).
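
    The per-period aggregation described above can be sketched as a weighted
    mean. This is a simplified floating-point illustration, not the kernel's
    code; pressure_percent() and its parameters are assumed names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * stall[i]:   time CPU i spent in the stall state during the period
 * nonidle[i]: time CPU i was non-idle during the period (the weight)
 * period:     length of the sampling period, in the same unit
 *
 * Returns the non-idle-weighted average stall time as a percentage of
 * the period. CPUs that were idle throughout contribute nothing.
 */
double pressure_percent(const double *stall, const double *nonidle,
			size_t ncpu, double period)
{
	double weighted = 0.0, weight = 0.0;

	assert(period > 0.0);
	for (size_t i = 0; i < ncpu; i++) {
		weighted += stall[i] * nonidle[i];
		weight += nonidle[i];
	}
	if (weight == 0.0)
		return 0.0;	/* every CPU was idle */
	return 100.0 * (weighted / weight) / period;
}
```

    With a 2 second period, one CPU that was busy throughout and stalled for
    1 second, and a second CPU that stayed idle, the result is 50%: the idle
    CPU does not dilute the figure.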

    [hannes@cmpxchg.org: doc fixlet, per Randy]
    Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
    [hannes@cmpxchg.org: code optimization]
    Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
    [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
    Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
    [hannes@cmpxchg.org: fix build]
    Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

02 Oct, 2018

1 commit

  • Create a config for enabling irq load tracking in the scheduler.
    irq load tracking is useful only when irq or paravirtual time is
    accounted but it's only possible with SMP for now.

    Also use __maybe_unused to remove the compilation warning in
    update_rq_clock_task() that has been introduced by:

    2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")

    Suggested-by: Ingo Molnar
    Reported-by: Dou Liyang
    Reported-by: Miguel Ojeda
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: dou_liyang@163.com
    Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
    Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

26 Aug, 2018

1 commit

  • Pull more Kbuild updates from Masahiro Yamada:

    - add build_{menu,n,g,x}config targets for compile-testing Kconfig

    - fix and improve recursive dependency detection in Kconfig

    - fix parallel building of menuconfig/nconfig

    - fix syntax error in clang-version.sh

    - suppress distracting log from syncconfig

    - remove obsolete "rpm" target

    - remove VMLINUX_SYMBOL(_STR) macro entirely

    - fix microblaze build with CONFIG_DYNAMIC_FTRACE

    - move compiler test for dead code/data elimination to Kconfig

    - rename well-known LDFLAGS variable to KBUILD_LDFLAGS

    - misc fixes and cleanups

    * tag 'kbuild-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: rename LDFLAGS to KBUILD_LDFLAGS
    kbuild: pass LDFLAGS to recordmcount.pl
    kbuild: test dead code/data elimination support in Kconfig
    initramfs: move gen_initramfs_list.sh from scripts/ to usr/
    vmlinux.lds.h: remove stale include
    export.h: remove VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR()
    Coccinelle: remove pci_alloc_consistent semantic to detect in zalloc-simple.cocci
    kbuild: make sorting initramfs contents independent of locale
    kbuild: remove "rpm" target, which is alias of "rpm-pkg"
    kbuild: Fix LOADLIBES rename in Documentation/kbuild/makefiles.txt
    kconfig: suppress "configuration written to .config" for syncconfig
    kconfig: fix "Can't open ..." in parallel build
    kbuild: Add a space after `!` to prevent parsing as file pattern
    scripts: modpost: check memory allocation results
    kconfig: improve the recursive dependency report
    kconfig: report recursive dependency involving 'imply'
    kconfig: error out when seeing recursive dependency
    kconfig: add build-only configurator targets
    scripts/dtc: consolidate include path options in Makefile

    Linus Torvalds
     

24 Aug, 2018

1 commit


23 Aug, 2018

2 commits

  • The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
    combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
    distribution kernels and also part of the defconfigs of various
    architectures.

    To make it easier for distributions to enable CHECKPOINT_RESTORE this
    removes EXPERT and moves the configuration option out of the EXPERT block.

    Link: http://lkml.kernel.org/r/20180712130733.11510-1-adrian@lisas.de
    Signed-off-by: Adrian Reber
    Acked-by: Oleg Nesterov
    Reviewed-by: Hendrik Brueckner
    Acked-by: Pavel Emelyanov
    Cc: Eric W. Biederman
    Cc: Andrei Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Reber
     
  • Correct typos of "it's" to "its".

    Link: http://lkml.kernel.org/r/0ac627b6-5527-55f4-0489-1631aa34fc11@infradead.org
    Signed-off-by: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

18 Aug, 2018

1 commit

  • Introduce new config option, which is used to replace repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern. Next patches add a little more
    memcg+kmem related code, so let's make the defines clearer.

    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Vladimir Davydov
    Tested-by: Shakeel Butt
    Cc: Al Viro
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Greg Kroah-Hartman
    Cc: Guenter Roeck
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Josef Bacik
    Cc: Li RongQing
    Cc: Matthew Wilcox
    Cc: Matthias Kaehlcke
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Philippe Ombredanne
    Cc: Roman Gushchin
    Cc: Sahitya Tummala
    Cc: Stephen Rothwell
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

16 Aug, 2018

3 commits

  • Pull Kconfig consolidation from Masahiro Yamada:
    "Consolidation of Kconfig files by Christoph Hellwig.

    Move the source statements of arch-independent Kconfig files instead
    of duplicating the includes in every arch/$(SRCARCH)/Kconfig"

    * tag 'kconfig-v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kconfig: add a "Memory Management options" menu
    kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
    kconfig: use a menu in arch/Kconfig to reduce clutter
    kconfig: include kernel/Kconfig.preempt from init/Kconfig
    Kconfig: consolidate the "Kernel hacking" menu
    kconfig: include common Kconfig files from top-level Kconfig
    kconfig: remove duplicate SWAP symbol definitions
    um: create a proper drivers Kconfig
    um: cleanup Kconfig files
    um: stop abusing KBUILD_KCONFIG

    Linus Torvalds
     
  • Pull Kconfig updates from Masahiro Yamada:

    - show clearer error messages where pkg-config is needed, but not
    installed

    - rename SYMBOL_AUTO to SYMBOL_NO_WRITE to reflect its semantics

    - create all necessary directories by Kconfig tool itself instead of
    Makefile

    - update the .config unconditionally when syncconfig is invoked

    - use 'include' directive instead of '-include' where
    include/config/{auto,tristate}.conf is mandatory

    - do not try to update the .config when running install targets

    - add .DELETE_ON_ERROR to delete partially updated files

    - misc cleanups and fixes

    * tag 'kconfig-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kconfig: remove P_ENV property type
    kconfig: remove unused sym_get_env_prop() function
    kconfig: fix the rule of mainmenu_stmt symbol
    init/Kconfig: Use short unix-style option instead of --longname
    Kbuild: Makefile.modbuiltin: include auto.conf and tristate.conf mandatory
    kbuild: remove auto.conf from prerequisite of phony targets
    kbuild: do not update config for 'make kernelrelease'
    kbuild: do not update config when running install targets
    kbuild: add .DELETE_ON_ERROR special target
    kbuild: use 'include' directive to load auto.conf from top Makefile
    kconfig: allow all config targets to write auto.conf if missing
    kconfig: make syncconfig update .config regardless of sym_change_count
    kconfig: create directories needed for syncconfig by itself
    kconfig: remove unneeded directory generation from local*config
    kconfig: split out useful helpers in confdata.c
    kconfig: rename file_write_dep and move it to confdata.c
    kconfig: fix typos in description of "choice" in kconfig-language.txt
    kconfig: handle format string before calling conf_message_callback()
    kconfig: rename SYMBOL_AUTO to SYMBOL_NO_WRITE
    kconfig: check for pkg-config on make {menu,n,g,x}config

    Linus Torvalds
     
  • Pull Kbuild updates from Masahiro Yamada:

    - verify depmod is installed before modules_install

    - support build salt in case build ids must be unique between builds

    - allow users to specify additional host compiler flags via HOST*FLAGS,
    and rename internal variables to KBUILD_HOST*FLAGS

    - update buildtar script to drop vax support, add arm64 support

    - update builddeb script for better debarch support

    - document the pit-fall of if_changed usage

    - fix parallel build of UML with O= option

    - make 'samples' target depend on headers_install to fix build errors

    - remove deprecated host-progs variable

    - add a new coccinelle script for refcount_t vs atomic_t check

    - improve double-test coccinelle script

    - misc cleanups and fixes

    * tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
    coccicheck: return proper error code on fail
    Coccinelle: doubletest: reduce side effect false positives
    kbuild: remove deprecated host-progs variable
    kbuild: make samples really depend on headers_install
    um: clean up archheaders recipe
    kbuild: add %asm-generic to no-dot-config-targets
    um: fix parallel building with O= option
    scripts: Add Python 3 support to tracing/draw_functrace.py
    builddeb: Add automatic support for sh{3,4}{,eb} architectures
    builddeb: Add automatic support for riscv* architectures
    builddeb: Add automatic support for m68k architecture
    builddeb: Add automatic support for or1k architecture
    builddeb: Add automatic support for sparc64 architecture
    builddeb: Add automatic support for mips{,64}r6{,el} architectures
    builddeb: Add automatic support for mips64el architecture
    builddeb: Add automatic support for ppc64 and powerpcspe architectures
    builddeb: Introduce functions to simplify kconfig tests in set_debarch
    builddeb: Drop check for 32-bit s390
    builddeb: Change architecture detection fallback to use dpkg-architecture
    builddeb: Skip architecture detection when KBUILD_DEBARCH is set
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Pull s390 updates from Heiko Carstens:
    "Since Martin is on vacation you get the s390 pull request from me:

    - Host large page support for KVM guests. As the patches have large
    impact on arch/s390/mm/ this series goes out via both the KVM and
    the s390 tree.

    - Add an option for no compression to the "Kernel compression mode"
    menu, this will come in handy with the rework of the early boot
    code.

    - A large rework of the early boot code that will make life easier
    for KASAN and KASLR. With the rework the bootable uncompressed
    image is not generated anymore, only the bzImage is available. For
    debugging purposes the new "no compression" option is used.

    - Re-enable the gcc plugins as the issue with the latent entropy
    plugin is solved with the early boot code rework.

    - More Spectre related changes:
    + Detect the etoken facility and remove expolines automatically.
    + Add expolines to a few more indirect branches.

    - A rewrite of the common I/O layer trace points to make them
    consumable by 'perf stat'.

    - Add support for format-3 PCI function measurement blocks.

    - Changes for the zcrypt driver:
    + Add attributes to indicate the load of cards and queues.
    + Restructure some code for the upcoming AP device support in KVM.

    - Build flags improvements in various Makefiles.

    - A few fixes for the kdump support.

    - A couple of patches for gcc 8 compile warning cleanup.

    - Cleanup s390 specific proc handlers.

    - Add s390 support to the restartable sequence self tests.

    - Some PTR_RET vs PTR_ERR_OR_ZERO cleanup.

    - Lots of bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (107 commits)
    s390/dasd: fix hanging offline processing due to canceled worker
    s390/dasd: fix panic for failed online processing
    s390/mm: fix addressing exception after suspend/resume
    rseq/selftests: add s390 support
    s390: fix br_r1_trampoline for machines without exrl
    s390/lib: use expoline for all bcr instructions
    s390/numa: move initial setup of node_to_cpumask_map
    s390/kdump: Fix elfcorehdr size calculation
    s390/cpum_sf: save TOD clock base in SDBs for time conversion
    KVM: s390: Add huge page enablement control
    s390/mm: Add huge page gmap linking support
    s390/mm: hugetlb pages within a gmap can not be freed
    KVM: s390: Add skey emulation fault handling
    s390/mm: Add huge pmd storage key handling
    s390/mm: Clear skeys for newly mapped huge guest pmds
    s390/mm: Clear huge page storage keys on enable_skey
    s390/mm: Add huge page dirty sync support
    s390/mm: Add gmap pmd invalidation and clearing
    s390/mm: Add gmap pmd notification bit setting
    s390/mm: Add gmap pmd linking
    ...

    Linus Torvalds
     

09 Aug, 2018

1 commit


02 Aug, 2018

3 commits


18 Jul, 2018

1 commit

  • In Fedora, the debug information is packaged separately (foo-debuginfo) and
    can be installed separately. There's been a long standing issue where only
    one version of a debuginfo package can be installed at a time. There's
    been an effort for Fedora for parallel debuginfo to rectify this problem.

    Part of the requirement to allow parallel debuginfo to work is that build ids
    are unique between builds. The existing upstream rpm implementation ensures
    this by re-calculating the build-id using the version and release as a
    seed. This doesn't work 100% for the kernel because of the vDSO which is
    its own binary and doesn't get updated when embedded.

    Fix this by adding some data in an ELF note for both the kernel and modules.
    The data is controlled via a Kconfig option so distributions can set it
    to an appropriate value to ensure uniqueness between builds.
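
A toy model of why the extra note data helps, assuming the build id is a hash over the linked contents (as with the linker's --build-id option). The function toy_build_id and the salt strings are hypothetical; the real build id is computed by the linker, not by code like this.

```python
import hashlib


def toy_build_id(image: bytes, salt: str = "") -> str:
    # Hypothetical model: the salt bytes stand in for the ELF note, so
    # they become part of the content that gets hashed.
    note = salt.encode("utf-8")
    return hashlib.sha1(image + note).hexdigest()


# The same binary content built for two different package releases.
vdso = b"identical vDSO code"
id_a = toy_build_id(vdso, "4.18.0-1.fc29")
id_b = toy_build_id(vdso, "4.18.0-2.fc29")
print(id_a != id_b)
```

Without a salt, two builds of unchanged code would hash to the same id, which is the collision that parallel debuginfo installation needs to avoid.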

    Suggested-by: Masahiro Yamada
    Signed-off-by: Laura Abbott
    Signed-off-by: Masahiro Yamada

    Laura Abbott
     

01 Jul, 2018

1 commit

  • …masahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - introduce __diag_* macros and suppress -Wattribute-alias warnings
    from GCC 8

    - fix stack protector test script for x86_64

    - fix line number handling in Kconfig

    - document that '#' starts a comment in Kconfig

    - handle P_SYMBOL property in dump debugging of Kconfig

    - correct help message of LD_DEAD_CODE_DATA_ELIMINATION

    - fix occasional segmentation faults in Kconfig

    * tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kconfig: loop boundary condition fix
    kbuild: reword help of LD_DEAD_CODE_DATA_ELIMINATION
    kconfig: handle P_SYMBOL in print_symbol()
    kconfig: document Kconfig source file comments
    kconfig: fix line numbers for if-entries in menu tree
    stack-protector: Fix test with 32-bit userland and CONFIG_64BIT=y
    powerpc: Remove -Wattribute-alias pragmas
    disable -Wattribute-alias warning for SYSCALL_DEFINEx()
    kbuild: add macro for controlling warnings to linux/compiler.h

    Linus Torvalds
     

28 Jun, 2018

1 commit

  • Since commit 5d20ee3192a5 ("kbuild: Allow LD_DEAD_CODE_DATA_ELIMINATION
    to be selectable if enabled"), HAVE_LD_DEAD_CODE_DATA_ELIMINATION is
    supposed to be selected by architectures that are capable of this
    functionality. LD_DEAD_CODE_DATA_ELIMINATION is now users' selection.
    Update the help message.

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

25 Jun, 2018

1 commit

  • Add "None" as the kernel compression mode.

    This option is useful for debugging the kernel in slow simulation
    environments, where decompressing and moving the kernel is awfully slow.

    Uncompressed kernel implementation might allow early boot code to skip the
    decompressor and jump right at uncompressed kernel image entry point.

    Platforms implementing that should define HAVE_KERNEL_UNCOMPRESSED.

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky

    Vasily Gorbik
     

14 Jun, 2018

1 commit

  • Currently the code is split over various files with dma- prefixes in the
    lib/ and drivers/base directories, and the number of files keeps growing.
    Move them into a single directory to keep the code together and remove
    the file name prefixes. To match the irq infrastructure this directory
    is placed under the kernel/ directory.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig