21 Jan, 2016

2 commits

  • What CONFIG_INET and CONFIG_LEGACY_KMEM guard inside the memory
    controller code is insignificant, having these conditionals is not
    worth the complication and fragility that comes with them.

    [akpm@linux-foundation.org: rework mem_cgroup_css_free() statement ordering]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Let the user know that CONFIG_MEMCG_KMEM does not apply to the cgroup2
    interface. This also makes legacy-only code sections stand out better.

    [arnd@arndb.de: mm: memcontrol: only manage socket pressure for CONFIG_INET]
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Jan, 2016

1 commit

  • Pull audit updates from Paul Moore:
    "Seven audit patches for 4.5, all very minor despite the diffstat.

    The diffstat churn for linux/audit.h can be attributed to needing to
    reshuffle the linux/audit.h header to fix the seccomp auditing issue
    (see the commit description for details).

    Besides the seccomp/audit fix, most of the fixes are around trying to
    improve the connection with the audit daemon and a Kconfig
    simplification. Nothing crazy, and everything passes our little
    audit-testsuite"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: always enable syscall auditing when supported and audit is enabled
    audit: force seccomp event logging to honor the audit_enabled flag
    audit: Delete unnecessary checks before two function calls
    audit: wake up threads if queue switched from limited to unlimited
    audit: include auditd's threads in audit_log_start() wait exception
    audit: remove audit_backlog_wait_overflow
    audit: don't needlessly reset valid wait time

    Linus Torvalds
     

17 Jan, 2016

1 commit

  • uselib hasn't been used since libc5; glibc does not use it. Deprecate
    uselib a bit more, by making the default y only if libc5 was widely used
    on the plaform.

    This makes arm64 kernel built with defconfig slightly smaller

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/2 up/down: 0/-1390 (-1390)
    function old new delta
    kernel_config_data 18164 18162 -2
    uselib_flags 20 - -20
    padzero 216 192 -24
    sys_uselib 380 - -380
    load_elf_library 964 - -964

    Signed-off-by: Riku Voipio
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     

13 Jan, 2016

2 commits

  • To the best of our knowledge, everyone who enables audit at compile
    time also enables syscall auditing; this patch simplifies the Kconfig
    menus by removing the option to disable syscall auditing when audit
    is selected and the target arch supports it.

    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull cgroup updates from Tejun Heo:

    - cgroup v2 interface is now official. It's no longer hidden behind a
    devel flag and can be mounted using the new cgroup2 fs type.

    Unfortunately, cpu v2 interface hasn't made it yet due to the
    discussion around in-process hierarchical resource distribution and
    only memory and io controllers can be used on the v2 interface at the
    moment.

    - The existing documentation which has always been a bit of mess is
    relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
    is added as the authoritative documentation for the v2 interface.

    - Some features are added through for-4.5-ancestor-test branch to
    enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
    netfilter changes will be merged through the net tree which pulled in
    the said branch.

    - Various cleanups

    * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: rename cgroup documentations
    cgroup: fix a typo.
    cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
    cgroup: demote subsystem init messages to KERN_DEBUG
    cgroup: Fix uninitialized variable warning
    cgroup: put controller Kconfig options in meaningful order
    cgroup: clean up the kernel configuration menu nomenclature
    cgroup_pids: fix a typo.
    Subject: cgroup: Fix incomplete dd command in blkio documentation
    cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
    cpuset: Replace all instances of time_t with time64_t
    cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
    cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
    cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type

    Linus Torvalds
     

19 Dec, 2015

2 commits


13 Dec, 2015

1 commit

  • Currently the full stop_machine() routine is only enabled on SMP if
    module unloading is enabled, or if the CPUs are hotpluggable. This
    leads to configurations where stop_machine() is broken as it will then
    only run the callback on the local CPU with irqs disabled, and not stop
    the other CPUs or run the callback on them.

    For example, this breaks MTRR setup on x86 in certain configs since
    ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
    text_poke_smp_batch() functions") as the MTRR is only established on the
    boot CPU.

    This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
    and HOTPLUG_CPU config options to compile the correct stop_machine() for
    the architecture, removing the false dependency on MODULE_UNLOAD in the
    process.

    Link: https://lkml.org/lkml/2014/10/8/124
    References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
    Signed-off-by: Chris Wilson
    Acked-by: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: Pranith Kumar
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: Iulia Manda
    Cc: Andy Lutomirski
    Cc: Rusty Russell
    Cc: Peter Zijlstra
    Cc: Chuck Ebbert
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A Thread B
    previous mem accesses previous mem accesses
    smp_mb() smp_mb()
    following mem accesses following mem accesses

    After the change, these pairs become:

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
    prev mem accesses
    barrier()
    follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all pro=E2=80=90
    cesses running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread is guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

    barrier() smp_mb() sys_membarrier()
    barrier() X X O
    smp_mb() X O O
    sys_membarrier() O O O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

09 Sep, 2015

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

2 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - a few misc things

    - Andy's "ambient capabilities"

    - fs/nofity updates

    - the ocfs2 queue

    - kernel/watchdog.c updates and feature work.

    - some of MM. Includes Andrea's userfaultfd feature.

    [ Hadn't noticed that userfaultfd was 'default y' when applying the
    patches, so that got fixed in this merge instead. We do _not_ mark
    new features that nobody uses yet 'default y' - Linus ]

    * emailed patches from Andrew Morton : (118 commits)
    mm/hugetlb.c: make vma_has_reserves() return bool
    mm/madvise.c: make madvise_behaviour_valid() return bool
    mm/memory.c: make tlb_next_batch() return bool
    mm/dmapool.c: change is_page_busy() return from int to bool
    mm: remove struct node_active_region
    mremap: simplify the "overlap" check in mremap_to()
    mremap: don't do uneccesary checks if new_len == old_len
    mremap: don't do mm_populate(new_addr) on failure
    mm: move ->mremap() from file_operations to vm_operations_struct
    mremap: don't leak new_vma if f_op->mremap() fails
    mm/hugetlb.c: make vma_shareable() return bool
    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    mm: fix status code which move_pages() returns for zero page
    mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
    genalloc: add support of multiple gen_pools per device
    genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
    mm/memblock: WARN_ON when nid differs from overlap region
    Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
    mm: defer flush of writable TLB entries
    mm: send one IPI per CPU to TLB flush all entries after unmapping pages
    ...

    Linus Torvalds
     

05 Sep, 2015

2 commits

  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accesssed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machine gets
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%)

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    User 581.00 611.43
    System 5804.93 4111.76
    Elapsed 161.03 122.12

    This is showing that the readers completed 24.40% faster with 29% less
    system CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test and the patched kernel was interrupts 180K times per second.

    The impact is lower on a single socket machine.

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%)

    4.2.0-rc1 4.2.0-rc1
    vanilla flushfull-v7
    User 58.09 57.64
    System 111.82 76.56
    Elapsed 27.29 22.55

    It's still a noticeable improvement with vmstat showing interrupts went
    from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads with no memory pressure or have
    relatively few mapped pages. It will have an unpredictable impact on the
    workload running on the CPU being flushed as it'll depend on how many TLB
    entries need to be refilled and how long that takes. Worst case, the TLB
    will be completely cleared of active entries when the target PFNs were not
    resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This allows to select the userfaultfd during configuration to build it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Sep, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:

    - a new PIDs controller is added. It turns out that PIDs are actually
    an independent resource from kmem due to the limited PID space.

    - more core preparations for the v2 interface. Once cpu side interface
    is settled, it should be ready for lifting the devel mask.
    for-4.3-unified-base was temporarily branched so that other trees
    (block) can pull cgroup core changes that blkcg changes depend on.

    - a non-critical idr_preload usage bug fix.

    * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: pids: fix invalid get/put usage
    cgroup: introduce cgroup_subsys->legacy_name
    cgroup: don't print subsystems for the default hierarchy
    cgroup: make cftype->private a unsigned long
    cgroup: export cgrp_dfl_root
    cgroup: define controller file conventions
    cgroup: fix idr_preload usage
    cgroup: add documentation for the PIDs controller
    cgroup: implement the PIDs subsystem
    cgroup: allow a cgroup subsystem to reject a fork

    Linus Torvalds
     

15 Aug, 2015

1 commit


14 Aug, 2015

1 commit


13 Aug, 2015

1 commit

  • The revised sign-file program is no longer a script that wraps the openssl
    program, but now rather a program that makes use of OpenSSL's crypto
    library. This means that to build the sign-file program, the kernel build
    process now has a dependency on the OpenSSL development packages in
    addition to OpenSSL itself.

    Document this in Kconfig and in module-signing.txt.

    Signed-off-by: David Howells
    Reviewed-by: David Woodhouse

    David Howells
     

07 Aug, 2015

6 commits

  • Let the user explicitly provide a file containing trusted keys, instead of
    just automatically finding files matching *.x509 in the build tree and
    trusting whatever we find. This really ought to be an *explicit*
    configuration, and the build rules for dealing with the files were
    fairly painful too.

    Fix applied from James Morris that removes an '=' from a macro definition
    in kernel/Makefile as this is a feature that only exists from GNU make 3.82
    onwards.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • The current rule for generating signing_key.priv and signing_key.x509 is
    a classic example of a bad rule which has a tendency to break parallel
    make. When invoked to create *either* target, it generates the other
    target as a side-effect that make didn't predict.

    So let's switch to using a single file signing_key.pem which contains
    both key and certificate. That matches what we do in the case of an
    external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
    slightly cleaner.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Where an external PEM file or PKCS#11 URI is given, we can get the cert
    from it for ourselves instead of making the user drop signing_key.x509
    in place for us.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Extract the function that drives the PKCS#7 signature verification given a
    data blob and a PKCS#7 blob out from the module signing code and lump it with
    the system keyring code as it's generic. This makes it independent of module
    config options and opens it to use by the firmware loader.

    Signed-off-by: David Howells
    Cc: Luis R. Rodriguez
    Cc: Rusty Russell
    Cc: Ming Lei
    Cc: Seth Forshee
    Cc: Kyle McMartin

    David Howells
     
  • Move to using PKCS#7 messages as module signatures because:

    (1) We have to be able to support the use of X.509 certificates that don't
    have a subjKeyId set. We're currently relying on this to look up the
    X.509 certificate in the trusted keyring list.

    (2) PKCS#7 message signed information blocks have a field that supplies the
    data required to match with the X.509 certificate that signed it.

    (3) The PKCS#7 certificate carries fields that specify the digest algorithm
    used to generate the signature in a standardised way and the X.509
    certificates specify the public key algorithm in a standardised way - so
    we don't need our own methods of specifying these.

    (4) We now have PKCS#7 message support in the kernel for signed kexec purposes
    and we can make use of this.

    To make this work, the old sign-file script has been replaced with a program
    that needs compiling in a previous patch. The rules to build it are added
    here.

    Signed-off-by: David Howells
    Tested-by: Vivek Goyal

    David Howells
     

23 Jul, 2015

1 commit


15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

07 Jul, 2015

1 commit


04 Jul, 2015

3 commits

  • Pull scheduler fixes from Ingo Molnar:
    "Debug info and other statistics fixes and related enhancements"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix numa balancing stats in /proc/pid/sched
    sched/numa: Show numa_group ID in /proc/sched_debug task listings
    sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h
    sched/stat: Expose /proc/pid/schedstat if CONFIG_SCHED_INFO=y
    sched/stat: Simplify the sched_info accounting dependency

    Linus Torvalds
     
  • Pull max log buf size increase from Ingo Molnar:
    "Ran into this limit recently, so increase it by an order of magnitude"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    printk: Increase maximum CONFIG_LOG_BUF_SHIFT from 21 to 25

    Linus Torvalds
     
  • Both CONFIG_SCHEDSTATS=y and CONFIG_TASK_DELAY_ACCT=y track task
    sched_info, which results in ugly #if clauses.

    Simplify the code by introducing a synthethic CONFIG_SCHED_INFO
    switch, selected by both.

    Signed-off-by: Naveen N. Rao
    Cc: Balbir Singh
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: a.p.zijlstra@chello.nl
    Cc: ricklind@us.ibm.com
    Link: http://lkml.kernel.org/r/8d19eef800811a94b0f91bcbeb27430a884d7433.1435255405.git.naveen.n.rao@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Naveen N. Rao
     

02 Jul, 2015

1 commit

  • Pull module updates from Rusty Russell:
    "Main excitement here is Peter Zijlstra's lockless rbtree optimization
    to speed module address lookup. He found some abusers of the module
    lock doing that too.

    A little bit of parameter work here too; including Dan Streetman's
    breaking up the big param mutex so writing a parameter can load
    another module (yeah, really). Unfortunately that broke the usual
    suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
    appended too"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
    modules: only use mod->param_lock if CONFIG_MODULES
    param: fix module param locks when !CONFIG_SYSFS.
    rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
    module: add per-module param_lock
    module: make perm const
    params: suppress unused variable error, warn once just in case code changes.
    modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
    kernel/module.c: avoid ifdefs for sig_enforce declaration
    kernel/workqueue.c: remove ifdefs over wq_power_efficient
    kernel/params.c: export param_ops_bool_enable_only
    kernel/params.c: generalize bool_enable_only
    kernel/module.c: use generic module param operaters for sig_enforce
    kernel/params: constify struct kernel_param_ops uses
    sysfs: tightened sysfs permission checks
    module: Rework module_addr_{min,max}
    module: Use __module_address() for module_address_lookup()
    module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
    module: Optimize __module_address() using a latched RB-tree
    rbtree: Implement generic latch_tree
    seqlock: Introduce raw_read_seqcount_latch()
    ...

    Linus Torvalds
     

01 Jul, 2015

1 commit

  • So I tried to some kernel debugging that produced a ton of kernel messages
    on a big box, and wanted to save them all: but CONFIG_LOG_BUF_SHIFT maxes
    out at 21 (2 MB).

    Increase it to 25 (32 MB).

    This does not affect any existing config or defaults.

    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Jun, 2015

2 commits

  • Pull cgroup updates from Tejun Heo:

    - threadgroup_lock got reorganized so that its users can pick the
    actual locking mechanism to use. Its only user - cgroups - is
    updated to use a percpu_rwsem instead of per-process rwsem.

    This makes things a bit lighter on hot paths and allows cgroups to
    perform and fail multi-task (a process) migrations atomically.
    Multi-task migrations are used in several places including the
    unified hierarchy.

    - Delegation rule and documentation added to unified hierarchy. This
    will likely be the last interface update from the cgroup core side
    for unified hierarchy before lifting the devel mask.

    - Some groundwork for the pids controller which is scheduled to be
    merged in the coming devel cycle.

    * 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: add delegation section to unified hierarchy documentation
    cgroup: require write perm on common ancestor when moving processes on the default hierarchy
    cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
    kernfs: make kernfs_get_inode() public
    MAINTAINERS: add a cgroup core co-maintainer
    cgroup: fix uninitialised iterator in for_each_subsys_which
    cgroup: replace explicit ss_mask checking with for_each_subsys_which
    cgroup: use bitmask to filter for_each_subsys
    cgroup: add seq_file forward declaration for struct cftype
    cgroup: simplify threadgroup locking
    sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
    sched, cgroup: reorganize threadgroup locking
    cgroup: switch to unsigned long for bitmasks
    cgroup: reorganize include/linux/cgroup.h
    cgroup: separate out include/linux/cgroup-defs.h
    cgroup: fix some comment typos

    Linus Torvalds
     
  • Merge second patchbomb from Andrew Morton:

    - most of the rest of MM

    - lots of misc things

    - procfs updates

    - printk feature work

    - updates to get_maintainer, MAINTAINERS, checkpatch

    - lib/ updates

    * emailed patches from Andrew Morton : (96 commits)
    exit,stats: /* obey this comment */
    coredump: add __printf attribute to cn_*printf functions
    coredump: use from_kuid/kgid when formatting corename
    fs/reiserfs: remove unneeded cast
    NILFS2: support NFSv2 export
    fs/befs/btree.c: remove unneeded initializations
    fs/minix: remove unneeded cast
    init/do_mounts.c: add create_dev() failure log
    kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
    fs/efs: femove unneeded cast
    checkpatch: emit "NOTE: " message only once after multiple files
    checkpatch: emit an error when there's a diff in a changelog
    checkpatch: validate MODULE_LICENSE content
    checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
    checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
    checkpatch: fix processing of MEMSET issues
    checkpatch: suggest using ether_addr_equal*()
    checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
    checkpatch: remove local from codespell path
    checkpatch: add --showfile to allow input via pipe to show filenames
    ...

    Linus Torvalds
     

26 Jun, 2015

2 commits

  • Commit 818411616baf ("fs, proc: introduce /proc//task//children
    entry") introduced the children entry for checkpoint restore and the
    file is only available on kernels configured with CONFIG_EXPERT and
    CONFIG_CHECKPOINT_RESTORE.

    This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
    because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
    But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

    However, the children proc file is useful outside of checkpoint restore.
    I would like to use it in rkt. The rkt process exec() another program
    it does not control, and that other program will fork()+exec() a child
    process. I would like to find the pid of the child process from an
    external tool without iterating in /proc over all processes to find
    which one has a parent pid equal to rkt.

    This commit introduces CONFIG_PROC_CHILDREN and makes
    CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
    /proc//task//children without needing to enable
    CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

    Alban tested that /proc//task//children is present when the
    kernel is configured with CONFIG_PROC_CHILDREN=y but without
    CONFIG_CHECKPOINT_RESTORE

    Signed-off-by: Iago López Galeiras
    Tested-by: Alban Crequy
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Djalal Harouni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iago López Galeiras
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Andreas turned this option on, only to find out Debian (and Ubuntu!)
    don't enable support in their kmod builds.

    Shorten the text, and suggest N at the bottom (at least for now).

    Reported-by: Andreas Mohr
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Pull perf updates from Ingo Molnar:
    "Kernel side changes mostly consist of work on x86 PMU drivers:

    - x86 Intel PT (hardware CPU tracer) improvements (Alexander
    Shishkin)

    - x86 Intel CQM (cache quality monitoring) improvements (Thomas
    Gleixner)

    - x86 Intel PEBSv3 support (Peter Zijlstra)

    - x86 Intel PEBS interrupt batching support for lower overhead
    sampling (Zheng Yan, Kan Liang)

    - x86 PMU scheduler fixes and improvements (Peter Zijlstra)

    There's too many tooling improvements to list them all - here are a
    few select highlights:

    'perf bench':

    - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
    measure parallel waker threads generating contention for kernel
    locks (hb->lock). (Davidlohr Bueso)

    'perf top', 'perf report':

    - Allow disabling/enabling events dynamicaly in 'perf top':
    a 'perf top' session can instantly become a 'perf report'
    one, i.e. going from dynamic analysis to a static one,
    returning to a dynamic one is possible, to toogle the
    modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

    - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
    perf.data files (Namhyung Kim)

    'perf probe': (Masami Hiramatsu)

    - Support glob wildcards for function name
    - Support $params special probe argument: Collect all function arguments
    - Make --line checks validate C-style function name.
    - Add --no-inlines option to avoid searching inline functions
    - Greatly speed up 'perf probe --list' by caching debuginfo.
    - Improve --filter support for 'perf probe', allowing using its arguments
    on other commands, as --add, --del, etc.

    'perf sched':

    - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

    Plus tons of infrastructure work - in particular preparation for
    upcoming threaded perf report support, but also lots of other work -
    and fixes and other improvements. See (much) more details in the
    shortlog and in the git log"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
    perf tools: Configurable per thread proc map processing time out
    perf tools: Add time out to force stop proc map processing
    perf report: Fix sort__sym_cmp to also compare end of symbol
    perf hists browser: React to unassigned hotkey pressing
    perf top: Tell the user how to unfreeze events after pressing 'f'
    perf hists browser: Honour the help line provided by builtin-{top,report}.c
    perf hists browser: Do not exit when 'f' is pressed in 'report' mode
    perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
    perf annotate: Rename source_line_percent to source_line_samples
    perf annotate: Display total number of samples with --show-total-period
    perf tools: Ensure thread-stack is flushed
    perf top: Allow disabling/enabling events dynamicly
    perf evlist: Add toggle_enable() method
    perf trace: Fix race condition at the end of started workloads
    perf probe: Speed up perf probe --list by caching debuginfo
    perf probe: Show usage even if the last event is skipped
    perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
    perf tools: Fix a problem when opening old perf.data with different byte order
    perf tools: Ignore .config-detected in .gitignore
    perf probe: Fix to return error if no probe is added
    ...

    Linus Torvalds
     

02 Jun, 2015

1 commit

  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled() which determines whether a given inode's both bdi
    and fs support cgroup writeback is added.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo