11 Sep, 2015

10 commits

  • There are two kexec load syscalls, kexec_load another and kexec_file_load.
    kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
    split kexec_load syscall code to kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice verse.

    The original requirement is from Ted Ts'o, he want kexec kernel signature
    being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
    kexec_load syscall can bypass the checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
    architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
    KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
    kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • Split kexec_file syscall related code to another file kernel/kexec_file.c
    so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

    Sharing variables and functions are moved to kernel/kexec_internal.h per
    suggestion from Vivek and Petr.

    [akpm@linux-foundation.org: fix bisectability]
    [akpm@linux-foundation.org: declare the various arch_kexec functions]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • The UMH_WAIT_PROC handler runs in its own thread in order to make sure
    that waiting for the exec kernel thread completion won't block other
    usermodehelper queued jobs.

    On older workqueue implementations, worklets couldn't sleep without
    blocking the rest of the queue. But now the workqueue subsystem handles
    that. Khelper still had the older limitation due to its singlethread
    properties but we replaced it to system unbound workqueues.

    Those are affine to the current node and can block up to some number of
    instances.

    They are a good candidate to handle UMH_WAIT_PROC assuming that we have
    enough system unbound workers to handle lots of parallel usermodehelper
    jobs.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • We need to launch the usermodehelper kernel threads with the widest
    affinity and this is partly why we use khelper. This workqueue has
    unbound properties and thus a wide affinity inherited by all its children.

    Now khelper also has special properties that we aren't much interested in:
    ordered and singlethread. There is really no need about ordering as all
    we do is creating kernel threads. This can be done concurrently. And
    singlethread is a useless limitation as well.

    The workqueue engine already proposes generic unbound workqueues that
    don't share these useless properties and handle well parallel jobs.

    The only worrysome specific is their affinity to the node of the current
    CPU. It's fine for creating the usermodehelper kernel threads but those
    inherit this affinity for longer jobs such as requesting modules.

    This patch proposes to use these node affine unbound workqueues assuming
    that a node is sufficient to handle several parallel usermodehelper
    requests.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • There seem to be quite some confusions on the comments, likely due to
    changes that came after them.

    Now since it's very non obvious why we have 3 levels of asynchronous code
    to implement usermodehelpers, it's important to comment in detail the
    reason of this layout.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Khelper is affine to all CPUs. Now since it creates the
    call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
    affinity.

    As such explicitly forcing a wide affinity from those kernel threads
    is like a no-op.

    Just remove it. It's needless and it breaks CPU isolation users who
    rely on workqueue affinity tuning.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • This patchset does a bunch of cleanups and converts khelper to use system
    unbound workqueues. The 3 first patches should be uncontroversial. The
    last 2 patches are debatable.

    Kmod creates kernel threads that perform userspace jobs and we want those
    to have a large affinity in order not to contend busy CPUs. This is
    (partly) why we use khelper which has a wide affinity that the kernel
    threads it create can inherit from. Now khelper is a dedicated workqueue
    that has singlethread properties which we aren't interested in.

    Hence those two debatable changes:

    _ We would like to use generic workqueues. System unbound workqueues are
    a very good candidate but they are not wide affine, only node affine.
    Now probably a node is enough to perform many parallel kmod jobs.

    _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
    handler) to use the workqueue. It means that if the workqueue blocks,
    and no other worker can take pending kmod request, we can be screwed.
    Now if we have 512 threads, this should be enough.

    This patch (of 5):

    Underscores on function names aren't much verbose to explain the purpose
    of a function. And kmod has interesting such flavours.

    Lets rename the following functions:

    * __call_usermodehelper -> call_usermodehelper_exec_work
    * ____call_usermodehelper -> call_usermodehelper_exec_async
    * wait_for_helper -> call_usermodehelper_exec_sync

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • If request_module() successfully runs modprobe, but modprobe exits with a
    non-zero status, then the return value from request_module() will be that
    (positive) error status. So the return from request_module can be:

    negative errno
    zero for success
    positive exit code.

    Signed-off-by: NeilBrown
    Cc: Goldwyn Rodrigues
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]")
    added the kdebug mechanism to this file back in 2009.

    The kdebug macro calls no_printk which always evaluates arguments.

    Most of the kdebug uses have an unnecessary call of
    atomic_read(&cred->usage)

    Make the kdebug macro do nothing by defining it with
    do { if (0) no_printk(...); } while (0)
    when not enabled.

    $ size kernel/cred.o* (defconfig x86-64)
    text data bss dec hex filename
    2748 336 8 3092 c14 kernel/cred.o.new
    2788 336 8 3132 c3c kernel/cred.o.old

    Miscellanea:
    o Neaten the #define kdebug macros while there

    Signed-off-by: Joe Perches
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Signed-off-by: Wei Yongjun
    Acked-by: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     

09 Sep, 2015

7 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has lead to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When seq_show_option (commit a068acf2ee77: "fs: create and use
    seq_show_option for escaping") was merged, it did not correctly collide
    with cgroup's addition of legacy_name (commit 3e1d2eed39d8: "cgroup:
    introduce cgroup_subsys->legacy_name") changes.

    This fixes the reported name.

    Signed-off-by: Kees Cook
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     
  • Pull tracing update from Steven Rostedt:
    "Mostly this is just clean ups and micro optimizations.

    The changes with more meat are:

    - Allowing the trace event filters to filter on CPU number and
    process ids

    - Two new markers for trace output latency were added (10 and 100
    msec latencies)

    - Have tracing_thresh filter function profiling time

    I also worked on modifying the ring buffer code for some future work,
    and moved the adding of the timestamp around. One of my changes
    caused a regression, and since other changes were built on top of it
    and already tested, I had to operate a revert of that change. Instead
    of rebasing, this change set has the code that caused a regression as
    well as the code to revert that change without touching the other
    changes that were made on top of it"

    * tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
    tracing: Don't make assumptions about length of string on task rename
    tracing: Allow triggers to filter for CPU ids and process names
    ftrace: Format MCOUNT_ADDR address as type unsigned long
    tracing: Introduce two additional marks for delay
    ftrace: Fix function_graph duration spacing with 7-digits
    ftrace: add tracing_thresh to function profile
    tracing: Clean up stack tracing and fix fentry updates
    ring-buffer: Reorganize function locations
    ring-buffer: Make sure event has enough room for extend and padding
    ring-buffer: Get timestamp after event is allocated
    ring-buffer: Move the adding of the extended timestamp out of line
    ring-buffer: Add event descriptor to simplify passing data
    ftrace: correct the counter increment for trace_buffer data
    tracing: Fix for non-continuous cpu ids
    tracing: Prefer kcalloc over kzalloc with multiply

    Linus Torvalds
     
  • Pull audit update from Paul Moore:
    "This is one of the larger audit patchsets in recent history,
    consisting of eight patches and almost 400 lines of changes.

    The bulk of the patchset is the new "audit by executable"
    functionality which allows admins to set an audit watch based on the
    executable on disk. Prior to this, admins could only track an
    application by PID, which has some obvious limitations.

    Beyond the new functionality we also have some refcnt fixes and a few
    minor cleanups"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    fixup: audit: implement audit by executable
    audit: implement audit by executable
    audit: clean simple fsnotify implementation
    audit: use macros for unset inode and device values
    audit: make audit_del_rule() more robust
    audit: fix uninitialized variable in audit_add_rule()
    audit: eliminate unnecessary extra layer of watch parent references
    audit: eliminate unnecessary extra layer of watch references

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

06 Sep, 2015

3 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - a few misc things

    - Andy's "ambient capabilities"

    - fs/nofity updates

    - the ocfs2 queue

    - kernel/watchdog.c updates and feature work.

    - some of MM. Includes Andrea's userfaultfd feature.

    [ Hadn't noticed that userfaultfd was 'default y' when applying the
    patches, so that got fixed in this merge instead. We do _not_ mark
    new features that nobody uses yet 'default y' - Linus ]

    * emailed patches from Andrew Morton : (118 commits)
    mm/hugetlb.c: make vma_has_reserves() return bool
    mm/madvise.c: make madvise_behaviour_valid() return bool
    mm/memory.c: make tlb_next_batch() return bool
    mm/dmapool.c: change is_page_busy() return from int to bool
    mm: remove struct node_active_region
    mremap: simplify the "overlap" check in mremap_to()
    mremap: don't do uneccesary checks if new_len == old_len
    mremap: don't do mm_populate(new_addr) on failure
    mm: move ->mremap() from file_operations to vm_operations_struct
    mremap: don't leak new_vma if f_op->mremap() fails
    mm/hugetlb.c: make vma_shareable() return bool
    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    mm: fix status code which move_pages() returns for zero page
    mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
    genalloc: add support of multiple gen_pools per device
    genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
    mm/memblock: WARN_ON when nid differs from overlap region
    Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
    mm: defer flush of writable TLB entries
    mm: send one IPI per CPU to TLB flush all entries after unmapping pages
    ...

    Linus Torvalds
     
  • In commit f341861fb0b ("task_work: add a scheduling point in
    task_work_run()") I fixed a latency problem adding a cond_resched()
    call.

    Later, commit ac3d0da8f329 added yet another loop to reverse a list,
    bringing back the latency spike :

    I've seen in some cases this loop taking 275 ms, if for example a
    process with 2,000,000 files is killed.

    We could add yet another cond_resched() in the reverse loop, or we
    can simply remove the reversal, as I do not think anything
    would depend on order of task_work_add() submitted works.

    Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

05 Sep, 2015

17 commits

  • This activates the userfaultfd syscall.

    [sfr@canb.auug.org.au: activate syscall fix]
    [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • These two flags gets set in vma->vm_flags to tell the VM common code
    if the userfaultfd is armed and in which mode (only tracking missing
    faults, only tracking wrprotect faults or both). If neither flags is
    set it means the userfaultfd is not armed on the vma.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This adds the vm_userfaultfd_ctx to the vm_area_struct.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
    of the current hardcoded 1 (that would wake just the first waitqueue in
    the head list).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Rename watchdog_suspend() to lockup_detector_suspend() and
    watchdog_resume() to lockup_detector_resume() to avoid confusion with the
    watchdog subsystem and to be consistent with the existing name
    lockup_detector_init().

    Also provide comment blocks to explain the watchdog_running and
    watchdog_suspended variables and their relationship.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
    these functions are no longer needed. If a subsystem has a need to
    deactivate the watchdog temporarily, it should utilize the
    watchdog_suspend() and watchdog_resume() functions.

    [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Remove update_watchdog() and restart_watchdog_hrtimer() since these
    functions are no longer needed. Changes of parameters such as the sample
    period are honored at the time when the watchdog threads are being
    unparked.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • This interface can be utilized to deactivate the hard and soft lockup
    detector temporarily. Callers are expected to minimize the duration of
    deactivation. Multiple deactivations are allowed to occur in parallel but
    should be rare in practice.

    [akpm@linux-foundation.org: remove unneeded static initialization]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Originally watchdog_nmi_enable(cpu) and watchdog_nmi_disable(cpu) were
    only called in watchdog thread context. However, the following commits
    utilize these functions outside of watchdog thread context too.

    commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca
    Author: Michal Hocko
    Date: Tue Sep 24 15:27:30 2013 -0700

    watchdog: update watchdog_thresh properly

    commit b3738d29323344da3017a91010530cf3a58590fc
    Author: Stephane Eranian
    Date: Mon Nov 17 20:07:03 2014 +0100

    watchdog: Add watchdog enable/disable all functions

    Hence, it is now possible that these functions execute concurrently with
    the same 'cpu' argument. This concurrency is problematic because per-cpu
    'watchdog_ev' can be accessed/modified without adequate synchronization.

    The patch series aims to address the above problem. However, instead of
    introducing locks to protect per-cpu 'watchdog_ev' a different approach is
    taken: Invoke these functions by parking and unparking the watchdog
    threads (to ensure they are always called in watchdog thread context).

    static struct smp_hotplug_thread watchdog_threads = {
    ...
    .park = watchdog_disable, // calls watchdog_nmi_disable()
    .unpark = watchdog_enable, // calls watchdog_nmi_enable()
    };

    Both previously mentioned commits call these functions in a similar way
    and thus in principle contain some duplicate code. The patch series also
    avoids this duplication by providing a commonly usable mechanism.

    - Patch 1/4 introduces the watchdog_{park|unpark}_threads functions that
    park/unpark all watchdog threads specified in 'watchdog_cpumask'. They
    are intended to be called inside of kernel/watchdog.c only.

    - Patch 2/4 introduces the watchdog_{suspend|resume} functions which can
    be utilized by external callers to deactivate the hard and soft lockup
    detector temporarily.

    - Patch 3/4 utilizes watchdog_{park|unpark}_threads to replace some code
    that was introduced by commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca.

    - Patch 4/4 utilizes watchdog_{suspend|resume} to replace some code that
    was introduced by commit b3738d29323344da3017a91010530cf3a58590fc.

    A few corner cases should be mentioned here for completeness.

    - kthread_park() of watchdog/N could hang if cpu N is already locked up.
    However, if watchdog is enabled the lockup will be detected anyway.

    - kthread_unpark() of watchdog/N could hang if cpu N got locked up after
    kthread_park(). The occurrence of this scenario should be _very_ rare
    in practice, in particular because it is not expected that temporary
    deactivation will happen frequently, and if it happens at all it is
    expected that the duration of deactivation will be short.

    This patch (of 4): introduce watchdog_park_threads() and watchdog_unpark_threads()

    These functions are intended to be used only from inside kernel/watchdog.c
    to park/unpark all watchdog threads that are specified in
    watchdog_cpumask.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The kernel's NMI watchdog has nothing to do with the watchdog subsystem.
    Its header declarations should be in linux/nmi.h, not linux/watchdog.h.

    The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is
    not configured, one in the include file and one in kernel/watchdog.c.
    Remove the dummy functions from kernel/watchdog.c and use those from the
    include file.

    Signed-off-by: Guenter Roeck
    Cc: Stephane Eranian
    Cc: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guenter Roeck
     
  • housekeeping_mask gathers all the CPUs that aren't part of the nohz_full
    set. This is exactly what we want the watchdog to be affine to without
    the need to use complicated cpumask operations.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • It makes the registration cheaper and simpler for the smpboot per-cpu
    kthread users that don't need to always update the cpumask after threads
    creation.

    [sfr@canb.auug.org.au: fix for allow passing the cpumask on per-cpu thread registration]
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • The per-cpu kthread cleanup() callback is the mirror of the setup()
    callback. When the per-cpu kthread is started, it first calls setup()
    to initialize the resources which are then released by cleanup() when
    the kthread exits.

    Now since the introduction of a per-cpu kthread cpumask, the kthreads
    excluded by the cpumask on boot may happen to be parked immediately
    after their creation without taking the setup() stage, waiting to be
    asked to unpark to do so. Then when smpboot_unregister_percpu_thread()
    is later called, the kthread is stopped without having ever called
    setup().

    But this triggers a bug as the kthread unconditionally calls cleanup()
    on exit but this doesn't mirror any setup(). Thus the kernel crashes
    because we try to free resources that haven't been initialized, as in
    the watchdog case:

    WATCHDOG disable 0
    WATCHDOG disable 1
    WATCHDOG disable 2
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: hrtimer_active+0x26/0x60
    [...]
    Call Trace:
    hrtimer_try_to_cancel+0x1c/0x280
    hrtimer_cancel+0x1d/0x30
    watchdog_disable+0x56/0x70
    watchdog_cleanup+0xe/0x10
    smpboot_thread_fn+0x23c/0x2c0
    kthread+0xf8/0x110
    ret_from_fork+0x3f/0x70

    This bug is currently masked with explicit kthread unparking before
    kthread_stop() on smpboot_destroy_threads(). This forces a call to
    setup() and then unpark().

    We could fix this by unconditionally calling setup() on kthread entry.
    But setup() isn't always cheap. In the case of watchdog it launches
    hrtimer, perf events, etc... So we may as well like to skip it if there
    are chances the kthread will never be used, as in a reduced cpumask value.

    So let's simply do a state machine check before calling cleanup() that
    makes sure setup() has been called before mirroring it.

    And remove the nasty hack workaround.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • The cpumask is allocated before threads get created. If the latter step
    fails, we need to free the cpumask.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Credit where credit is due: this idea comes from Christoph Lameter with
    a lot of valuable input from Serge Hallyn. This patch is heavily based
    on Christoph's patch.

    ===== The status quo =====

    On Linux, there are a number of capabilities defined by the kernel. To
    perform various privileged tasks, processes can wield capabilities that
    they hold.

    Each task has four capability masks: effective (pE), permitted (pP),
    inheritable (pI), and a bounding set (X). When the kernel checks for a
    capability, it checks pE. The other capability masks serve to modify
    what capabilities can be in pE.

    Any task can remove capabilities from pE, pP, or pI at any time. If a
    task has a capability in pP, it can add that capability to pE and/or pI.
    If a task has CAP_SETPCAP, then it can add any capability to pI, and it
    can remove capabilities from X.

    Tasks are not the only things that can have capabilities; files can also
    have capabilities. A file can have no capabilty information at all [1].
    If a file has capability information, then it has a permitted mask (fP)
    and an inheritable mask (fI) as well as a single effective bit (fE) [2].
    File capabilities modify the capabilities of tasks that execve(2) them.

    A task that successfully calls execve has its capabilities modified for
    the file ultimately being excecuted (i.e. the binary itself if that
    binary is ELF or for the interpreter if the binary is a script.) [3] In
    the capability evolution rules, for each mask Z, pZ represents the old
    value and pZ' represents the new value. The rules are:

    pP' = (X & fP) | (pI & fI)
    pI' = pI
    pE' = (fE ? pP' : 0)
    X is unchanged

    For setuid binaries, fP, fI, and fE are modified by a moderately
    complicated set of rules that emulate POSIX behavior. Similarly, if
    euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
    (primary, fP and fI usually end up being the full set). For nonroot
    users executing binaries with neither setuid nor file caps, fI and fP
    are empty and fE is false.

    As an extra complication, if you execute a process as nonroot and fE is
    set, then the "secure exec" rules are in effect: AT_SECURE gets set,
    LD_PRELOAD doesn't work, etc.

    This is rather messy. We've learned that making any changes is
    dangerous, though: if a new kernel version allows an unprivileged
    program to change its security state in a way that persists cross
    execution of a setuid program or a program with file caps, this
    persistent state is surprisingly likely to allow setuid or file-capped
    programs to be exploited for privilege escalation.

    ===== The problem =====

    Capability inheritance is basically useless.

    If you aren't root and you execute an ordinary binary, fI is zero, so
    your capabilities have no effect whatsoever on pP'. This means that you
    can't usefully execute a helper process or a shell command with elevated
    capabilities if you aren't root.

    On current kernels, you can sort of work around this by setting fI to
    the full set for most or all non-setuid executable files. This causes
    pP' = pI for nonroot, and inheritance works. No one does this because
    it's a PITA and it isn't even supported on most filesystems.

    If you try this, you'll discover that every nonroot program ends up with
    secure exec rules, breaking many things.

    This is a problem that has bitten many people who have tried to use
    capabilities for anything useful.

    ===== The proposed change =====

    This patch adds a fifth capability mask called the ambient mask (pA).
    pA does what most people expect pI to do.

    pA obeys the invariant that no bit can ever be set in pA if it is not
    set in both pP and pI. Dropping a bit from pP or pI drops that bit from
    pA. This ensures that existing programs that try to drop capabilities
    still do so, with a complication. Because capability inheritance is so
    broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
    then calling execve effectively drops capabilities. Therefore,
    setresuid from root to nonroot conditionally clears pA unless
    SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
    re-add bits to pA afterwards.

    The capability evolution rules are changed:

    pA' = (file caps or setuid or setgid ? 0 : pA)
    pP' = (X & fP) | (pI & fI) | pA'
    pI' = pI
    pE' = (fE ? pP' : pA')
    X is unchanged

    If you are nonroot but you have a capability, you can add it to pA. If
    you do so, your children get that capability in pA, pP, and pE. For
    example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
    automatically bind low-numbered ports. Hallelujah!

    Unprivileged users can create user namespaces, map themselves to a
    nonzero uid, and create both privileged (relative to their namespace)
    and unprivileged process trees. This is currently more or less
    impossible. Hallelujah!

    You cannot use pA to try to subvert a setuid, setgid, or file-capped
    program: if you execute any such program, pA gets cleared and the
    resulting evolution rules are unchanged by this patch.

    Users with nonzero pA are unlikely to unintentionally leak that
    capability. If they run programs that try to drop privileges, dropping
    privileges will still work.

    It's worth noting that the degree of paranoia in this patch could
    possibly be reduced without causing serious problems. Specifically, if
    we allowed pA to persist across executing non-pA-aware setuid binaries
    and across setresuid, then, naively, the only capabilities that could
    leak as a result would be the capabilities in pA, and any attacker
    *already* has those capabilities. This would make me nervous, though --
    setuid binaries that tried to privilege-separate might fail to do so,
    and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
    unexpected side effects. (Whether these unexpected side effects would
    be exploitable is an open question.) I've therefore taken the more
    paranoid route. We can revisit this later.

    An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
    ambient capabilities. I think that this would be annoying and would
    make granting otherwise unprivileged users minor ambient capabilities
    (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
    it is with this patch.

    ===== Footnotes =====

    [1] Files that are missing the "security.capability" xattr or that have
    unrecognized values for that xattr end up with has_cap set to false.
    The code that does that appears to be complicated for no good reason.

    [2] The libcap capability mask parsers and formatters are dangerously
    misleading and the documentation is flat-out wrong. fE is *not* a mask;
    it's a single bit. This has probably confused every single person who
    has tried to use file capabilities.

    [3] Linux very confusingly processes both the script and the interpreter
    if applicable, for reasons that elude me. The results from thinking
    about a script's file capabilities and/or setuid bits are mostly
    discarded.

    Preliminary userspace code is here, but it needs updating:
    https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

    Here is a test program that can be used to verify the functionality
    (from Christoph):

    /*
    * Test program for the ambient capabilities. This program spawns a shell
    * that allows running processes with a defined set of capabilities.
    *
    * (C) 2015 Christoph Lameter
    * Released under: GPL v3 or later.
    *
    *
    * Compile using:
    *
    * gcc -o ambient_test ambient_test.o -lcap-ng
    *
    * This program must have the following capabilities to run properly:
    * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
    *
    * A command to equip the binary with the right caps is:
    *
    * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
    *
    *
    * To get a shell with additional caps that can be inherited by other processes:
    *
    * ./ambient_test /bin/bash
    *
    *
    * Verifying that it works:
    *
    * From the bash spawed by ambient_test run
    *
    * cat /proc/$$/status
    *
    * and have a look at the capabilities.
    */

    #include
    #include
    #include
    #include
    #include
    #include

    /*
    * Definitions from the kernel header files. These are going to be removed
    * when the /usr/include files have these defined.
    */
    #define PR_CAP_AMBIENT 47
    #define PR_CAP_AMBIENT_IS_SET 1
    #define PR_CAP_AMBIENT_RAISE 2
    #define PR_CAP_AMBIENT_LOWER 3
    #define PR_CAP_AMBIENT_CLEAR_ALL 4

    static void set_ambient_cap(int cap)
    {
    int rc;

    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
    printf("Cannot add inheritable cap\n");
    exit(2);
    }
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
    perror("Cannot set cap");
    exit(1);
    }
    }

    int main(int argc, char **argv)
    {
    int rc;

    set_ambient_cap(CAP_NET_RAW);
    set_ambient_cap(CAP_NET_ADMIN);
    set_ambient_cap(CAP_SYS_NICE);

    printf("Ambient_test forking shell\n");
    if (execv(argv[1], argv + 1))
    perror("Cannot exec");

    return 0;
    }

    Signed-off-by: Christoph Lameter # Original author
    Signed-off-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • - Make it clear that the `node' arg refers to memory allocations only:
    kthread_create_on_node() does not pin the new thread to that node's
    CPUs.

    - Encourage the use of NUMA_NO_NODE.

    [nzimmer@sgi.com: use NUMA_NO_NODE in kthread_create() also]
    Cc: Nathan Zimmer
    Cc: Tejun Heo
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

04 Sep, 2015

1 commit

  • Pull locking and atomic updates from Ingo Molnar:
    "Main changes in this cycle are:

    - Extend atomic primitives with coherent logic op primitives
    (atomic_{or,and,xor}()) and deprecate the old partial APIs
    (atomic_{set,clear}_mask())

    The old ops were incoherent with incompatible signatures across
    architectures and with incomplete support. Now every architecture
    supports the primitives consistently (by Peter Zijlstra)

    - Generic support for 'relaxed atomics':

    - _acquire/release/relaxed() flavours of xchg(), cmpxchg() and {add,sub}_return()
    - atomic_read_acquire()
    - atomic_set_release()

    This came out of porting qwrlock code to arm64 (by Will Deacon)

    - Clean up the fragile static_key APIs that were causing repeat bugs,
    by introducing a new one:

    DEFINE_STATIC_KEY_TRUE(name);
    DEFINE_STATIC_KEY_FALSE(name);

    which define a key of different types with an initial true/false
    value.

    Then allow:

    static_branch_likely()
    static_branch_unlikely()

    to take a key of either type and emit the right instruction for the
    case. To be able to know the 'type' of the static key we encode it
    in the jump entry (by Peter Zijlstra)

    - Static key self-tests (by Jason Baron)

    - qrwlock optimizations (by Waiman Long)

    - small futex enhancements (by Davidlohr Bueso)

    - ... and misc other changes"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
    jump_label/x86: Work around asm build bug on older/backported GCCs
    locking, ARM, atomics: Define our SMP atomics in terms of _relaxed() operations
    locking, include/llist: Use linux/atomic.h instead of asm/cmpxchg.h
    locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics
    locking/qrwlock: Implement queue_write_unlock() using smp_store_release()
    locking/lockref: Remove homebrew cmpxchg64_relaxed() macro definition
    locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'
    locking, asm-generic: Rework atomic-long.h to avoid bulk code duplication
    locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations
    locking, compiler.h: Cast away attributes in the WRITE_ONCE() magic
    locking/static_keys: Make verify_keys() static
    jump label, locking/static_keys: Update docs
    locking/static_keys: Provide a selftest
    jump_label: Provide a self-test
    s390/uaccess, locking/static_keys: employ static_branch_likely()
    x86, tsc, locking/static_keys: Employ static_branch_likely()
    locking/static_keys: Add selftest
    locking/static_keys: Add a new static_key interface
    locking/static_keys: Rework update logic
    locking/static_keys: Add static_key_{en,dis}able() helpers
    ...

    Linus Torvalds
     

03 Sep, 2015

2 commits

  • Pull networking updates from David Miller:
    "Another merge window, another set of networking changes. I've heard
    rumblings that the lightweight tunnels infrastructure has been voted
    networking change of the year. But what do I know?

    1) Add conntrack support to openvswitch, from Joe Stringer.

    2) Initial support for VRF (Virtual Routing and Forwarding), which
    allows the segmentation of routing paths without using multiple
    devices. There are some semantic kinks to work out still, but
    this is a reasonably strong foundation. From David Ahern.

    3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov.

    4) Ignore route nexthops with a link down state in ipv6, just like
    ipv4. From Andy Gospodarek.

    5) Remove spinlock from fast path of act_gact and act_mirred, from
    Eric Dumazet.

    6) Document the DSA layer, from Florian Fainelli.

    7) Add netconsole support to bcmgenet, systemport, and DSA. Also
    from Florian Fainelli.

    8) Add Mellanox Switch Driver and core infrastructure, from Jiri
    Pirko.

    9) Add support for "light weight tunnels", which allow for
    encapsulation and decapsulation without bearing the overhead of a
    full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
    others.

    10) Add Identifier Locator Addressing support for ipv6, from Tom
    Herbert.

    11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

    12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

    13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

    14) Stop using a zero TX queue length to mean that a device shouldn't
    have a qdisc attached, use an explicit flag instead. From Phil
    Sutter.

    15) Use generic geneve netdevice infrastructure in openvswitch, from
    Pravin B Shelar.

    16) Add infrastructure to avoid re-forwarding a packet in software
    that was already forwarded by a hardware switch. From Scott
    Feldman.

    17) Allow AF_PACKET fanout function to be implemented in a bpf
    program, from Willem de Bruijn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
    netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
    netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
    net: fec: clear receive interrupts before processing a packet
    ipv6: fix exthdrs offload registration in out_rt path
    xen-netback: add support for multicast control
    bgmac: Update fixed_phy_register()
    sock, diag: fix panic in sock_diag_put_filterinfo
    flow_dissector: Use 'const' where possible.
    flow_dissector: Fix function argument ordering dependency
    ixgbe: Resolve "initialized field overwritten" warnings
    ixgbe: Remove bimodal SR-IOV disabling
    ixgbe: Add support for reporting 2.5G link speed
    ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
    ixgbe: support for ethtool set_rxfh
    ixgbe: Avoid needless PHY access on copper phys
    ixgbe: cleanup to use cached mask value
    ixgbe: Remove second instance of lan_id variable
    ixgbe: use kzalloc for allocating one thing
    flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
    ixgbe: Remove unused PCI bus types
    ...

    Linus Torvalds
     
  • The commit a4543a2fa9ef31 "ring-buffer: Get timestamp after event is
    allocated" is needed for some future work. But after adding it, there is a
    race somewhere that causes the saved timestamp to have a slight shift, and
    get ahead of the actual timestamp and make it look like time goes backwards.

    I'm still looking into why this happens, but in the mean time, this is
    holding up other work to get in. I'm reverting the change for now (which
    makes the problem go away), and will add it back after I know what is wrong
    and fix it.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)