09 Sep, 2015

7 commits

  • Pull core kbuild updates from Michal Marek:
    - modpost portability fix
    - linker script fix
    - genksyms segfault fix
    - fixdep cleanup
    - fix for clang detection

    * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    kbuild: Fix clang detection
    kbuild: fixdep: drop meaningless hash table initialization
    kbuild: fixdep: optimize code slightly
    genksyms: Regenerate parser
    genksyms: Duplicate function pointer type definitions segfault
    kbuild: Fix .text.unlikely placement
    Avoid conflict with host definitions when cross-compiling

    Linus Torvalds
     
  • Pull tracing update from Steven Rostedt:
    "Mostly this is just clean ups and micro optimizations.

    The changes with more meat are:

    - Allowing the trace event filters to filter on CPU number and
    process ids

    - Two new markers for trace output latency were added (10 and 100
    msec latencies)

    - Have tracing_thresh filter function profiling time

    I also worked on modifying the ring buffer code for some future work,
    and moved the adding of the timestamp around. One of my changes
    caused a regression, and since other changes were built on top of it
    and already tested, I had to revert that change. Instead
    of rebasing, this change set has the code that caused a regression as
    well as the code to revert that change without touching the other
    changes that were made on top of it"

    * tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
    tracing: Don't make assumptions about length of string on task rename
    tracing: Allow triggers to filter for CPU ids and process names
    ftrace: Format MCOUNT_ADDR address as type unsigned long
    tracing: Introduce two additional marks for delay
    ftrace: Fix function_graph duration spacing with 7-digits
    ftrace: add tracing_thresh to function profile
    tracing: Clean up stack tracing and fix fentry updates
    ring-buffer: Reorganize function locations
    ring-buffer: Make sure event has enough room for extend and padding
    ring-buffer: Get timestamp after event is allocated
    ring-buffer: Move the adding of the extended timestamp out of line
    ring-buffer: Add event descriptor to simplify passing data
    ftrace: correct the counter increment for trace_buffer data
    tracing: Fix for non-continuous cpu ids
    tracing: Prefer kcalloc over kzalloc with multiply

    Linus Torvalds
     
  • Pull audit update from Paul Moore:
    "This is one of the larger audit patchsets in recent history,
    consisting of eight patches and almost 400 lines of changes.

    The bulk of the patchset is the new "audit by executable"
    functionality which allows admins to set an audit watch based on the
    executable on disk. Prior to this, admins could only track an
    application by PID, which has some obvious limitations.

    Beyond the new functionality we also have some refcnt fixes and a few
    minor cleanups"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    fixup: audit: implement audit by executable
    audit: implement audit by executable
    audit: clean simple fsnotify implementation
    audit: use macros for unset inode and device values
    audit: make audit_del_rule() more robust
    audit: fix uninitialized variable in audit_add_rule()
    audit: eliminate unnecessary extra layer of watch parent references
    audit: eliminate unnecessary extra layer of watch references

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel package on Fedora **

    - Smack:
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha384 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     
  • Pull NMI backtrace update from Russell King:
    "These changes convert the x86 NMI handling to be a library
    implementation which other architectures can make use of. Thomas
    Gleixner has reviewed and tested these changes, and wishes me to send
    these rather than taking them through the tip tree.

    The final patch in the set adds an initial implementation using this
    infrastructure to ARM, even though it doesn't send the IPI at "NMI"
    level. Patches are in progress to add the ARM equivalent of NMI, but
    we still need the IRQ-level fallback for systems where the "NMI" isn't
    available due to secure firmware denying access to it"

    * 'nmi' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
    ARM: add basic support for on-demand backtrace of other CPUs
    nmi: x86: convert to generic nmi handler
    nmi: create generic NMI backtrace implementation

    Linus Torvalds
     
  • Pull xen updates from David Vrabel:
    "Xen features and fixes for 4.3:

    - Convert xen-blkfront to the multiqueue API
    - [arm] Support binding event channels to different VCPUs.
    - [x86] Support > 512 GiB in PV guests (off by default as such a
    guest cannot be migrated with the current toolstack).
    - [x86] PMU support for PV dom0 (limited support for using perf with
    Xen and other guests)"

    * tag 'for-linus-4.3-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (33 commits)
    xen: switch extra memory accounting to use pfns
    xen: limit memory to architectural maximum
    xen: avoid another early crash of memory limited dom0
    xen: avoid early crash of memory limited dom0
    arm/xen: Remove helpers which are PV specific
    xen/x86: Don't try to set PCE bit in CR4
    xen/PMU: PMU emulation code
    xen/PMU: Intercept PMU-related MSR and APIC accesses
    xen/PMU: Describe vendor-specific PMU registers
    xen/PMU: Initialization code for Xen PMU
    xen/PMU: Sysfs interface for setting Xen PMU mode
    xen: xensyms support
    xen: remove no longer needed p2m.h
    xen: allow more than 512 GB of RAM for 64 bit pv-domains
    xen: move p2m list if conflicting with e820 map
    xen: add explicit memblock_reserve() calls for special pages
    mm: provide early_memremap_ro to establish read-only mapping
    xen: check for initrd conflicting with e820 map
    xen: check pre-allocated page tables for conflict with memory map
    xen: check for kernel memory conflicting with memory layout
    ...

    Linus Torvalds
     
  • Pull more irq updates from Thomas Gleixner:
    "The second part of irq related updates:

    - Provide EOImode for GIC[V3] irq chips, which is a prerequisite for
    direct interrupt handling in [KVM] guests"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/GIC: Fix EOImode setting for non-DT/ACPI systems
    irqchip/GIC: Don't deactivate interrupts forwarded to a guest
    irqchip/GIC: Convert to EOImode == 1
    irqchip/GICv3: Don't deactivate interrupts forwarded to a guest
    irqchip/GICv3: Convert to EOImode == 1

    Linus Torvalds
     

08 Sep, 2015

3 commits

  • Instead of using physical addresses for accounting of extra memory
    areas available for ballooning, switch to pfns, as this is much less
    error prone regarding partial pages.

    Reported-by: Roger Pau Monné
    Tested-by: Roger Pau Monné
    Signed-off-by: Juergen Gross
    Signed-off-by: David Vrabel

    Juergen Gross
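
The partial-page hazard mentioned in this commit can be sketched in a few lines of userspace C. This is only an illustration, not the Xen code: the 4 KiB page size and the helper names pfn_up(), pfn_down() and usable_pages() are assumptions made for the example. A byte range that starts or ends in the middle of a page contains fewer whole pages than its byte length suggests, which is the kind of mistake that accounting directly in pfns avoids.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Round a physical address up to the next page frame number. */
uint64_t pfn_up(uint64_t paddr)
{
	return (paddr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Round a physical address down to its page frame number. */
uint64_t pfn_down(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT;
}

/*
 * Number of whole pages inside the byte range [start, end).
 * Partial pages at either edge are excluded.
 */
uint64_t usable_pages(uint64_t start, uint64_t end)
{
	uint64_t first = pfn_up(start);
	uint64_t last = pfn_down(end);

	assert(start <= end);
	return last > first ? last - first : 0;
}
```

For the range 0x1800..0x3800 the byte length is two pages' worth, yet only one whole page (pfn 2) lies inside it, so usable_pages() returns 1.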
     
  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix atomicity of pNFS commit list updates
    - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
    - nfs_set_pgio_error sometimes misses errors
    - Fix a thinko in xs_connect()
    - Fix borkage in _same_data_server_addrs_locked()
    - Fix a NULL pointer dereference of migration recovery ops for v4.2
    client
    - Don't let the ctime override attribute barriers.
    - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
    - Ensure flexfiles pNFS driver updates the inode after write finishes
    - flexfiles must not pollute the attribute cache with attributes from
    the DS
    - Fix a protocol error in layoutreturn
    - Fix a protocol issue with NFSv4.1 CLOSE stateids

    Bugfixes + cleanups
    - pNFS blocks bugfixes from Christoph
    - Various cleanups from Anna
    - More fixes for delegation corner cases
    - Don't fsync twice for O_SYNC/IS_SYNC files
    - Fix pNFS and flexfiles layoutstats bugs
    - pnfs/flexfiles: avoid duplicate tracking of mirror data
    - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
    issues
    - pnfs/flexfiles: error handling retries a layoutget before fallback
    to MDS

    Features:
    - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
    Kinglong
    - More RDMA client transport improvements from Chuck
    - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
    verbs from the SUNRPC, Lustre and core infiniband tree.
    - Optimise away the close-to-open getattr if there is no cached data"

    * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
    NFSv4: Respect the server imposed limit on how many changes we may cache
    NFSv4: Express delegation limit in units of pages
    Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
    NFS: Optimise away the close-to-open getattr if there is no cached data
    NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
    NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
    nfs: Remove unneeded checking of the return value from scnprintf
    nfs: Fix truncated client owner id without proto type
    NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
    NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
    NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
    NFSv4.1/flexfiles: Fix freeing of mirrors
    NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
    NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
    NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
    NFSv4.1: Fix a protocol issue with CLOSE stateids
    NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
    SUNRPC: Prevent SYN+SYNACK+RST storms
    SUNRPC: xs_reset_transport must mark the connection as disconnected
    NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
    ...

    Linus Torvalds
     
  • Since we're tracking modifications to the page cache on a per-page
    basis, it makes sense to express the limit to how much we may cache
    in units of pages.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

06 Sep, 2015

5 commits

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     
  • Pull media updates from Mauro Carvalho Chehab:
    - new DVB frontend drivers: ascot2e, cxd2841er, horus3a, lnbh25
    - new HDMI capture driver: tc358743
    - new driver for new NetUP DVB boards (netup_unidvb)
    - IR support for DVBSky cards (smipcie-ir)
    - Coda driver has gained macroblock tiling support
    - Renesas R-Car gains JPEG codec driver
    - new DVB platform driver for STi boards: c8sectpfe
    - added documentation for the media core kABI to device-drivers DocBook
    - lots of driver fixups, cleanups and improvements

    * tag 'media/v4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (297 commits)
    [media] c8sectpfe: Remove select on undefined LIBELF_32
    [media] i2c: fix platform_no_drv_owner.cocci warnings
    [media] cx231xx: Use wake_up_interruptible() instead of wake_up_interruptible_nr()
    [media] tc358743: only queue subdev notifications if devnode is set
    [media] tc358743: add missing Kconfig dependency/select
    [media] c8sectpfe: Use %pad to print 'dma_addr_t'
    [media] DocBook media: Fix typo "the the" in xml files
    [media] tc358743: make reset gpio optional
    [media] tc358743: set direction of reset gpio using devm_gpiod_get
    [media] dvbdev: document most of the functions/data structs
    [media] dvb_frontend.h: document the struct dvb_frontend
    [media] dvb-frontend.h: document struct dtv_frontend_properties
    [media] dvb-frontend.h: document struct dvb_frontend_ops
    [media] dvb: Use DVBFE_ALGO_HW where applicable
    [media] dvb_frontend.h: document struct analog_demod_ops
    [media] dvb_frontend.h: Document struct dvb_tuner_ops
    [media] Docbook: Document struct analog_parameters
    [media] dvb_frontend.h: get rid of dvbfe_modcod
    [media] add documentation for struct dvb_tuner_info
    [media] dvb_frontend: document dvb_frontend_tune_settings
    ...

    Linus Torvalds
     
  • Pull mailbox updates from Jassi Brar:
    "Mainly we move from jiffy based timer to HRTIMER for finer control
    over polling. Then a controller reduces its polling period from 10 to
    1ms"

    * 'mailbox-for-next' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
    mailbox: arm_mhu: reduce txpoll_period from 10ms to 1 ms
    mailbox: switch to hrtimer for tx_complete polling
    mailbox: Drop owner assignment from platform_driver

    Linus Torvalds
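
Why a tick-based timer limits the polling period can be sketched as follows. This is a userspace illustration under stated assumptions, not the mailbox code: HZ=100 is assumed (the real value is a kernel configuration choice), the helper names ms_to_ticks() and tick_timer_period_ms() are hypothetical, and the kernel's own conversion differs in detail. With a 10 ms tick, a request for 1 ms is rounded up to a full tick, so the shorter period only becomes meaningful with a finer-grained timer.

```c
#include <assert.h>

/* Assumption for this sketch: a kernel built with HZ=100 (10 ms ticks). */
#define HZ 100
#define MSEC_PER_SEC 1000

/* Round a millisecond period up to whole timer ticks (at least one). */
unsigned int ms_to_ticks(unsigned int ms)
{
	unsigned int ticks = (ms * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;

	return ticks ? ticks : 1;
}

/* Period a tick-based poll timer actually delivers, in milliseconds. */
unsigned int tick_timer_period_ms(unsigned int requested_ms)
{
	assert(MSEC_PER_SEC % HZ == 0);
	return ms_to_ticks(requested_ms) * (MSEC_PER_SEC / HZ);
}
```

Under these assumptions, tick_timer_period_ms(1) and tick_timer_period_ms(10) both yield 10, so a 1 ms request and a 10 ms request are indistinguishable to a tick-based timer.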
     
  • Pull nfsd updates from Bruce Fields:
    "Nothing major, but:

    - Add Jeff Layton as an nfsd co-maintainer: no change to existing
    practice, just an acknowledgement of the status quo.

    - Two patches ("nfsd: ensure that...") for a race overlooked by the
    state locking rewrite, causing a crash noticed by multiple users.

    - Lots of smaller bugfixes all over from Kinglong Mee.

    - From Jeff, some cleanup of server rpc code in preparation for
    possible shift of nfsd threads to workqueues"

    * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits)
    nfsd: deal with DELEGRETURN racing with CB_RECALL
    nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
    nfsd: ensure that delegation stateid hash references are only put once
    nfsd: ensure that the ol stateid hash reference is only put once
    net: sunrpc: fix tracepoint Warning: unknown op '->'
    nfsd: allow more than one laundry job to run at a time
    nfsd: don't WARN/backtrace for invalid container deployment.
    fs: fix fs/locks.c kernel-doc warning
    nfsd: Add Jeff Layton as co-maintainer
    NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
    NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
    nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
    nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
    NFSD: Store parent's stat in a separate value
    nfsd: Fix two typos in comments
    lockd: NLM grace period shouldn't block NFSv4 opens
    nfsd: include linux/nfs4.h in export.h
    sunrpc: Switch to using hash list instead single list
    sunrpc/nfsd: Remove redundant code by exports seq_operations functions
    sunrpc: Store cache_detail in seq_file's private directly
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - a few misc things

    - Andy's "ambient capabilities"

    - fs/notify updates

    - the ocfs2 queue

    - kernel/watchdog.c updates and feature work.

    - some of MM. Includes Andrea's userfaultfd feature.

    [ Hadn't noticed that userfaultfd was 'default y' when applying the
    patches, so that got fixed in this merge instead. We do _not_ mark
    new features that nobody uses yet 'default y' - Linus ]

    * emailed patches from Andrew Morton: (118 commits)
    mm/hugetlb.c: make vma_has_reserves() return bool
    mm/madvise.c: make madvise_behaviour_valid() return bool
    mm/memory.c: make tlb_next_batch() return bool
    mm/dmapool.c: change is_page_busy() return from int to bool
    mm: remove struct node_active_region
    mremap: simplify the "overlap" check in mremap_to()
    mremap: don't do uneccesary checks if new_len == old_len
    mremap: don't do mm_populate(new_addr) on failure
    mm: move ->mremap() from file_operations to vm_operations_struct
    mremap: don't leak new_vma if f_op->mremap() fails
    mm/hugetlb.c: make vma_shareable() return bool
    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    mm: fix status code which move_pages() returns for zero page
    mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout()
    genalloc: add support of multiple gen_pools per device
    genalloc: add name arg to gen_pool_get() and devm_gen_pool_create()
    mm/memblock: WARN_ON when nid differs from overlap region
    Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
    mm: defer flush of writable TLB entries
    mm: send one IPI per CPU to TLB flush all entries after unmapping pages
    ...

    Linus Torvalds
     

05 Sep, 2015

25 commits

  • struct node_active_region is not used anymore. Remove it.

    Signed-off-by: minkyung88.kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    minkyung88.kim
     
  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This change fills devm_gen_pool_create()/gen_pool_get() "name" argument
    stub with contents and extends of_gen_pool_get() functionality on this
    basis.

    If there is no associated platform device with a device node passed to
    of_gen_pool_get(), the function attempts to get a label property or device
    node name (= repeats MTD OF partition standard) and seeks for a named
    gen_pool registered by device of the parent device node.

    The main idea of the change is to allow registration of independent
    gen_pools under the same umbrella device, say "partitions" on "storage
    device", the original functionality of one "partition" per "storage
    device" is untouched.

    [akpm@linux-foundation.org: fix constness in devres_find()]
    [dan.carpenter@oracle.com: freeing const data pointers]
    Signed-off-by: Vladimir Zapolskiy
    Cc: Philipp Zabel
    Cc: Greg Kroah-Hartman
    Cc: Russell King
    Cc: Nicolas Ferre
    Cc: Alexandre Belloni
    Cc: Jean-Christophe Plagniol-Villard
    Cc: Shawn Guo
    Cc: Sascha Hauer
    Cc: Mauro Carvalho Chehab
    Cc: Arnd Bergmann
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Zapolskiy
     
  • This change modifies gen_pool_get() and devm_gen_pool_create() client
    interfaces adding one more argument "name" of a gen_pool object.

    Because of its implementation, gen_pool_get() can retrieve only one
    gen_pool associated with a device even if multiple gen_pools are created.
    Fortunately this is sufficient for the current clients, hence NULL is
    provided as a valid argument on both the producer devm_gen_pool_create()
    and consumer gen_pool_get() sides.

    Because only one created gen_pool per device is addressable, explicitly
    add a restriction to devm_gen_pool_create() to create only one gen_pool
    per device. This implies two possible error codes returned by the
    function, which are accounted for on the client side (only misc/sram).
    This completes client side changes related to genalloc updates.

    [akpm@linux-foundation.org: gen_pool_get() cleanup]
    Signed-off-by: Vladimir Zapolskiy
    Cc: Philipp Zabel
    Cc: Greg Kroah-Hartman
    Cc: Russell King
    Cc: Nicolas Ferre
    Cc: Alexandre Belloni
    Cc: Jean-Christophe Plagniol-Villard
    Cc: Shawn Guo
    Cc: Sascha Hauer
    Cc: Mauro Carvalho Chehab
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Zapolskiy
     
  • If a PTE is unmapped and it's dirty then it was writable recently. Due to
    deferred TLB flushing, it's best to assume a writable TLB cache entry
    exists. With that assumption, the TLB must be flushed before any IO can
    start or the page is freed to avoid lost writes or data corruption. This
    patch defers flushing of potentially writable TLBs as long as possible.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • An IPI is sent to flush remote TLBs when a page is unmapped that was
    potentially accessed by other CPUs. There are many circumstances where
    this happens but the obvious one is kswapd reclaiming pages belonging to a
    running process as kswapd and the task are likely running on separate
    CPUs.

    On small machines, this is not a significant problem but as machines get
    larger with more cores and more memory, the cost of these IPIs can be
    high. This patch uses a simple structure that tracks CPUs that
    potentially have TLB entries for pages being unmapped. When the unmapping
    is complete, the full TLB is flushed on the assumption that a refill cost
    is lower than flushing individual entries.

    Architectures wishing to do this must give the following guarantee.

    If a clean page is unmapped and not immediately flushed, the
    architecture must guarantee that a write to that linear address
    from a CPU with a cached TLB entry will trap a page fault.

    This is essentially what the kernel already depends on but the window is
    much larger with this patch applied and is worth highlighting. The
    architecture should consider whether the cost of the full TLB flush is
    higher than sending an IPI to flush each individual entry. An additional
    architecture helper called flush_tlb_local is required. It's a trivial
    wrapper with some accounting in the x86 case.

    The impact of this patch depends on the workload as measuring any benefit
    requires both mapped pages co-located on the LRU and memory pressure. The
    case with the biggest impact is multiple processes reading mapped pages
    taken from the vm-scalability test suite. The test case uses NR_CPU
    readers of mapped files that consume 10*RAM.

    Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                              4.2.0-rc1           4.2.0-rc1
                                                vanilla        flushfull-v7
    Ops lru-file-mmap-read-elapsed     159.62 (  0.00%)    120.68 ( 24.40%)
    Ops lru-file-mmap-read-time_range   30.59 (  0.00%)      2.80 ( 90.85%)
    Ops lru-file-mmap-read-time_stddv    6.70 (  0.00%)      0.64 ( 90.38%)

                4.2.0-rc1     4.2.0-rc1
                  vanilla  flushfull-v7
    User           581.00        611.43
    System        5804.93       4111.76
    Elapsed        161.03        122.12

    This is showing that the readers completed 24.40% faster with 29% less
    system CPU time. From vmstats, it is known that the vanilla kernel was
    interrupted roughly 900K times per second during the steady phase of the
    test and the patched kernel was interrupted 180K times per second.

    The impact is lower on a single socket machine.

                                              4.2.0-rc1           4.2.0-rc1
                                                vanilla        flushfull-v7
    Ops lru-file-mmap-read-elapsed      25.33 (  0.00%)     20.38 ( 19.54%)
    Ops lru-file-mmap-read-time_range    0.91 (  0.00%)      1.44 (-58.24%)
    Ops lru-file-mmap-read-time_stddv    0.28 (  0.00%)      0.47 (-65.34%)

                4.2.0-rc1     4.2.0-rc1
                  vanilla  flushfull-v7
    User            58.09         57.64
    System         111.82         76.56
    Elapsed         27.29         22.55

    It's still a noticeable improvement with vmstat showing interrupts went
    from roughly 500K per second to 45K per second.

    The patch will have no impact on workloads with no memory pressure or that
    have relatively few mapped pages. It will have an unpredictable impact on the
    workload running on the CPU being flushed as it'll depend on how many TLB
    entries need to be refilled and how long that takes. Worst case, the TLB
    will be completely cleared of active entries when the target PFNs were not
    resident at all.

    [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Sasha Levin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
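
The batching idea described in this commit can be modelled in userspace C. This is a rough sketch under stated assumptions, not the kernel implementation: struct tlb_batch, batch_add_page(), batch_flush() and unbatched_ipis() are hypothetical names, at most 64 CPUs are assumed so that a single 64-bit word can serve as the CPU mask, and an "IPI" is just a counter increment. It compares flushing after every unmapped page with recording the CPUs that may hold an entry and flushing each of them once at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: at most 64 CPUs, one bit per CPU. */
#define MAX_CPUS 64

/* Hypothetical stand-in for the tracking structure the patch describes. */
struct tlb_batch {
	uint64_t cpumask;	/* CPUs that may still cache a stale entry */
	unsigned int ipis_sent;	/* accounting for this example only */
};

/* Record that a page just unmapped may be cached in the TLBs of @cpus. */
void batch_add_page(struct tlb_batch *b, uint64_t cpus)
{
	assert(b);
	b->cpumask |= cpus;
}

/* Send one full-flush IPI per recorded CPU, then reset the batch. */
void batch_flush(struct tlb_batch *b)
{
	assert(b);
	for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++) {
		if (b->cpumask & (1ULL << cpu))
			b->ipis_sent++;	/* stands in for a real IPI */
	}
	b->cpumask = 0;
}

/* Baseline: flush every page individually on each CPU that may map it. */
unsigned int unbatched_ipis(const uint64_t *page_cpus, unsigned int npages)
{
	unsigned int ipis = 0;

	for (unsigned int i = 0; i < npages; i++)
		for (unsigned int cpu = 0; cpu < MAX_CPUS; cpu++)
			if (page_cpus[i] & (1ULL << cpu))
				ipis++;
	return ipis;
}
```

For three pages whose possible CPUs are {0,1}, {0,1} and {1}, the per-page baseline counts five IPIs while the batched path counts two, one per CPU. The model ignores the cost the commit message warns about: each batched IPI flushes the whole TLB, so the target CPU pays for refilling entries that were unrelated to the unmapped pages.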
     
  • When unmapping pages it is necessary to flush the TLB. If that page was
    accessed by another CPU then an IPI is used to flush the remote CPU. That
    is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
    second.

    There already is a window between when a page is unmapped and when it is
    TLB flushed. This series increases the window so multiple pages can be
    flushed using a single IPI. This should be safe or the kernel is hosed
    already.

    Patch 1 simply made the rest of the series easier to write as ftrace
    could identify all the senders of TLB flush IPIs.

    Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
    to flush the entire TLB.

    Patch 3 tracks when there potentially are writable TLB entries that
    need to be batched differently

    Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes

    The performance impact is documented in the changelogs but in the optimistic
    case on a 4-socket machine the full series reduces interrupts from 900K
    interrupts/second to 60K interrupts/second.

    This patch (of 4):

    It is easy to trace when an IPI is received to flush a TLB but harder to
    detect what event sent it. This patch makes it easy to identify the
    source of IPIs being transmitted for TLB flushes on x86.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Dave Hansen
    Acked-by: Ingo Molnar
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This implements mcopy_atomic and mfill_zeropage that are the lowlevel
    VM methods that are invoked respectively by the UFFDIO_COPY and
    UFFDIO_ZEROPAGE userfaultfd commands.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This activates the userfaultfd syscall.

    [sfr@canb.auug.org.au: activate syscall fix]
    [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • I had requests to return the full address (not the page aligned one) to
    userland.

    It's not entirely clear how the page offset could be relevant because
    userfaults aren't like SIGBUS, which can sigjump to a different place and
    actually skip resolving the fault depending on a page offset. There's
    currently no real way to skip the fault especially because after a
    UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
    kernel without having to return to userland first (not even self modifying
    code replacing the .text that touched the faulting address would prevent
    the fault to be repeated). Userland cannot skip repeating the fault even
    more so if the fault was triggered by a KVM secondary page fault or any
    get_user_pages or any copy-user inside some syscall which will return to
    kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
    to a SIGBUS being raised because the userfault can't wait if it cannot
    release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
    that).

    Still, returning a proper structure to userland during the read() on the
    uffd allows the current UFFD_API to be used for the future non-cooperative
    extensions too, and it looks cleaner as well. Once we get additional
    fields there's no point in returning the fault address page aligned
    just to reuse the bits below PAGE_SHIFT.

    The only downside is that the read() syscall will read 32 bytes instead of
    8 bytes, but that's not going to be measurable overhead.

    The total number of new events that can be added, or of new future bits
    for already shipped events, is limited to 64 by the features field of the
    uffdio_api structure. If more are needed, a bump of UFFD_API will be
    required.
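
    As a rough illustration of the trade-off described above, the sketch
    below models a packed 32-byte message that carries the full faulting
    address, and shows how userland could still derive the page-aligned base
    and the page offset from it. The struct, its field names, and the
    DEMO_PAGE_SIZE constant are assumptions made for this example and are not
    copied from the kernel uAPI header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 32-byte fault message (names are assumptions). */
struct demo_uffd_msg {
    uint8_t  event;
    uint8_t  reserved1;
    uint16_t reserved2;
    uint32_t reserved3;
    union {
        struct {
            uint64_t flags;
            uint64_t address;   /* full faulting address, not page aligned */
        } pagefault;
        struct {
            uint64_t reserved1;
            uint64_t reserved2;
            uint64_t reserved3;
        } reserved;             /* room for future event payloads */
    } arg;
} __attribute__((packed));

/* Assumed page size for the example; real code would query the system. */
#define DEMO_PAGE_SIZE 4096ULL

/* Page-aligned base of the faulting address. */
static uint64_t demo_page_base(uint64_t addr)
{
    return addr & ~(DEMO_PAGE_SIZE - 1);
}

/* Offset of the faulting address within its page. */
static uint64_t demo_page_offset(uint64_t addr)
{
    return addr & (DEMO_PAGE_SIZE - 1);
}
```

    With the full address delivered, the page-aligned value is one mask away,
    so nothing is lost compared with returning only the aligned address.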

    [akpm@linux-foundation.org: use __packed]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This is (seems to be) the minimal thing that is required to unblock
    standard uffd usage from the non-cooperative one. Now more bits can be
    added to the features field indicating e.g. UFFD_FEATURE_FORK and others
    needed for the latter use-case.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Andrea Arcangeli
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware of so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • These two flags get set in vma->vm_flags to tell the VM common code
    if the userfaultfd is armed and in which mode (only tracking missing
    faults, only tracking wrprotect faults or both). If neither flag is
    set it means the userfaultfd is not armed on the vma.
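
    The decoding described above can be sketched as follows. The flag names
    and bit values here are placeholders chosen for the example, not the
    kernel's actual vm_flags definitions.

```c
#include <stdbool.h>

/* Placeholder bit values; the real vm_flags bits differ. */
#define DEMO_VM_UFFD_MISSING 0x1UL
#define DEMO_VM_UFFD_WP      0x2UL

enum demo_uffd_mode {
    DEMO_UFFD_NONE,          /* userfaultfd not armed on this vma */
    DEMO_UFFD_MISSING_ONLY,  /* only tracking missing faults */
    DEMO_UFFD_WP_ONLY,       /* only tracking wrprotect faults */
    DEMO_UFFD_BOTH           /* tracking both kinds of fault */
};

/* The vma is armed if at least one of the two flags is set. */
static bool demo_uffd_armed(unsigned long vm_flags)
{
    return (vm_flags & (DEMO_VM_UFFD_MISSING | DEMO_VM_UFFD_WP)) != 0;
}

/* Map the two flag bits onto the four possible tracking modes. */
static enum demo_uffd_mode demo_uffd_decode(unsigned long vm_flags)
{
    bool missing = (vm_flags & DEMO_VM_UFFD_MISSING) != 0;
    bool wp = (vm_flags & DEMO_VM_UFFD_WP) != 0;

    if (missing && wp)
        return DEMO_UFFD_BOTH;
    if (missing)
        return DEMO_UFFD_MISSING_ONLY;
    if (wp)
        return DEMO_UFFD_WP_ONLY;
    return DEMO_UFFD_NONE;
}
```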

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This adds the vm_userfaultfd_ctx to the vm_area_struct.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Kernel header defining the methods needed by the VM common code to
    interact with the userfaultfd.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
    of the current hardcoded 1 (that would wake just the first waitqueue in
    the head list).
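
    A user-space model of the "nr" semantics may help here. It assumes the
    usual convention that the walk stops once nr exclusive waiters have been
    woken, so that nr == 0 never hits the stop condition and everything is
    woken. The demo_waiter type and demo_wake() are hypothetical names for
    this sketch, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter entry for the example. */
struct demo_waiter {
    bool exclusive;  /* counts against the nr budget when woken */
    bool woken;
};

/*
 * Wake waiters in queue order and return how many were woken.
 * The walk stops after nr exclusive waiters have been woken.
 * With nr == 0 the decrement goes negative and never equals zero,
 * so the whole queue is woken.
 */
static size_t demo_wake(struct demo_waiter *q, size_t n, int nr)
{
    size_t woken = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = true;
        woken++;
        if (q[i].exclusive && --nr == 0)
            break;
    }
    return woken;
}
```

    In this model, calling demo_wake() with nr == 1 on a queue of exclusive
    waiters wakes only the first one, which is the behaviour the commit
    moves away from for userfaultfd.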

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add the basic infrastructure for alloc/free operations on pointer arrays.
    It includes a generic function in the common slab code that is used in
    this infrastructure patch to create the unoptimized functionality for slab
    bulk operations.

    Allocators can then provide optimized allocation functions for situations
    in which large numbers of objects are needed. These optimizations may
    avoid taking locks repeatedly and bypass metadata creation if all objects
    in slab pages can be used to provide the objects required.

    Allocators can extend the skeletons provided and add their own code to the
    bulk alloc and free functions. They can keep the generic allocation and
    freeing and just fall back to those if optimizations would not work (like
    for example when debugging is on).
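
    The unoptimized fallback described above amounts to looping over the
    single-object paths. The sketch below shows that shape in user space,
    with malloc() and free() standing in for the per-object slab calls; the
    demo_* function names and signatures are assumptions for this example
    and not the kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Free nr objects from the pointer array, one at a time. */
static void demo_free_bulk(size_t nr, void **p)
{
    for (size_t i = 0; i < nr; i++) {
        free(p[i]);
        p[i] = NULL;
    }
}

/*
 * Fill the pointer array with nr objects of obj_size bytes each.
 * On failure, release everything allocated so far and report false,
 * so the caller never sees a partially filled array.
 */
static bool demo_alloc_bulk(size_t obj_size, size_t nr, void **p)
{
    for (size_t i = 0; i < nr; i++) {
        p[i] = malloc(obj_size);
        if (!p[i]) {
            demo_free_bulk(i, p);
            return false;
        }
    }
    return true;
}
```

    An optimized allocator would replace the loop body with a path that
    takes its locks once for the whole batch, and fall back to a loop like
    this one when that is not possible.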

    Signed-off-by: Christoph Lameter
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Rename watchdog_suspend() to lockup_detector_suspend() and
    watchdog_resume() to lockup_detector_resume() to avoid confusion with the
    watchdog subsystem and to be consistent with the existing name
    lockup_detector_init().

    Also provide comment blocks to explain the watchdog_running and
    watchdog_suspended variables and their relationship.

    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
    these functions are no longer needed. If a subsystem has a need to
    deactivate the watchdog temporarily, it should utilize the
    watchdog_suspend() and watchdog_resume() functions.

    [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • This interface can be utilized to deactivate the hard and soft lockup
    detector temporarily. Callers are expected to minimize the duration of
    deactivation. Multiple deactivations are allowed to occur in parallel but
    should be rare in practice.
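
    One common way to allow overlapping deactivations is a nesting counter:
    the detector stops on the first suspend and restarts only when the last
    outstanding suspend is resumed. The sketch below is a single-threaded
    model of that idea under assumed names; a real implementation would need
    locking around the counter and is not reproduced here.

```c
#include <stdbool.h>

/* Number of outstanding suspend requests (hypothetical model state). */
static int demo_suspended;
/* Whether the modelled detector is currently active. */
static bool demo_running = true;

/* First suspend stops the detector; nested suspends only bump the count. */
static void demo_detector_suspend(void)
{
    if (demo_suspended++ == 0)
        demo_running = false;
}

/* The detector restarts only when the last suspend has been resumed. */
static void demo_detector_resume(void)
{
    if (demo_suspended > 0 && --demo_suspended == 0)
        demo_running = true;
}
```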

    [akpm@linux-foundation.org: remove unneeded static initialization]
    Signed-off-by: Ulrich Obergfell
    Reviewed-by: Aaron Tomlin
    Cc: Guenter Roeck
    Cc: Don Zickus
    Cc: Ulrich Obergfell
    Cc: Jiri Olsa
    Cc: Michal Hocko
    Cc: Stephane Eranian
    Cc: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • The kernel's NMI watchdog has nothing to do with the watchdog subsystem.
    Its header declarations should be in linux/nmi.h, not linux/watchdog.h.

    The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is
    not configured, one in the include file and one in kernel/watchdog.c.
    Remove the dummy functions from kernel/watchdog.c and use those from the
    include file.

    Signed-off-by: Guenter Roeck
    Cc: Stephane Eranian
    Cc: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guenter Roeck
     
  • It makes the registration cheaper and simpler for the smpboot per-cpu
    kthread users that don't need to always update the cpumask after thread
    creation.

    [sfr@canb.auug.org.au: fix for allow passing the cpumask on per-cpu thread registration]
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Chris Metcalf
    Reviewed-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Ulrich Obergfell
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Many file systems that implement the show_options hook fail to correctly
    escape their output, which could lead to unescaped characters (e.g.
    newlines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.
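
    To make the fix concrete, here is a user-space sketch of the kind of
    escaping such a helper needs to perform: separator and whitespace bytes
    in an option value are written as three-digit octal escapes, so an
    embedded newline can no longer start a fake line. The function name, its
    signature, and the escape set passed in the usage note are assumptions
    for this example, not the kernel helpers themselves.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy src into dst, writing every byte that appears in esc as a
 * backslash followed by three octal digits. Returns the length of the
 * escaped string, or -1 if dst is too small to hold it.
 */
static int demo_escape(char *dst, size_t dst_len, const char *src,
                       const char *esc)
{
    size_t out = 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (strchr(esc, c)) {
            /* Need 4 bytes for the escape plus room for the final NUL. */
            if (out + 4 >= dst_len)
                return -1;
            out += (size_t)snprintf(dst + out, dst_len - out,
                                    "\\%03o", (unsigned)c);
        } else {
            if (out + 1 >= dst_len)
                return -1;
            dst[out++] = (char)c;
        }
    }
    if (out >= dst_len)
        return -1;
    dst[out] = '\0';
    return (int)out;
}
```

    Called with an escape set such as ",= \t\n\\", the workdir value from
    the example above would come out on a single line, with the space and
    newline bytes rendered as \040 and \012.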

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook