28 Feb, 2016

2 commits

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
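    A rough illustration of the resulting call shape, assuming the 4.5-era
    three-argument dax_writeback_mapping_range(mapping, bdev, wbc) form; the
    hook name and the choice of s_bdev below are illustrative, not the exact
    upstream code (an XFS real-time file or a raw block device inode would
    pass a different bdev):

        /* illustrative ->writepages hook for a DAX-capable filesystem */
        static int example_dax_writepages(struct address_space *mapping,
                                          struct writeback_control *wbc)
        {
                struct inode *inode = mapping->host;

                /*
                 * The filesystem, not the generic pagecache code, knows
                 * which block device actually backs this inode.
                 */
                return dax_writeback_mapping_range(mapping,
                                                   inode->i_sb->s_bdev, wbc);
        }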
     
  • The recent *sync enabling discovered that we are inserting into the
    block_device page cache in a way that runs counter to the expectations
    of the dirty-data tracking for DAX mappings. This can lead to data
    corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support as broken so it is disabled by default, but otherwise still
    available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

06 Feb, 2016

1 commit

  • Previously the pfn_mkwrite() fault handler for raw block devices called
    blkdev_dax_fault() -> __dax_fault() to do a full DAX page fault.

    Really what the pfn_mkwrite() fault handler needs to do is call
    dax_pfn_mkwrite() to make sure that the radix tree entry for the given
    PTE is marked as dirty so that a follow-up fsync or msync call will
    flush it durably to media.

    Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
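    A sketch of the shape of the fixed handler, assuming the 4.5-era
    dax_pfn_mkwrite(vma, vmf) prototype; the handler and vm_ops names are
    illustrative, and example_dax_fault stands in for the existing full
    fault handler:

        /* pfn_mkwrite: just mark the radix tree entry dirty, no full fault */
        static int example_dax_pfn_mkwrite(struct vm_area_struct *vma,
                                           struct vm_fault *vmf)
        {
                /*
                 * dax_pfn_mkwrite() tags the entry for the faulting offset
                 * as dirty so a later fsync()/msync() flushes it to media.
                 */
                return dax_pfn_mkwrite(vma, vmf);
        }

        static const struct vm_operations_struct example_dax_vm_ops = {
                .fault       = example_dax_fault,       /* -> __dax_fault() */
                .pfn_mkwrite = example_dax_pfn_mkwrite,
        };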
     

31 Jan, 2016

1 commit

  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode,
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
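    A rough sketch of the idea, using the 4.5-era radix tree API; the entry
    encoding below is illustrative and simpler than the RADIX_DAX_* layout
    actually used by fs/dax.c (radix_tree_preload() and -EEXIST handling are
    omitted for brevity):

        /*
         * Pack a writeback address (sector) into a non-pointer radix tree
         * entry; the RADIX_TREE_EXCEPTIONAL_ENTRY bit marks it as
         * exceptional so it is never mistaken for a struct page pointer.
         */
        static void *example_dax_radix_entry(sector_t sector)
        {
                return (void *)((unsigned long)sector << RADIX_TREE_EXCEPTIONAL_SHIFT |
                                RADIX_TREE_EXCEPTIONAL_ENTRY);
        }

        static int example_mark_dax_entry_dirty(struct address_space *mapping,
                                                pgoff_t index, sector_t sector)
        {
                int error;

                spin_lock_irq(&mapping->tree_lock);
                error = radix_tree_insert(&mapping->page_tree, index,
                                          example_dax_radix_entry(sector));
                if (!error)
                        /* tagged entries are what fsync/msync walk later */
                        radix_tree_tag_set(&mapping->page_tree, index,
                                           PAGECACHE_TAG_DIRTY);
                spin_unlock_irq(&mapping->tree_lock);
                return error;
        }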
     
  • Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}(), add
    inode_foo(inode) wrappers, each equivalent to mutex_foo(&inode->i_mutex).

    Please use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become an rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
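    The wrappers themselves are trivial one-liners following the
    inode_foo() = mutex_foo(&inode->i_mutex) scheme described above; a few
    of them, in essence:

        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_trylock(struct inode *inode)
        {
                return mutex_trylock(&inode->i_mutex);
        }

    Callers then write inode_lock(inode) / inode_unlock(inode) instead of
    touching ->i_mutex directly, which is what allows the underlying lock to
    be switched to an rwsem later without touching every filesystem.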
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename the request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache

    Linus Torvalds
     

16 Jan, 2016

2 commits

  • The DAX implementation needs to protect new calls to ->direct_access()
    and usage of its return value against the driver for the underlying
    block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
    hold off blk_cleanup_queue() from proceeding, or otherwise fail new
    mapping requests if the request_queue is being torn down.

    This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
    through dax_map_atomic() to bdev_direct_access().

    [willy@linux.intel.com: fix read() of a hole]
    Signed-off-by: Dan Williams
    Reviewed-by: Jeff Moyer
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
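    A hedged sketch of the pattern, modeled on the dax_map_atomic() idea;
    the function names are illustrative and the second argument to
    blk_queue_enter() reflects the 4.4/4.5-era gfp_t-based prototype:

        /* map: take a queue reference so blk_cleanup_queue() is held off */
        static long example_dax_map(struct block_device *bdev,
                                    struct blk_dax_ctl *dax)
        {
                struct request_queue *q = bdev->bd_queue;
                long rc;

                if (blk_queue_enter(q, GFP_NOWAIT) != 0)
                        return -ENXIO;          /* queue is being torn down */

                rc = bdev_direct_access(bdev, dax);
                if (rc < 0)
                        blk_queue_exit(q);      /* nothing mapped, drop ref */
                return rc;
        }

        /* unmap: drop the reference once the returned address is unused */
        static void example_dax_unmap(struct block_device *bdev)
        {
                blk_queue_exit(bdev->bd_queue);
        }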
     
  • If a ->direct_access() implementation ever returns a map count less than
    PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies
    error checking in upper layers.

    Signed-off-by: Dan Williams
    Reported-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

15 Jan, 2016

2 commits

  • bdev_write_page() is used by swapout and by writepage where we cannot
    use __GFP_FS or __GFP_IO. So it is misleading to mention GFP_KERNEL
    here.

    blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
    bugs were harmed in the making of this patch.

    Cc: Dan Williams
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
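    For reference, opting an allocation into memcg accounting is a one-flag
    change at the allocation site; the struct, cache, and variable names
    below are placeholders:

        struct example_struct { int payload; };
        static struct kmem_cache *example_cachep;
        static void *example_buf;

        static int __init example_init(void)
        {
                /* objects from this cache are charged to the allocating memcg */
                example_cachep = kmem_cache_create("example_struct",
                                        sizeof(struct example_struct), 0,
                                        SLAB_ACCOUNT, NULL);
                if (!example_cachep)
                        return -ENOMEM;

                /* a one-off allocation can opt in per call site instead */
                example_buf = kmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ACCOUNT);
                return example_buf ? 0 : -ENOMEM;
        }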
     

14 Jan, 2016

2 commits

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has appeared in -next and independently received a
    build success notification from the kbuild robot. The 'for-4.5/block-
    dax' topic branch was rebased over the weekend to drop the "block
    device end-of-life" rework that Al would like to see re-implemented
    with a notifier, and to address bug reports against the badblocks
    integration.

    There is pending feedback against "libnvdimm: Add a poison list and
    export badblocks" received last week. Linda identified some localized
    fixups that we will handle incrementally.

    Summary:

    - Media error handling: The 'badblocks' implementation that
    originated in md-raid is up-levelled to a generic capability of a
    block device. This initial implementation is limited to being
    consulted in the pmem block-i/o path. Later, 'badblocks' will be
    consulted when creating dax mappings.

    - Raw block device dax: For virtualization and other cases that want
    large contiguous mappings of persistent memory, add the capability
    to dax-mmap a block device directly.

    - Increased /dev/mem restrictions: Add an option to treat all
    io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
    while a driver is actively using an address range. This behavior
    is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
    overridden by the existing "iomem=relaxed" kernel command line
    option.

    - Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
    block device shutdown crash fix, and other small libnvdimm fixes"

    * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
    block: kill disk_{check|set|clear|alloc}_badblocks
    libnvdimm, pmem: nvdimm_read_bytes() badblocks support
    pmem, dax: disable dax in the presence of bad blocks
    pmem: fail io-requests to known bad blocks
    libnvdimm: convert to statically allocated badblocks
    libnvdimm: don't fail init for full badblocks list
    block, badblocks: introduce devm_init_badblocks
    block: clarify badblocks lifetime
    badblocks: rename badblocks_free to badblocks_exit
    libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
    libnvdimm: Add a poison list and export badblocks
    nfit_test: Enable DSMs for all test NFITs
    md: convert to use the generic badblocks code
    block: Add badblock management for gendisks
    badblocks: Add core badblock management code
    block: fix del_gendisk() vs blkdev_ioctl crash
    block: enable dax for raw block devices
    block: introduce bdev_file_inode()
    restrict /dev/mem to idle io memory ranges
    arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
    ...

    Linus Torvalds
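    A hedged sketch of how a block driver's I/O path can consult the new
    per-gendisk badblocks list; names other than badblocks_check() and
    disk->bb are illustrative, and the return-value convention is summarized
    in the comments rather than quoted from the source:

        /* returns -EIO if the requested range overlaps a known-bad region */
        static int example_check_badblocks(struct gendisk *disk,
                                           sector_t sector,
                                           unsigned int nr_sectors)
        {
                sector_t first_bad;
                int num_bad;

                if (!disk->bb)
                        return 0;       /* no badblocks list registered */

                if (badblocks_check(disk->bb, sector, nr_sectors,
                                    &first_bad, &num_bad))
                        return -EIO;    /* range touches a bad block */

                return 0;
        }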
     
  • - bd_acquire() and bd_forget() open-code bdgrab() and bdput()
    - raw driver uses igrab() but never checks its return value and always
    holds another ref from bind_set() while calling it, so it's
    equivalent to bdgrab()

    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

09 Jan, 2016

2 commits

  • If an application wants exclusive access to all of the persistent memory
    provided by an NVDIMM namespace it can use this raw-block-dax facility
    to forgo establishing a filesystem. This capability is targeted
    primarily to hypervisors wanting to provision persistent memory for
    guests. It can be disabled / enabled dynamically via the new BLKDAXSET
    ioctl.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Similar to the file_inode() helper, provide a helper to look up the
    inode for a raw block device itself.

    Cc: Al Viro
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Dan Williams
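    The helper is essentially a one-liner along these lines, mirroring what
    file_inode() does for regular files:

        static inline struct inode *bdev_file_inode(struct file *file)
        {
                return file->f_mapping->host;
        }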
     


05 Dec, 2015

1 commit

  • Since 52ebea749aae ("writeback: make backing_dev_info host
    cgroup-specific bdi_writebacks") an inode, at some point in its
    lifetime, gets attached to a wb (struct bdi_writeback). Detaching
    happens on evict, in inode_detach_wb() called from __destroy_inode(),
    and involves updating the wb.

    However, detaching an internal bdev inode from its wb in
    __destroy_inode() is too late. Its bdi, and by extension its root wb,
    are embedded into struct request_queue, which has different lifetime
    rules and can be freed long before the final bdput() is called (which
    can happen from __fput() of a corresponding /dev inode, through
    dput() -> evict() -> bd_forget()). bdevs hold onto the underlying
    disk/queue pair only while opened; as soon as the bdev is closed all
    bets are off. In fact, the disk/queue can be gone before
    __blkdev_put() even returns:

    1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
    1500 {
    ...
    1518         if (bdev->bd_contains == bdev) {
    1519                 if (disk->fops->release)
    1520                         disk->fops->release(disk, mode);

                          [ Driver puts its references to disk/queue ]

    1521         }
    1522         if (!bdev->bd_openers) {
    1523                 struct module *owner = disk->fops->owner;
    1524
    1525                 disk_put_part(bdev->bd_part);
    1526                 bdev->bd_part = NULL;
    1527                 bdev->bd_disk = NULL;
    1528                 if (bdev != bdev->bd_contains)
    1529                         victim = bdev->bd_contains;
    1530                 bdev->bd_contains = NULL;
    1531
    1532                 put_disk(disk);

                          [ We put ours; the queue is gone.
                            The last bdput() would result in a write to invalid memory. ]

    1533                 module_put(owner);
    ...
    1539 }

    Since bdev inodes are special anyway, detach them in __blkdev_put()
    after clearing inode's dirty bits, turning the problematic
    inode_detach_wb() in __destroy_inode() into a noop.

    add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
    gendisk hold a reference to its queue"), so the old ->release comment
    is removed in favor of the new inode_detach_wb() comment.

    Cc: stable@vger.kernel.org # 4.2+, needs backporting
    Signed-off-by: Ilya Dryomov
    Acked-by: Tejun Heo
    Tested-by: Raghavendra K T
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     


20 Nov, 2015

1 commit

  • Fix use after free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc:
    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
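    The crash above is a device-teardown race against ->rw_page; the fix in
    this timeframe took a request_queue reference around the call. A hedged
    sketch of that pattern (the function name is illustrative, and the gfp
    argument to blk_queue_enter() reflects the 4.4-era prototype):

        int example_bdev_read_page(struct block_device *bdev, sector_t sector,
                                   struct page *page)
        {
                const struct block_device_operations *ops = bdev->bd_disk->fops;
                int rc;

                if (!ops->rw_page)
                        return -EOPNOTSUPP;

                /* hold off queue teardown while ->rw_page runs */
                rc = blk_queue_enter(bdev->bd_queue, GFP_KERNEL);
                if (rc)
                        return rc;

                rc = ops->rw_page(bdev, sector + get_start_sect(bdev),
                                  page, READ);
                blk_queue_exit(bdev->bd_queue);
                return rc;
        }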
     

12 Nov, 2015

1 commit

  • If a block device is hot removed and later last reference to device
    is put, we try to writeback the dirty inode. But device is gone and
    that writeback fails.

    Currently we do a WARN_ON() which does not seem to be the right thing.
    Convert it to a ratelimited kernel warning.

    Reported-by: Andi Kleen
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    [jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols]
    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Vivek Goyal
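    The resulting warning is a rate-limited printk rather than a WARN_ON();
    a minimal sketch, assuming an inode and bdev in scope at the writeback
    call site:

        int ret = write_inode_now(inode, true);

        if (ret) {
                char name[BDEVNAME_SIZE];

                /* device may already be gone; complain, but ratelimited */
                pr_warn_ratelimited("VFS: Dirty inode writeback failed for block device %s (err=%d).\n",
                                    bdevname(bdev, name), ret);
        }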
     

22 Oct, 2015

1 commit

  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     


09 Sep, 2015

3 commits

  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton : (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault();
    VM_FAULT_FALLBACK is defined in <linux/mm.h>. Given that we don't want
    to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to
    move the DAX-related functions to a new header, <linux/dax.h>. We could
    also have moved the VM_FAULT_* definitions to a new header, or to a
    different header that isn't quite such a boil-the-ocean header as
    <linux/mm.h>, but this felt like the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
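    For the memremap() piece, the conversion is mostly mechanical; a minimal
    sketch, assuming a struct resource describing the persistent-memory
    range (the function name is illustrative):

        static void *example_map_pmem(struct resource *res)
        {
                /*
                 * memremap() replaces ioremap_cache()/ioremap_wt() for
                 * ranges without I/O side effects; the returned pointer
                 * carries no __iomem annotation.
                 */
                return memremap(res->start, resource_size(res), MEMREMAP_WB);
        }

    The counterpart for page-frame-backed mappings is devm_memremap_pages(),
    which additionally arranges struct page coverage for the range
    (ZONE_DEVICE), as described above.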
     


21 Aug, 2015

1 commit

  • Update the annotation for the kaddr pointer returned by direct_access()
    so that it is a __pmem pointer. This is consistent with the PMEM driver
    and with how this direct_access() pointer is used in the DAX code.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Ross Zwisler
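    With that change the ->direct_access prototype in this era takes a
    void __pmem ** for the kernel address; a hedged sketch of a driver
    implementation (the example_dev structure and its fields are
    placeholders):

        struct example_dev {
                void __pmem     *virt_addr;
                phys_addr_t     phys_addr;
                size_t          size;
        };

        static long example_direct_access(struct block_device *bdev,
                                          sector_t sector,
                                          void __pmem **kaddr,
                                          unsigned long *pfn)
        {
                struct example_dev *dev = bdev->bd_disk->private_data;
                resource_size_t offset = sector << 9;

                *kaddr = dev->virt_addr + offset;
                *pfn = (dev->phys_addr + offset) >> PAGE_SHIFT;

                /* bytes usable via *kaddr from this offset onwards */
                return dev->size - offset;
        }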
     

18 Aug, 2015

1 commit

  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

05 Jul, 2015

3 commits

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     
  • The brd driver is the only in-tree driver that may sleep currently.
    After some discussion on linux-fsdevel, we decided that any driver
    may choose to sleep in its ->direct_access method. To ensure that all
    callers of bdev_direct_access() are prepared for this, add a call
    to might_sleep().

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Al Viro

    Matthew Wilcox
     
  • If a block device supports the ->direct_access methods, bypass the normal
    DIO path and use DAX to go straight to memcpy() instead of allocating
    a DIO and a BIO.

    Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
    do_blockdev_direct_IO().

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Al Viro

    Matthew Wilcox
     

01 Jul, 2015

1 commit

  • Pull more block layer patches from Jens Axboe:
    "A few later arrivers that I didn't fold into the first pull request,
    so we had a chance to run some testing. This contains:

    - NVMe:
    - Set of fixes from Keith
    - 4.4 and earlier gcc build fix from Andrew

    - small set of xen-blk{back,front} fixes from Bob Liu.

    - warnings fix for bogus inline statement in I_BDEV() from Geert.

    - error code fixup for SG_IO ioctl from Paolo Bonzini"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    drivers/block/nvme-core.c: fix build with gcc-4.4.4
    bdi: Remove "inline" keyword from exported I_BDEV() implementation
    block: fix bogus EFAULT error from SG_IO ioctl
    NVMe: Fix filesystem deadlock on removal
    NVMe: Failed controller initialization fixes
    NVMe: Unify controller probe and resume
    NVMe: Don't use fake status on cancelled command
    NVMe: Fix device cleanup on initialization failure
    drivers: xen-blkfront: only talk_to_blkback() when in XenbusStateInitialising
    xen/block: add multi-page ring support
    driver: xen-blkfront: move talk_to_blkback to a more suitable place
    drivers: xen-blkback: delay pending_req allocation to connect_ring

    Linus Torvalds
     

30 Jun, 2015

1 commit

  • Pull libnvdimm subsystem from Dan Williams:
    "The libnvdimm sub-system introduces, in addition to the
    libnvdimm-core, 4 drivers / enabling modules:

    NFIT:
    Instantiates an "nvdimm bus" with the core and registers memory
    devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
    Interface table).

    After registering NVDIMMs the NFIT driver then registers "region"
    devices. A libnvdimm-region defines an access mode and the
    boundaries of persistent memory media. A region may span multiple
    NVDIMMs that are interleaved by the hardware memory controller. In
    turn, a libnvdimm-region can be carved into a "namespace" device and
    bound to the PMEM or BLK driver which will attach a Linux block
    device (disk) interface to the memory.

    PMEM:
    Initially merged in v4.1 this driver for contiguous spans of
    persistent memory address ranges is re-worked to drive
    PMEM-namespaces emitted by the libnvdimm-core.

    In this update the PMEM driver, on x86, gains the ability to assert
    that writes to persistent memory have been flushed all the way
    through the caches and buffers in the platform to persistent media.
    See memcpy_to_pmem() and wmb_pmem().

    BLK:
    This new driver enables access to persistent memory media through
    "Block Data Windows" as defined by the NFIT. The primary difference
    of this driver to PMEM is that only a small window of persistent
    memory is mapped into system address space at any given point in
    time.

    Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
    different portions of the media. BLK-mode, by definition, does not
    support DAX.

    BTT:
    This is a library, optionally consumed by either PMEM or BLK, that
    converts a byte-accessible namespace into a disk with atomic sector
    update semantics (prevents sector tearing on crash or power loss).

    The sinister aspect of sector tearing is that most applications do
    not know they have an atomic-sector dependency. At least today's
    disks rarely ever tear sectors, and if they do, one almost certainly
    gets a CRC error on access. NVDIMMs will always tear, and always
    silently. Until an application is audited to be robust in the
    presence of sector tearing, the usage of BTT is recommended.

    Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
    Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
    Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
    Wysocki, and Bob Moore"

    * tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
    arch, x86: pmem api for ensuring durability of persistent memory updates
    libnvdimm: Add sysfs numa_node to NVDIMM devices
    libnvdimm: Set numa_node to NVDIMM devices
    acpi: Add acpi_map_pxm_to_online_node()
    libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
    pmem: flag pmem block devices as non-rotational
    libnvdimm: enable iostat
    pmem: make_request cleanups
    libnvdimm, pmem: fix up max_hw_sectors
    libnvdimm, blk: add support for blk integrity
    libnvdimm, btt: add support for blk integrity
    fs/block_dev.c: skip rw_page if bdev has integrity
    libnvdimm: Non-Volatile Devices
    tools/testing/nvdimm: libnvdimm unit test infrastructure
    libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
    nd_btt: atomic sector updates
    libnvdimm: infrastructure for btt devices
    libnvdimm: write blk label set
    libnvdimm: write pmem label set
    libnvdimm: blk labels and namespace instantiation
    ...

    Linus Torvalds
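    The persistence API mentioned above pairs a flushing copy with a final
    barrier; a minimal sketch of the calling convention (the wrapper name is
    illustrative):

        #include <linux/pmem.h>

        static void example_write_pmem(void __pmem *dst, const void *src,
                                       size_t n)
        {
                memcpy_to_pmem(dst, src, n);    /* copy is flushed out of CPU caches */
                wmb_pmem();                     /* drain stores so the data is durable */
        }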
     

28 Jun, 2015

1 commit

  • With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4):

    CC fs/block_dev.o
    include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called
    include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here

    Commit a212b105b07d75b4 ("bdi: make inode_to_bdi() inline") added a
    caller of I_BDEV() in a header file, exposing the bogus "inline" on the
    exported implementation.

    Drop the "inline" keyword to fix this.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     

26 Jun, 2015

1 commit

  • If a block device has bio integrity enabled, rw_page will bypass the
    integrity payload, which is undesirable. Skip rw_page if this is the
    case.

    Currently brd and zram provide rw_page, and the proposed 'nd' drivers
    will too.

    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Suggested-by: Matthew Wilcox
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
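    The guard itself is a two-line check at the top of the rw_page wrappers;
    roughly, with ops being bdev->bd_disk->fops:

        /* fall back to the normal bio path if integrity metadata is in use */
        if (!ops->rw_page || bdev_get_integrity(bdev))
                return -EOPNOTSUPP;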
     

02 Jun, 2015

2 commits

  • Now that bdi definitions are moved to backing-dev-defs.h,
    backing-dev.h can include blkdev.h and inline inode_to_bdi() without
    worrying about introducing a circular include dependency. The function
    gets called from hot paths and is fairly trivial.

    This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() helper that
    it calls, inline. blockdev_superblock and noop_backing_dev_info
    are EXPORT_GPL'd to allow the inline functions to be used from
    modules.

    While at it, make sb_is_blkdev_sb() return bool instead of int.

    v2: Fixed typo in description as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
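    A sketch of the shape of the now-inline helpers (close to, but not
    necessarily byte-for-byte, the header version; the CONFIG_BLOCK=n case
    is omitted):

        static inline bool sb_is_blkdev_sb(struct super_block *sb)
        {
                return sb == blockdev_superblock;
        }

        static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
        {
                struct super_block *sb;

                if (!inode)
                        return &noop_backing_dev_info;

                sb = inode->i_sb;
                if (sb_is_blkdev_sb(sb))
                        return blk_get_backing_dev_info(I_BDEV(inode));

                return sb->s_bdi;
        }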
     
  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Apr, 2015

1 commit

  • do_blockdev_direct_IO() increments and decrements the inode
    ->i_dio_count for each IO operation. It does this to protect against
    truncate of a file. Block devices don't need this sort of protection.

    For a capable multiqueue setup, this atomic int is the only shared
    state between applications accessing the device for O_DIRECT, and it
    presents a scaling wall for that. In my testing, as much as 30% of
    system time is spent incrementing and decrementing this value. A mixed
    read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
    better latencies too. Before:

    clat percentiles (usec):
    | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
    | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
    | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
    | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
    | 99.99th=[ 165]

    After:

    clat percentiles (usec):
    | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
    | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
    | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
    | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
    | 99.99th=[ 438]

    In other setups, Robert Elliott reported seeing good performance
    improvements:

    https://lkml.org/lkml/2015/4/3/557

    The more applications accessing the device, the worse it gets.

    Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
    do_blockdev_direct_IO() that it need not worry about incrementing
    or decrementing the inode i_dio_count for this caller.

    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Theodore Ts'o
    Cc: Elliott, Robert (Server Storage)
    Cc: Al Viro
    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
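    Inside do_blockdev_direct_IO() the flag simply gates the refcount
    manipulation; roughly, using the inode_dio_begin()/inode_dio_end()
    helpers from this same series:

        /* filesystems keep the truncate protection; block devices skip it */
        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                inode_dio_begin(inode);         /* atomic_inc(&inode->i_dio_count) */

        /* ... submit the I/O ... */

        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                inode_dio_end(inode);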
     

16 Apr, 2015

1 commit