09 Jan, 2017

1 commit

  • commit af309226db916e2c6e08d3eba3fa5c34225200c4 upstream.

    If a block device is closed while iterate_bdevs() is handling it, the
    following NULL pointer dereference occurs because bdev->b_disk is NULL
    in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
    turn called by the mapping_cap_writeback_dirty() call in
    __filemap_fdatawrite_range()):

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
    IP: [] blk_get_backing_dev_info+0x10/0x20
    PGD 9e62067 PUD 9ee8067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
    RIP: 0010:[] [] blk_get_backing_dev_info+0x10/0x20
    RSP: 0018:ffff880009f5fe68 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
    RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
    RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
    R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
    FS: 00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
    Stack:
    ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
    0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
    ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
    Call Trace:
    [] __filemap_fdatawrite_range+0x85/0x90
    [] filemap_fdatawrite+0x1f/0x30
    [] fdatawrite_one_bdev+0x16/0x20
    [] iterate_bdevs+0xf2/0x130
    [] sys_sync+0x63/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 8b 80 08 05 00 00 5d
    RIP [] blk_get_backing_dev_info+0x10/0x20
    RSP
    CR2: 0000000000000508
    ---[ end trace 2487336ceb3de62d ]---

    The crash is easily reproducible by running the following command, if an
    msleep(100) is inserted before the call to func() in iterate_bdevs():

    while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

    Fix it by holding the bd_mutex across the func() call and only calling
    func() if the bdev is opened.
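
A minimal userspace sketch of the locking pattern described above, assuming a simplified stand-in for struct block_device. The names bdev_model, visit_bdev() and count_visit() are hypothetical and exist only for illustration; a pthread mutex stands in for bd_mutex. It is not the kernel code, only a model of "hold the mutex across func() and skip devices with no openers":

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct block_device. */
struct bdev_model {
    pthread_mutex_t bd_mutex;   /* models bdev->bd_mutex */
    int bd_openers;             /* number of current openers */
    void *bd_disk;              /* NULL once the device is closed */
};

typedef void (*bdev_func)(struct bdev_model *, void *);

/*
 * Call func() on one device while holding bd_mutex for the whole call,
 * and only if the device is currently open. A device that is closed
 * (bd_openers == 0) is skipped, so func() never runs on it.
 * Returns 1 if func() was called, 0 if the device was skipped.
 */
static int visit_bdev(struct bdev_model *bdev, bdev_func func, void *arg)
{
    int called = 0;

    assert(bdev != NULL && func != NULL);

    pthread_mutex_lock(&bdev->bd_mutex);
    if (bdev->bd_openers) {
        func(bdev, arg);
        called = 1;
    }
    pthread_mutex_unlock(&bdev->bd_mutex);

    return called;
}

/* Example callback: counts how many devices it was invoked on. */
static void count_visit(struct bdev_model *bdev, void *arg)
{
    (void)bdev;
    ++*(int *)arg;
}
```

Because a close would have to take the same mutex in this model, it cannot clear bd_disk while func() is running.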

    Fixes: 5c0d6b60a0ba ("vfs: Create function for iterating over block devices")
    Reported-and-tested-by: Wei Fang
    Signed-off-by: Rabin Vincent
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Rabin Vincent
     

06 Jan, 2017

1 commit

  • commit bcc7f5b4bee8e327689a4d994022765855c807ff upstream.

    bdev->bd_contains is not stable before calling __blkdev_get().
    When __blkdev_get() is called on a partition with ->bd_openers == 0
    it sets
    bdev->bd_contains = bdev;
    which is not correct for a partition.
    After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
    and then ->bd_contains is stable.

    When FMODE_EXCL is used, blkdev_get() calls
    bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim()

    This call happens before __blkdev_get() is called, so ->bd_contains
    is not stable. So bd_may_claim() cannot safely use ->bd_contains.
    It currently tries to use it, and this can lead to a BUG_ON().

    This happens when a whole device is already open with a bd_holder (in
    use by dm in my particular example) and two threads race to open a
    partition of that device for the first time, one opening with O_EXCL and
    one without.

    The thread that doesn't use O_EXCL gets through blkdev_get() to
    __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;

    Immediately thereafter the other thread, using FMODE_EXCL, calls
    bd_start_claiming() from blkdev_get(). This should fail because the
    whole device has a holder, but because bdev->bd_contains == bdev
    bd_may_claim() incorrectly reports success.
    This thread continues and blocks on bd_mutex.

    The first thread then sets bdev->bd_contains correctly and drops the mutex.
    The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
    again in:
    BUG_ON(!bd_may_claim(bdev, whole, holder));
    The BUG_ON fires.

    Fix this by removing the dependency on ->bd_contains in
    bd_may_claim(). As bd_may_claim() has direct access to the whole
    device, it can simply test if the target bdev is the whole device.
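
The idea of the fix can be sketched in userspace C as follows. This is a simplified model under stated assumptions: claim_bdev and may_claim() are hypothetical names, and the special case for a device that is in the middle of being partitioned is left out. The point is only that the check compares the target against the whole device passed in, instead of reading ->bd_contains:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct block_device: only the holder matters here. */
struct claim_bdev {
    void *bd_holder;            /* NULL when nobody holds the device */
};

/*
 * Decide whether 'holder' may claim 'bdev', where 'whole' is the whole
 * device that contains it (whole == bdev for a whole device).
 * No bd_contains field is consulted, so the result does not depend on
 * state that is unstable before the first open completes.
 */
static bool may_claim(const struct claim_bdev *bdev,
                      const struct claim_bdev *whole,
                      const void *holder)
{
    assert(bdev != NULL && whole != NULL);

    if (bdev->bd_holder == holder)
        return true;            /* already held by this holder */
    if (bdev->bd_holder != NULL)
        return false;           /* held by someone else */
    if (whole == bdev)
        return true;            /* an unheld whole device */
    if (whole->bd_holder != NULL)
        return false;           /* partition of a held device */
    return true;                /* partition of an unheld device */
}
```

In the race described above, the partition's whole device has a holder, so this model rejects the claim regardless of what bd_contains temporarily points to.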

    Fixes: 6b4517a7913a ("block: implement bd_claiming and claiming block")
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

12 Oct, 2016

1 commit

  • After much discussion, it seems that the fallocate feature flag
    FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
    FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
    for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is
    set. A length that goes past the end of the device will be clamped to the
    device size if KEEP_SIZE is set; or will return -EINVAL if not. Both
    start and length must be aligned to the device's logical block size.

    Since the semantics of fallocate are fairly well established already, wire
    up the two pieces. The other fallocate variants (collapse range, insert
    range, and allocate blocks) are not supported.

    Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Cc: Theodore Ts'o
    Cc: Martin K. Petersen
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

06 Oct, 2016

1 commit

  • When triggering thaw-filesystems via magic sysrq, the system enters a
    loop in do_thaw_one(), as thaw_bdev() still returns success if
    bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
    error (and simplify the code a bit at the same time).

    Reviewed-by: Eric Farman
    Reviewed-by: Cornelia Huck
    Signed-off-by: Pierre Morel
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Pierre Morel
     

14 Sep, 2016

1 commit

  • DAX support for block devices was removed in commits 03cdad
    ("block: disable block device DAX by default") and 99a01cd
    ("block: remove BLK_DEV_DAX config option"), but we still kept a call to
    dax_do_io and some unneeded i_flags manipulations introduced in commit
    bbab37 ("block: Add support for DAX reads/writes to block devices").

    Remove those leftovers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Acked-by: Dan Williams
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Aug, 2016

1 commit

  • If freeze_bdev() is called twice on the same block device without a mounted
    filesystem, get_super() will return NULL, which leads to a NULL-ptr
    dereference later in drop_super().

    Check get_super() result to fix that.

    Note that this is a purely theoretical issue. We have only 3
    freeze_bdev() callers. 2 of them are in filesystem code and used on a
    device with mounted fs. The third one in lock_fs() has protection in
    upper-layer code against freezing block device the second time without
    thawing it first.

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Andrey Ryabinin
     

22 Aug, 2016

1 commit

  • I got this:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880113415940 task.stack: ffff880118350000
    RIP: 0010:[] [] bd_mount+0x52/0xa0
    RSP: 0018:ffff880118357ca0 EFLAGS: 00010207
    RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
    RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
    RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
    R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
    FS: 00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
    DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
    ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
    ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
    0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
    Call Trace:
    [] mount_fs+0x97/0x2e0
    [] ? alloc_vfsmnt+0x55e/0x760
    [] vfs_kern_mount+0x7a/0x300
    [] ? _raw_read_unlock+0x2c/0x50
    [] do_mount+0x3d7/0x2730
    [] ? trace_do_page_fault+0x1f4/0x3a0
    [] ? copy_mount_string+0x40/0x40
    [] ? memset+0x31/0x40
    [] ? copy_mount_options+0x1ee/0x320
    [] SyS_mount+0xb2/0x120
    [] ? copy_mnt_ns+0x970/0x970
    [] do_syscall_64+0x1c4/0x4e0
    [] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
    RIP [] bd_mount+0x52/0xa0
    RSP
    ---[ end trace 13690ad962168b98 ]---

    mount_pseudo() returns ERR_PTR(), not NULL, on error.
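
To illustrate why a NULL test misses this kind of failure, here is a userspace sketch of the error-pointer convention. All names prefixed with model_ are hypothetical re-implementations for illustration; they mimic the idea behind the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers but are not the kernel definitions:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *model_err_ptr(long err)
{
    assert(err < 0 && err >= -MODEL_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top MODEL_MAX_ERRNO addresses. */
static int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static long model_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Stand-in for a function that reports failure via an error pointer. */
static void *model_mount(int fail)
{
    static int root;            /* dummy object for the success case */
    return fail ? model_err_ptr(-ENOMEM) : (void *)&root;
}

/* Caller that checks the result correctly. */
static long model_bd_mount(int fail)
{
    void *dentry = model_mount(fail);

    /* A plain "if (!dentry)" test would not catch the failure here. */
    if (model_is_err(dentry))
        return model_ptr_err(dentry);

    return 0;
}
```

The failing case returns a non-NULL pointer, which is why code that only tests for NULL goes on to dereference an invalid address.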

    Fixes: 3684aa7099e0 ("block-dev: enable writeback cgroup support")
    Cc: Shaohua Li
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: stable@vger.kernel.org
    Signed-off-by: Vegard Nossum
    Signed-off-by: Jens Axboe

    Vegard Nossum
     

08 Aug, 2016

1 commit

  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

04 Aug, 2016

1 commit

  • The functionality for block device DAX was already removed with commit
    acc93d30d7d4 ("Revert "block: enable dax for raw block devices"")

    However, we still had a config option hanging around that was always
    disabled because it depended on CONFIG_BROKEN. This config option was
    introduced in commit 03cdadb04077 ("block: disable block device DAX by
    default")

    This change reverts that commit, removing the dead config option.

    Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Acked-by: Dan Williams
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounters credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, and for putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

29 Jul, 2016

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

21 Jul, 2016

1 commit

  • Currently, presence of direct_access() in block_device_operations
    indicates support of DAX on its block device. Because
    block_device_operations is instantiated with 'const', this DAX
    capability may not be enabled conditionally.

    In preparation for supporting DAX to device-mapper devices, add
    QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
    support. This will allow setting the DAX capability based on how
    the mapped device is composed.

    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Signed-off-by: Mike Snitzer
    Cc: Jens Axboe
    Cc: Ross Zwisler
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc:
    Signed-off-by: Jens Axboe

    Toshi Kani
     

20 Jul, 2016

1 commit

  • Since 2006 we have ->i_bdev pinning bdev in question, so there's no
    way to get to bdev ->evict_inode() while there's an aliasing inode
    anywhere. In other words, the only place walking the list of aliases
    is guaranteed to do it only when the list is empty...

    Remove the detritus; it should've been done in "[PATCH] Fix a race
    condition between ->i_mapping and iput()", but nobody had noticed it
    back then.

    Signed-off-by: Al Viro

    Al Viro
     

24 Jun, 2016

1 commit

  • Introduce a function may_open_dev that tests MNT_NODEV and a new
    superblock flag SB_I_NODEV. Use this new function in all of the
    places where MNT_NODEV was previously tested.

    Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
    filesystems should never support device nodes, and a simple superblock
    flag makes that very hard to get wrong. With SB_I_NODEV set, if any
    device nodes somehow manage to show up on a filesystem, those
    device nodes will be unopenable.
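
The combined test can be sketched as below. This is a userspace model only: sb_model, mnt_model, may_open_dev_model() and the two flag values are hypothetical and chosen for illustration, not taken from kernel headers. It shows a device node being openable only when neither the per-mount flag nor the per-superblock flag forbids it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define MODEL_MNT_NODEV   0x02u /* per-mount "no device nodes" flag */
#define MODEL_SB_I_NODEV  0x04u /* per-superblock "no device nodes" flag */

struct sb_model {
    unsigned int s_iflags;      /* internal superblock flags */
};

struct mnt_model {
    unsigned int mnt_flags;     /* mount flags */
    const struct sb_model *mnt_sb;
};

/*
 * A device node reached through this mount may be opened only if
 * neither the mount nor its superblock is marked nodev.
 */
static bool may_open_dev_model(const struct mnt_model *mnt)
{
    assert(mnt != NULL && mnt->mnt_sb != NULL);

    return !(mnt->mnt_flags & MODEL_MNT_NODEV) &&
           !(mnt->mnt_sb->s_iflags & MODEL_SB_I_NODEV);
}
```

Putting the flag on the superblock means every mount of that filesystem inherits the restriction, whatever its mount flags say.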

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

24 May, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this update was stabilized before the merge window and
    appeared in -next. The "device dax" implementation was revised this
    week in response to review feedback, and to address failures detected
    by the recently expanded ndctl unit test suite.

    Not included in this pull request are two dax topic branches (dax
    error handling, and dax radix-tree locking). These topics were
    deferred to get a few more days of -next integration testing, and to
    coordinate a branch baseline with Ted and the ext4 tree. Vishal and
    Ross will send the error handling and locking topics respectively in
    the next few days.

    This branch has received a positive build result from the kbuild robot
    across 226 configs.

    Summary:

    - Device DAX for persistent memory: Device DAX is the device-centric
    analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
    ranges to be allocated and mapped without need of an intervening
    file system. Device DAX is strict, precise and predictable.
    Specifically this interface:

    a) Guarantees fault granularity with respect to a given page size
    (pte, pmd, or pud) set at configuration time.

    b) Enforces deterministic behavior by being strict about what
    fault scenarios are supported.

    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature
    differentiated memory ranges.

    - Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.

    - Further ACPI 6.1 compliance with support for the common dimm
    identifier format.

    - Various fixes and cleanups across the subsystem"

    * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
    libnvdimm, dax: fix deletion
    libnvdimm, dax: fix alignment validation
    libnvdimm, dax: autodetect support
    libnvdimm: release ida resources
    Revert "block: enable dax for raw block devices"
    /dev/dax, core: file operations and dax-mmap
    /dev/dax, pmem: direct access to persistent memory
    libnvdimm: stop requiring a driver ->remove() method
    libnvdimm, dax: record the specified alignment of a dax-device instance
    libnvdimm, dax: reserve space to store labels for device-dax
    libnvdimm, dax: introduce device-dax infrastructure
    nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
    tools/testing/nvdimm: ND_CMD_CALL support
    nfit: disable vendor specific commands
    nfit: export subsystem ids as attributes
    nfit: fix format interface code byte order per ACPI6.1
    nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
    nfit, libnvdimm: clarify "commands" vs "_DSMs"
    libnvdimm: increase max envelope size for ioctl
    acpi/nfit: Add sysfs "id" for NVDIMM ID
    ...

    Linus Torvalds
     

21 May, 2016

1 commit


19 May, 2016

1 commit

  • 1/ If a mapping overlaps a bad sector fail the request.

    2/ Do not opportunistically report more dax-capable capacity than is
    requested when errors are present.
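
Point 1/ amounts to an interval-overlap test between the requested mapping and the known bad ranges. A minimal sketch, assuming a flat array of bad ranges; bad_range and overlaps_bad_range() are hypothetical names and do not reflect how the kernel stores its bad-block list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one run of bad sectors. */
struct bad_range {
    uint64_t start;             /* first bad sector */
    uint64_t len;               /* number of bad sectors */
};

/*
 * Return true if the half-open sector range [sector, sector + nsect)
 * overlaps any entry in 'bad'. A caller would fail the request in
 * that case instead of handing out a mapping.
 */
static bool overlaps_bad_range(uint64_t sector, uint64_t nsect,
                               const struct bad_range *bad, size_t nbad)
{
    assert(bad != NULL || nbad == 0);

    for (size_t i = 0; i < nbad; i++) {
        if (sector < bad[i].start + bad[i].len &&
            bad[i].start < sector + nsect)
            return true;
    }
    return false;
}
```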

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    [vishal: fix a conflict with system RAM collision patches]
    [vishal: add a 'size' parameter to ->direct_access]
    [vishal: fix a conflict with DAX alignment check patches]
    Signed-off-by: Vishal Verma

    Dan Williams
     

17 May, 2016

4 commits

  • blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
    to remain as a separate interface for checking dax capability of
    a raw block device.

    Rename and relocate blkdev_dax_capable() to keep them maintained
    consistently, and call bdev_direct_access() for the dax capability
    check.

    There is no change in the behavior.

    Link: https://lkml.org/lkml/2016/5/9/950
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • DAX imposes additional requirements on a device. Add
    bdev_dax_supported() which performs all the precondition checks
    necessary for a filesystem to mount the device with the dax option.

    Also add a new check to verify that a partition is aligned to 4KB.
    When a partition is unaligned, any dax read/write access fails,
    except for metadata updates.
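
The alignment check can be illustrated with a small helper that works on a partition's starting sector. This is a sketch under stated assumptions: 512-byte sectors and 4KB pages are assumed, and dax_partition_aligned() is a hypothetical name, not the kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SECTOR_SIZE 512u  /* assumption: 512-byte sectors */
#define MODEL_PAGE_SIZE   4096u /* assumption: 4KB pages */

/*
 * A partition is usable for DAX in this model only if its first
 * sector falls on a page boundary, i.e. its start sector is a
 * multiple of the number of sectors per page.
 */
static bool dax_partition_aligned(uint64_t start_sect)
{
    const uint64_t sectors_per_page = MODEL_PAGE_SIZE / MODEL_SECTOR_SIZE;

    assert(MODEL_PAGE_SIZE % MODEL_SECTOR_SIZE == 0);

    return start_sect % sectors_per_page == 0;
}
```

With these assumptions a partition starting at sector 2048 passes, while the legacy starting sector 63 does not.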

    Signed-off-by: Toshi Kani
    Reviewed-by: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • In preparation for moving DAX capability checks to the block layer
    from filesystem code, add a VFS message interface that aligns with
    filesystem's message format.

    For instance, a vfs_msg() message followed by XFS messages in case
    of a dax mount error may look like:

    VFS (pmem0p1): error: unaligned partition for dax
    XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
    XFS (pmem0p1): Mounting V5 Filesystem
    :

    vfs_msg() is largely based on ext4_msg().

    Signed-off-by: Toshi Kani
    Reviewed-by: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     
  • Fault handlers currently take a complete_unwritten argument to convert
    unwritten extents after PTEs are updated. However no filesystem uses
    this anymore as the code is racy. Remove the unused argument.

    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Vishal Verma

    Jan Kara
     

02 May, 2016

3 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is the revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

19 Mar, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "Here are the core block changes for this merge window. Not a lot of
    exciting stuff going on in this round, most of the changes have been
    on the driver side of things. That pull request is coming next. This
    pull request contains:

    - A set of fixes for chained bio handling from Christoph.

    - A tag bounds check for blk-mq from Hannes, ensuring that we don't
    do something stupid if a device reports an invalid tag value.

    - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

    - A set of blk-mq fixes from Keith, adding support for dynamic
    hardware queues, and fixing init of max_dev_sectors for stacking
    devices.

    - A fix for the dynamic hw context from Ming.

    - Enabling of cgroup writeback support on a block device, from
    Shaohua"

    * 'for-4.6/core' of git://git.kernel.dk/linux-block:
    blk-mq: add bounds check on tag-to-rq conversion
    block: bio_remaining_done() isn't unlikely
    block: cleanup bio_endio
    block: factor out chained bio completion
    block: don't unecessarily clobber bi_error for chained bios
    block-dev: enable writeback cgroup support
    blk-mq: Fix NULL pointer updating nr_requests
    blk-mq: mark request queue as mq asap
    block: Initialize max_dev_sectors to 0
    blk-mq: dynamic h/w context count
    cfq-iosched: Allow parent cgroup to preempt its child
    cfq-iosched: Allow sync noidle workloads to preempt each other
    cfq-iosched: Reorder checks in cfq_should_preempt()
    cfq-iosched: Don't group_idle if cfqq has big thinktime
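
    The tag bounds check mentioned in the pull message can be pictured
    with a small userspace sketch. The struct layout and the name
    tag_to_rq() are hypothetical stand-ins, not the blk-mq definitions;
    the point is only that an out-of-range tag yields NULL instead of an
    out-of-bounds array read.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct request. */
struct request {
    int id;
};

/* Hypothetical tag map: nr_tags valid slots in rqs[]. */
struct tag_set {
    unsigned int nr_tags;
    struct request **rqs;
};

/* Return the request for a tag, or NULL if the device reported a bogus tag. */
static struct request *tag_to_rq(const struct tag_set *tags, unsigned int tag)
{
    if (tag < tags->nr_tags)
        return tags->rqs[tag];
    return NULL;
}
```

    A caller would then treat a NULL result as an invalid completion
    rather than dereferencing it.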

    Linus Torvalds
     

04 Mar, 2016

1 commit


28 Feb, 2016

2 commits

  • Previously calls to dax_writeback_mapping_range() for all DAX filesystems
    (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

    dax_writeback_mapping_range() needs a struct block_device, and it used
    to get that from inode->i_sb->s_bdev. This is correct for normal inodes
    mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
    block devices and for XFS real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).

    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Cc: Al Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • The recent *sync enabling discovered that we are inserting into the
    block_device pagecache counter to the expectations of the dirty data
    tracking for dax mappings. This can lead to data corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support broken so it's disabled by default, but otherwise still
    available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

06 Feb, 2016

1 commit

  • Previously the pfn_mkwrite() fault handler for raw block devices called
    blkdev_dax_fault() -> __dax_fault() to do a full DAX page fault.

    Really what the pfn_mkwrite() fault handler needs to do is call
    dax_pfn_mkwrite() to make sure that the radix tree entry for the given
    PTE is marked as dirty so that a follow-up fsync or msync call will
    flush it durably to media.

    Fixes: 5a023cdba50c ("block: enable dax for raw block devices")
    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

31 Jan, 2016

1 commit

  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Jan, 2016

1 commit

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     

23 Jan, 2016

2 commits

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.
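
    A rough sketch of how such an exceptional entry could pack a sector
    and a PTE/PMD type into a pointer-sized slot follows. The bit
    assignments and all names here are assumptions chosen for
    illustration; the kernel's actual layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define EXCEPTIONAL_BIT  0x2UL                      /* "not a struct page pointer" */
#define DAX_PTE          (0x4UL | EXCEPTIONAL_BIT)  /* entry covers a PTE fault */
#define DAX_PMD          (0x8UL | EXCEPTIONAL_BIT)  /* entry covers a PMD fault */
#define DAX_TYPE_MASK    0xfUL
#define DAX_SHIFT        4

/* Pack a sector number and the fault size into one tree slot value. */
static void *dax_entry(uint64_t sector, int is_pmd)
{
    uintptr_t v = ((uintptr_t)sector << DAX_SHIFT) |
                  (is_pmd ? DAX_PMD : DAX_PTE);
    return (void *)v;
}

/* Real pointers are at least 4-byte aligned, so bit 1 is free as a marker. */
static int entry_is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & EXCEPTIONAL_BIT) != 0;
}

static int entry_is_pmd(const void *entry)
{
    return ((uintptr_t)entry & DAX_TYPE_MASK) == DAX_PMD;
}

/* Recover the sector that fsync/msync would write back. */
static uint64_t entry_sector(const void *entry)
{
    return (uint64_t)((uintptr_t)entry >> DAX_SHIFT);
}
```

    Because the marker bit lives in the alignment padding of a normal
    pointer, a lookup can distinguish a struct page pointer from a DAX
    entry without any extra storage.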

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Wrappers for ->i_mutex access, parallel to
    mutex_{lock,unlock,trylock,is_locked,lock_nested}, with
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
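
    The wrapper pattern can be sketched in userspace as below. The
    struct mutex here is a toy spin flag built on C11 atomics, standing
    in for the kernel mutex; only the inode_foo() -> mutex_foo() mapping
    comes from the message above, and inode_lock_nested() is left out
    because the toy lock has no lockdep subclasses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct mutex: one flag, spins instead of sleeping. */
struct mutex {
    atomic_bool locked;
};

static void mutex_lock(struct mutex *m)
{
    while (atomic_exchange(&m->locked, true))
        ;  /* spin until the current holder releases the flag */
}

static void mutex_unlock(struct mutex *m)
{
    atomic_store(&m->locked, false);
}

/* Mirrors the kernel convention: 1 on success, 0 if already held. */
static int mutex_trylock(struct mutex *m)
{
    return !atomic_exchange(&m->locked, true);
}

static int mutex_is_locked(struct mutex *m)
{
    return atomic_load(&m->locked);
}

struct inode {
    struct mutex i_mutex;
};

/* Callers use these instead of touching ->i_mutex directly. */
static inline void inode_lock(struct inode *inode)      { mutex_lock(&inode->i_mutex); }
static inline void inode_unlock(struct inode *inode)    { mutex_unlock(&inode->i_mutex); }
static inline int  inode_trylock(struct inode *inode)   { return mutex_trylock(&inode->i_mutex); }
static inline int  inode_is_locked(struct inode *inode) { return mutex_is_locked(&inode->i_mutex); }
```

    With every caller going through the wrappers, the lock type behind
    them can later be changed in one place.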

    Signed-off-by: Al Viro

    Al Viro
     

20 Jan, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "We don't have a lot of core changes this time around, it's mostly in
    drivers, which will come in a subsequent pull.

    The core changes include:

    - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

    - Cleanup from Geliang Tang, using offset_in_page() instead of open
    coding it.

    - From Ilya, rename request_queue slab so it reflects what it holds,
    and a fix for proper use of bdgrab/put.

    - A real fix for the split across stripe boundaries from Keith. We
    yanked a broken version of this from 4.4-rc final, this one works.

    - From Mike Krinkin, emit a trace message when we split.

    - From Wei Tang, two small cleanups, not explicitly clearing memory
    that is already cleared"

    * 'for-4.5/core' of git://git.kernel.dk/linux-block:
    block: use bd{grab,put}() instead of open-coding
    block: split bios to max possible length
    block: add call to split trace point
    blk-mq: Avoid memoryless numa node encoded in hctx numa_node
    blk-mq: Reuse hardware context cpumask for tags
    blk-mq: add a flags parameter to blk_mq_alloc_request
    Revert "blk-flush: Queue through IO scheduler when flush not required"
    block: clarify blk_add_timer() use case for blk-mq
    bio: use offset_in_page macro
    block: do not initialise statics to 0 or NULL
    block: do not initialise globals to 0 or NULL
    block: rename request_queue slab cache
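
    For the offset_in_page() cleanup listed above, a small sketch of the
    open-coded form and the macro form side by side. The macro is
    redefined locally with an assumed 4 KiB page size so that the
    example builds outside the kernel; the two helper function names are
    hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define PAGE_SIZE  4096UL
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Local re-creation of the helper: byte offset of an address within its page. */
#define offset_in_page(p)  ((uintptr_t)(p) & ~PAGE_MASK)

/* Open-coded variant, the style the cleanup replaces. */
static unsigned long offset_open_coded(const void *p)
{
    return (unsigned long)((uintptr_t)p & (PAGE_SIZE - 1));
}

/* Same computation expressed through the named helper. */
static unsigned long offset_with_macro(const void *p)
{
    return (unsigned long)offset_in_page(p);
}
```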

    Linus Torvalds
     

16 Jan, 2016

2 commits

  • The DAX implementation needs to protect new calls to ->direct_access()
    and usage of its return value against the driver for the underlying
    block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
    hold off blk_cleanup_queue() from proceeding, or otherwise fail new
    mapping requests if the request_queue is being torn down.

    This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
    through dax_map_atomic() to bdev_direct_access().
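
    The enter/exit guard described above can be modelled, very loosely,
    as a usage count plus a "dying" flag. This is a single-threaded
    sketch with hypothetical names (struct queue, queue_enter(),
    queue_exit(), queue_try_cleanup()); it uses no atomics and the
    -ENODEV return value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request_queue. */
struct queue {
    bool dying;          /* set once teardown has started */
    unsigned int users;  /* callers currently inside the queue */
};

/* Take a reference for a new mapping request; refuse if teardown has begun. */
static int queue_enter(struct queue *q)
{
    if (q->dying)
        return -ENODEV;
    q->users++;
    return 0;
}

/* Drop the reference taken by queue_enter(). */
static void queue_exit(struct queue *q)
{
    assert(q->users > 0);
    q->users--;
}

/* Start teardown; it may only complete once every user has left. */
static bool queue_try_cleanup(struct queue *q)
{
    q->dying = true;
    return q->users == 0;
}
```

    A direct-access caller would bracket its use of the mapped address
    between queue_enter() and queue_exit(), so teardown cannot finish
    while the address is still in use.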

    [willy@linux.intel.com: fix read() of a hole]
    Signed-off-by: Dan Williams
    Reviewed-by: Jeff Moyer
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • If a ->direct_access() implementation ever returns a map count less than
    PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies
    error checking in upper layers.
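
    A minimal sketch of that kind of centralized check, assuming a 4 KiB
    page and a hypothetical helper name. The choice of -ENXIO for a
    short mapping is an assumption; the message above does not say which
    error is returned.

```c
#include <assert.h>
#include <errno.h>

/* Assumption for illustration: 4 KiB pages. */
#define PAGE_SIZE 4096L

/*
 * Validate the byte count reported by a driver's direct-access hook.
 * Negative values are driver errors and pass through unchanged; a count
 * smaller than one page is turned into an error here so that callers
 * never need to handle a partial page themselves.
 */
static long check_direct_access_len(long avail)
{
    if (avail < 0)
        return avail;
    if (avail < PAGE_SIZE)
        return -ENXIO;  /* assumed error code */
    return avail;
}
```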

    Signed-off-by: Dan Williams
    Reported-by: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

15 Jan, 2016

1 commit