12 Oct, 2016

1 commit

  • Patch series "fallocate for block devices", v11.

    This is a patchset to fix page cache coherency with BLKZEROOUT and
    implement fallocate for block devices.

    The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
    the page cache if the zeroing command to the underlying device succeeds.
    Without this patch we still have the pagecache coherence bug that's been
    in the kernel forever.

    The second patch changes the internal block device functions to reject
    attempts to discard or zeroout that are not aligned to the logical block
    size. Previously, we only checked that the start/len parameters were
    512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
    devices.
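The alignment rule described for the second patch can be sketched in userspace C. This is only an illustration: the helper name `range_is_lbs_aligned` and its parameters are hypothetical and not taken from the kernel, and the logical block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true only if both the start offset and the
 * length are multiples of the logical block size (lbs), in bytes.
 * lbs is assumed to be a non-zero power of two (512, 4096, ...).
 */
static bool range_is_lbs_aligned(uint64_t start, uint64_t len, uint32_t lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);
	return ((start | len) & (uint64_t)(lbs - 1)) == 0;
}
```

A range starting at byte 512 passes a 512-byte check but fails against a 4096-byte logical block size, which is the case the old check let through.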

    The third patch creates an fallocate handler for block devices, wires up
    the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
    FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
    fallocate interface between files and block devices. It also allows the
    combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.
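The flag-to-operation mapping described above might be sketched as follows. The enum and the function `blk_fallocate_op` are hypothetical names for illustration; only the FALLOC_FL_* constants come from the `<linux/falloc.h>` UAPI header. Ignoring FALLOC_FL_KEEP_SIZE when selecting the operation is a simplification of this sketch, not a statement about the kernel handler.

```c
#include <assert.h>
#include <linux/falloc.h>

/* Hypothetical labels for the three device operations named in the text. */
enum blk_op {
	BLK_OP_UNSUPPORTED,
	BLK_OP_ZERO_RANGE,       /* write-same of zeroes */
	BLK_OP_ZEROING_DISCARD,  /* discard that must read back as zeroes */
	BLK_OP_DISCARD           /* discard that may expose stale data */
};

static enum blk_op blk_fallocate_op(int mode)
{
	assert(mode >= 0);

	/* Sketch-only simplification: KEEP_SIZE does not pick the operation. */
	switch (mode & ~FALLOC_FL_KEEP_SIZE) {
	case FALLOC_FL_ZERO_RANGE:
		return BLK_OP_ZERO_RANGE;
	case FALLOC_FL_PUNCH_HOLE:
		return BLK_OP_ZEROING_DISCARD;
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE:
		return BLK_OP_DISCARD;
	default:
		return BLK_OP_UNSUPPORTED;
	}
}
```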

    Test cases for the new block device fallocate are now in xfstests as
    generic/349-351.

    This patch (of 3):

    Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
    returning stale cache contents at a later time.

    Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"
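The page-alignment requirement for DAX mounts mentioned in the list above could be checked along these lines. This is a userspace sketch under stated assumptions: `partition_is_page_aligned` and its parameters are hypothetical, sectors are taken to be 512 bytes, and checking both the start and the size of the partition is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512u

/*
 * Hypothetical check: a partition is usable for DAX only if it starts on a
 * page boundary and covers a whole number of pages.
 */
static bool partition_is_page_aligned(uint64_t start_sector,
				      uint64_t nr_sectors,
				      uint32_t page_size)
{
	uint32_t sectors_per_page;

	assert(page_size >= SKETCH_SECTOR_SIZE &&
	       page_size % SKETCH_SECTOR_SIZE == 0);
	sectors_per_page = page_size / SKETCH_SECTOR_SIZE;

	return start_sector % sectors_per_page == 0 &&
	       nr_sectors % sectors_per_page == 0;
}
```

A partition starting at sector 63, a common legacy offset, fails this check with 4096-byte pages, while one starting at sector 2048 passes.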

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

21 May, 2016

1 commit


17 May, 2016

1 commit

  • blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
    to remain as a separate interface for checking dax capability of
    a raw block device.

    Rename and relocate blkdev_dax_capable() to keep them maintained
    consistently, and call bdev_direct_access() for the dax capability
    check.

    There is no change in the behavior.

    Link: https://lkml.org/lkml/2016/5/9/950
    Signed-off-by: Toshi Kani
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: Andreas Dilger
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Cc: Boaz Harrosh
    Signed-off-by: Vishal Verma

    Toshi Kani
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

31 Jan, 2016

1 commit

  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

23 Jan, 2016

1 commit

  • Add wrappers parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    with inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

10 Jan, 2016

1 commit


09 Jan, 2016

1 commit

  • If an application wants exclusive access to all of the persistent memory
    provided by an NVDIMM namespace it can use this raw-block-dax facility
    to forgo establishing a filesystem. This capability is targeted
    primarily to hypervisors wanting to provision persistent memory for
    guests. It can be disabled / enabled dynamically via the new BLKDAXSET
    ioctl.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Oct, 2015

2 commits

  • This commit adds a driver API and ioctls for controlling Persistent
    Reservations generically at the block layer. Persistent
    Reservations are supported by SCSI and NVMe and allow controlling who gets
    access to a device in a shared storage setup.

    Note that we add a pr_ops structure to struct block_device_operations
    instead of adding the members directly to avoid bloating all instances
    of devices that will never support Persistent Reservations.
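The space argument for a separate pr_ops structure can be illustrated with a small sketch. All type names here (`pr_ops_sketch`, `bdev_ops_sketch`, `blk_dev`) and the simplified one-argument signatures are hypothetical, and returning -EOPNOTSUPP for devices without support is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct blk_dev;	/* opaque stand-in for a block device */

/* Simplified table of Persistent Reservation callbacks. */
struct pr_ops_sketch {
	int (*pr_register)(struct blk_dev *dev);
	int (*pr_reserve)(struct blk_dev *dev);
	int (*pr_release)(struct blk_dev *dev);
	int (*pr_preempt)(struct blk_dev *dev);
	int (*pr_clear)(struct blk_dev *dev);
};

/* Each driver's ops table pays for one pointer, not five callbacks. */
struct bdev_ops_sketch {
	int (*open)(struct blk_dev *dev);
	const struct pr_ops_sketch *pr_ops;	/* NULL: no PR support */
};

static int bdev_pr_clear(const struct bdev_ops_sketch *ops, struct blk_dev *dev)
{
	assert(ops != NULL);
	if (!ops->pr_ops || !ops->pr_ops->pr_clear)
		return -EOPNOTSUPP;
	return ops->pr_ops->pr_clear(dev);
}

/* Demo callback and table for a driver that does support PR. */
static int demo_pr_clear(struct blk_dev *dev)
{
	(void)dev;
	return 0;
}

static const struct pr_ops_sketch demo_pr_ops = {
	.pr_clear = demo_pr_clear,
};
```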

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out helpers for all non-trivial ioctls to make this function simpler,
    and also start passing around a pointer version of the argument, as that's
    what most ioctl handlers actually need.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

20 May, 2015

2 commits

  • The only possible problem of using mutex_lock() instead of trylock
    is about deadlock.

    If there aren't any locks held before calling blkdev_reread_part(),
    deadlock can't be caused by this conversion.

    If there are locks held before calling blkdev_reread_part(),
    and if these locks aren't required in the open and close handlers or
    the I/O path, the conversion shouldn't cause deadlock either.

    Both user space's ioctl(BLKRRPART) and md_setup_drive() from
    init/do_mounts_md.c belong to the 1st case, so the conversion is safe
    for the two cases.

    For loop, the previous patches in this patchset have fixed the ABBA lock
    dependency, so the conversion is OK.

    For nbd, tx_lock is held when calling the function:

    - both open and release won't hold the lock
    - when blkdev_reread_part() is run, I/O thread has been stopped
    already, so tx_lock won't be acquired in I/O path at that time.
    - so the conversion won't cause deadlock for nbd

    For dasd, dasd_open(), dasd_release() and the request function don't
    acquire any mutex/semaphore, so the conversion should be safe.

    Reviewed-by: Christoph Hellwig
    Tested-by: Jarod Wilson
    Acked-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch exports blkdev_reread_part() for block drivers and also
    introduces __blkdev_reread_part().

    For some drivers, such as loop, reread of partitions can be run
    from the release path, and bd_mutex may already be held prior to
    calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
    __blkdev_reread_part for use in such cases.

    CC: Christoph Hellwig
    CC: Jens Axboe
    CC: Tejun Heo
    CC: Alexander Viro
    CC: Markus Pargmann
    CC: Stefan Weinhuber
    CC: Stefan Haberland
    CC: Sebastian Ott
    CC: Fabian Frederick
    CC: Ming Lei
    CC: David Herrmann
    CC: Andrew Morton
    CC: Peter Zijlstra
    CC: nbd-general@lists.sourceforge.net
    CC: linux-s390@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jarod Wilson
     

22 Jan, 2015

1 commit

  • blkdev_issue_zeroout() will zero a given block range. This is done by
    way of explicit writing, thus provisioning or allocating the blocks on
    disk.

    There are use cases where the desired behavior is to zero the blocks but
    unprovision them if possible. The blocks must deterministically contain
    zeroes when they are subsequently read back.

    This patch adds a flag to blkdev_issue_zeroout() that provides this
    variant. If the discard flag is set and a block device guarantees
    discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
    the device does not support discard_zeroes_data or if the discard
    request fails we will fall back to first REQ_WRITE_SAME and then a
    regular REQ_WRITE.

    Also update the callers of blkdev_issue_zeroout() to reflect the new flag
    and make sb_issue_zeroout() prefer the discard approach.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

09 Sep, 2014

1 commit

  • bdev_get_queue() returns the request_queue associated with the
    specified block_device. blk_get_backing_dev_info() makes use of
    bdev_get_queue() to determine the associated bdi given a block_device.

    All the callers of bdev_get_queue() including
    blk_get_backing_dev_info() assume that bdev_get_queue() may return
    NULL and implement NULL handling; however, bdev_get_queue() requires
    the passed in block_device is opened and attached to its gendisk.
    Because an active gendisk always has a valid request_queue associated
    with it, bdev_get_queue() can never return NULL and neither can
    blk_get_backing_dev_info().

    Make it clear that neither of the two functions can return NULL and
    remove NULL handling from all the callers.

    Signed-off-by: Tejun Heo
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jul, 2014

1 commit

  • The BLKSECTGET ioctl stores the request queue's max_sectors through the
    argument pointer as an unsigned short. So if max_sectors is greater
    than USHRT_MAX, its upper 16 bits are silently discarded.

    In that case, returning USHRT_MAX is preferable to returning the lower
    16 bits of max_sectors.
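The clamping behaviour can be shown with a short sketch. The function name `sectget_value` is hypothetical; only USHRT_MAX from `<limits.h>` is standard.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: the value reported to user space, saturated at
 * USHRT_MAX instead of being truncated to the low bits.
 */
static unsigned short sectget_value(unsigned int max_sectors)
{
	return max_sectors > USHRT_MAX ? USHRT_MAX
				       : (unsigned short)max_sectors;
}
```

Where unsigned short is 16 bits wide, a plain cast of 70000 would yield 4464, which understates the limit by far more than the saturated 65535 does.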

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: "James E.J. Bottomley"
    Cc: Douglas Gilbert
    Cc: linux-scsi@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Akinobu Mita
     

09 Nov, 2013

1 commit


11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
     

20 Sep, 2012

1 commit


18 Sep, 2012

1 commit


01 Aug, 2012

1 commit

  • Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
    allows altering the size of an existing partition, even if it is currently
    in use.

    This patch protects hd_struct->nr_sects with a sequence counter, because
    one might extend a partition while IO is happening to it, and the update
    of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
    can lead to issues like reading an inconsistent size of a partition. A
    sequence counter is used so that readers don't have to take the bdev mutex
    lock, as we call sector_in_part() very frequently.

    Now all accesses to hd_struct->nr_sects should happen using the sequence
    counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
    There is one exception though, set_capacity()/get_capacity(). I think
    theoretically the race should exist there too, but this patch does not
    modify set_capacity()/get_capacity() due to the sheer number of call sites
    and I am afraid that the change might break something. I have left that as
    a TODO item. We can handle it later if need be. This patch does not
    introduce any new races as such w.r.t. set_capacity()/get_capacity().
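The sequence-counter idea can be sketched in userspace with C11 atomics. This is not the kernel's seqcount implementation: the struct `part_size` and both functions are hypothetical, the 64-bit count is split into two 32-bit halves to model a store that is not atomic as a whole, and writers are assumed to be serialised by some outer lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct part_size {
	atomic_uint seq;	/* odd while an update is in progress */
	_Atomic uint32_t lo;	/* low half of the sector count */
	_Atomic uint32_t hi;	/* high half of the sector count */
};

/* Writer: make seq odd, update both halves, make seq even again. */
static void part_size_write(struct part_size *p, uint64_t nr_sects)
{
	unsigned int s = atomic_load(&p->seq);

	atomic_store(&p->seq, s + 1);
	atomic_store(&p->lo, (uint32_t)nr_sects);
	atomic_store(&p->hi, (uint32_t)(nr_sects >> 32));
	atomic_store(&p->seq, s + 2);
}

/* Reader: lock-free; retry if a write was in progress or intervened. */
static uint64_t part_size_read(struct part_size *p)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load(&p->seq);
		lo = atomic_load(&p->lo);
		hi = atomic_load(&p->hi);
		s2 = atomic_load(&p->seq);
	} while ((s1 & 1) || s1 != s2);

	return ((uint64_t)hi << 32) | lo;
}
```

Readers never block the writer; they only pay for a retry when an update overlaps their read, which suits a value that is read often and changed rarely.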

    v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Phillip Susi
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

11 Jan, 2012

1 commit


09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

06 Jan, 2012

1 commit

  • We're doing some odd things there, which already messes up various users
    (see the net/socket.c code that this removes), and it was going to add
    yet more crud to the block layer because of the incorrect error code
    translation.

    ENOIOCTLCMD is not an error return that should be returned to user mode
    from the "ioctl()" system call, but it should *not* be translated as
    EINVAL ("Invalid argument"). It should be translated as ENOTTY
    ("Inappropriate ioctl for device").
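The translation rule can be written down as a tiny function. ENOIOCTLCMD is a kernel-internal code that is not exported through the userspace `<errno.h>`, so the sketch defines its own constant with the kernel's value of 515; the names `SKETCH_ENOIOCTLCMD` and `ioctl_result_to_user` are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal "no such ioctl command" code, mirrored for the sketch. */
#define SKETCH_ENOIOCTLCMD 515

/*
 * Hypothetical helper applied just before returning to user space:
 * the internal code becomes -ENOTTY, everything else passes through.
 */
static long ioctl_result_to_user(long ret)
{
	if (ret == -SKETCH_ENOIOCTLCMD)
		return -ENOTTY;
	return ret;
}
```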

    That EINVAL confusion has apparently so permeated some code that the
    block layer actually checks for it, which is sad. We continue to do so
    for now, but add a big comment about how wrong that is, and we should
    remove it entirely eventually. In the meantime, this tries to keep the
    changes localized to just the EINVAL -> ENOTTY fix, and removing code
    that makes it harder to do the right thing.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of <linux/module.h>
    net: inet_timewait_sock doesnt need <linux/module.h>
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit


24 Aug, 2011

1 commit

  • There are cases where suppressing partition scan is useful - e.g. for
    lo devices and pseudo SATA devices which advertise to be a disk but
    get upset on partition scan (some port multiplier control devices show
    such behavior).

    This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
    regardless of the number of possible partitions. disk_partitionable()
    is renamed to disk_part_scan_enabled() as suppressing partition scan
    doesn't imply the device can't be partitioned using
    BLKPG_ADD/DEL_PARTITION calls from userland. show_partition() now
    directly tests disk_max_parts() to maintain backward-compatibility.

    -v2: Updated to make it clear that only partition scan is suppressed
    not partitioning itself as suggested by Kay Sievers.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Feb, 2011

1 commit

  • Adam Kovari and others reported that disconnecting a USB drive with
    an ntfs-3g filesystem would cause "kernel BUG at fs/inode.c:1421!" to
    be triggered.

    The BUG could be traced back to ioctl(BLKBSZSET), which would
    erroneously decrement the refcount on the bdev. This is because
    blkdev_get() expects the refcount to be already incremented and either
    returns success or decrements the refcount and returns an error.

    The bug was introduced by e525fd89 (block: make blkdev_get/put()
    handle exclusive access), which didn't take into account this behavior
    of blkdev_get().

    This fixes
    https://bugzilla.kernel.org/show_bug.cgi?id=29202
    (and likely 29792 too)

    Reported-by: Adam Kovari
    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

28 Nov, 2010

1 commit


18 Nov, 2010

1 commit


13 Nov, 2010

1 commit

  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open is for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    A @holder argument is added and, if @FMODE_EXCL is specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().
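The holder-based exclusive access described in this list can be modelled with a small sketch. The struct `bdev_sketch` and both functions are hypothetical, and the model is single-threaded: the real code has to perform the check and the claim atomically under a lock, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct bdev_sketch {
	void *holder;	/* current exclusive holder, NULL if none */
	int holders;	/* number of exclusive opens by that holder */
};

/* Claim at open time: succeeds if unclaimed or claimed by the same holder. */
static int claim_exclusive(struct bdev_sketch *b, void *holder)
{
	assert(b != NULL && holder != NULL);
	if (b->holder != NULL && b->holder != holder)
		return -EBUSY;
	b->holder = holder;
	b->holders++;
	return 0;
}

/* Release at close time: the device is free once the last claim is gone. */
static void release_exclusive(struct bdev_sketch *b)
{
	assert(b != NULL && b->holders > 0);
	if (--b->holders == 0)
		b->holder = NULL;
}
```

Because the check and the claim happen in one step at open time, a second exclusive opener is refused before it can disturb the first.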

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
     

10 Nov, 2010

2 commits

  • Structure hd_geometry is copied to userland with 4 padding bytes
    between cylinders and start fields uninitialized on 64-bit platforms.
    It leads to leaking of contents of kernel stack memory.

    Currently there is no memset() in real implementations of getgeo()
    in drivers/block/, so it makes sense to have memset() in blkdev_ioctl().
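The padding gap and the memset() fix can be illustrated in userspace. The struct `geo_sketch` is a local mirror of the four fields named above, not the kernel header, and the function names and the geometry values filled in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local mirror of the geometry layout described in the text. */
struct geo_sketch {
	unsigned char heads;
	unsigned char sectors;
	unsigned short cylinders;
	unsigned long start;
};

/* Bytes of padding the compiler places between cylinders and start. */
static size_t geometry_padding_bytes(void)
{
	return offsetof(struct geo_sketch, start) -
	       (offsetof(struct geo_sketch, cylinders) + sizeof(unsigned short));
}

/*
 * Zero the whole structure first, so the padding holds no stale stack
 * contents, then let the "driver" fill in the real fields.
 */
static void fill_geometry(struct geo_sketch *geo)
{
	assert(geo != NULL);
	memset(geo, 0, sizeof(*geo));
	geo->heads = 255;
	geo->sectors = 63;
	geo->cylinders = 1024;
	geo->start = 2048;
}
```

On a typical 64-bit Linux ABI, where unsigned long is 8 bytes, geometry_padding_bytes() reports the 4 bytes mentioned above; on 32-bit ABIs it reports none.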

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Jens Axboe

    Vasiliy Kulikov
     
  • Convert direct reads of an inode's i_size to using i_size_read().

    i_size_{read,write} use a seqcount to protect reads from accessing
    incomplete writes. Concurrent i_size_write()s require mutual exclusion
    to protect the seqcount that is used by i_size_{read,write}. But
    i_size_read() callers do not need to use additional locking.

    Signed-off-by: Mike Snitzer
    Acked-by: NeilBrown
    Acked-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

23 Oct, 2010

1 commit

  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     

17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used by synchronous
    callers. To issue cache flushes or barriers asynchronously the caller needs
    to set up a bio by itself with a completion callback to move the asynchronous
    state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Sep, 2010

1 commit

  • I'm reposting this patch series as v4 since there have been no additional
    comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches
    are against Linus's tree: 2bfc96a127bc1cc94d26bfaa40159966064f9c8c
    (2.6.36-rc3).

    Would this patchset be suitable for inclusion in an mm branch?

    This change adds a partition_meta_info struct which itself contains a
    union of structures that provide partition table specific metadata.

    This change leaves the union empty. The subsequent patch includes an
    implementation for CONFIG_EFI_PARTITION-based metadata.

    Signed-off-by: Will Drewry
    Signed-off-by: Jens Axboe

    Will Drewry
     

12 Aug, 2010

1 commit

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter