23 Mar, 2012

1 commit

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     

22 Mar, 2012

1 commit

  • Pull security subsystem updates for 3.4 from James Morris:
    "The main addition here is the new Yama security module from Kees Cook,
    which was discussed at the Linux Security Summit last year. Its
    purpose is to collect miscellaneous DAC security enhancements in one
    place. This also marks a departure in policy for LSM modules, which
    were previously limited to being standalone access control systems.
    Chromium OS is using Yama, and I believe there are plans for Ubuntu,
    at least.

    This patchset also includes maintenance updates for AppArmor, TOMOYO
    and others."

    Fix trivial conflict in due to the jumo_label->static_key
    rename.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
    AppArmor: Fix location of const qualifier on generated string tables
    TOMOYO: Return error if fails to delete a domain
    AppArmor: add const qualifiers to string arrays
    AppArmor: Add ability to load extended policy
    TOMOYO: Return appropriate value to poll().
    AppArmor: Move path failure information into aa_get_name and rename
    AppArmor: Update dfa matching routines.
    AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
    AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
    AppArmor: Add const qualifiers to generated string tables
    AppArmor: Fix oops in policy unpack auditing
    AppArmor: Fix error returned when a path lookup is disconnected
    KEYS: testing wrong bit for KEY_FLAG_REVOKED
    TOMOYO: Fix mount flags checking order.
    security: fix ima kconfig warning
    AppArmor: Fix the error case for chroot relative path name lookup
    AppArmor: fix mapping of META_READ to audit and quiet flags
    AppArmor: Fix underflow in xindex calculation
    AppArmor: Fix dropping of allowed operations that are force audited
    AppArmor: Add mising end of structure test to caps unpacking
    ...

    Linus Torvalds
     

20 Mar, 2012

1 commit


14 Feb, 2012

2 commits


24 Jan, 2012

1 commit

  • Per akpm suggestions alter the use of the term flush to be
    invalidate. The next patch will do this across all MM.

    This change is completely cosmetic.

    [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]

    Signed-off-by: Dan Magenheimer
    Cc: Kamezawa Hiroyuki
    Cc: Jan Beulich
    Reviewed-by: Seth Jennings
    Cc: Jeremy Fitzhardinge
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Nitin Gupta
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Rik Riel
    Cc: Andrew Morton
    [v10: Fixed fs: move code out of buffer.c conflict change]
    Signed-off-by: Konrad Rzeszutek Wilk

    Dan Magenheimer
     

18 Jan, 2012

1 commit

  • dd slept infinitely when fsfeeze failed because of EIO.
    To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
    the tasks waiting for the filesystem to become unfrozen.

    When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
    the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.

    However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
    freeze_super() returns an error number. In this case, FITHAW ioctl returns
    EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
    s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.

    Signed-off-by: Kazuya Mio
    Signed-off-by: Al Viro

    Kazuya Mio
     

07 Jan, 2012

3 commits

  • If there are any inodes on the super block that have been unlinked
    (i_nlink == 0) but have not yet been deleted then prevent the
    remounting the super block read-only.

    Reported-by: Toshiyuki Okajima
    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Currently remouting superblock read-only is racy in a major way.

    With the per mount read-only infrastructure it is now possible to
    prevent most races, which this patch attempts.

    Before starting the remount read-only, iterate through all mounts
    belonging to the superblock and if none of them have any pending
    writes, set sb->s_readonly_remount. This indicates that remount is in
    progress and no further write requests are allowed. If the remount
    succeeds set MS_RDONLY and reset s_readonly_remount.

    If the remounting is unsuccessful just reset s_readonly_remount.
    This can result in transient EROFS errors, despite the fact the
    remount failed. Unfortunately hodling off writes is difficult as
    remount itself may touch the filesystem (e.g. through load_nls())
    which would deadlock.

    A later patch deals with delayed writes due to nlink going to zero.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Keep track of vfsmounts belonging to a superblock. List is protected
    by vfsmount_lock.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Jan, 2012

3 commits

  • unfortunately, just checking MS_BORN after having grabbed ->s_umount in
    sget() is not enough; places that pick superblock from a list and
    grab s_umount shared need the same check in addition to checking for
    ->s_root; otherwise three-way race between failing mount, sget() and
    such list-walker can leave us with list-walker coming *second*, when
    temporary active ref grabbed by sget() (to be dropped when sget()
    notices that original mount has failed by checking MS_BORN) has
    lead to deactivate_locked_super() from failing ->mount() *not* doing
    ->kill_sb() and just releasing ->s_umount. Once sget() gets through
    and notices that MS_BORN had never been set it will drop the active
    ref and fs will be shut down and kicked out of all lists, but it's
    too late for something like sync_supers().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • some stuff in there can actually become static; some belongs to pnode.h
    as it's a private interface between namespace.c and pnode.c...

    Signed-off-by: Al Viro

    Al Viro
     

02 Nov, 2011

1 commit


01 Nov, 2011

1 commit


26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

21 Jul, 2011

3 commits

  • Now that the per-sb shrinker is responsible for shrinking 2 or more
    caches, increase the batch size to keep econmies of scale for
    shrinking each cache. Increase the shrinker batch size to 1024
    objects.

    To allow for a large increase in batch size, add a conditional
    reschedule to prune_icache_sb() so that we don't hold the LRU spin
    lock for too long. This mirrors the behaviour of the
    __shrink_dcache_sb(), and allows us to increase the batch size
    without needing to worry about problems caused by long lock hold
    times.

    To ensure that filesystems using the per-sb shrinker callouts don't
    cause problems, document that the object freeing method must
    reschedule appropriately inside loops.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now we have a per-superblock shrinker implementation, we can add a
    filesystem specific callout to it to allow filesystem internal
    caches to be shrunk by the superblock shrinker.

    Rather than perpetuate the multipurpose shrinker callback API (i.e.
    nr_to_scan == 0 meaning "tell me how many objects freeable in the
    cache), two operations will be added. The first will return the
    number of objects that are freeable, the second is the actual
    shrinker call.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code for the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do less
    superblock list traversals and locking and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

5 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does this exactly, so move it to fs/super.c
    and rename it to grab_super_passive() and exporting it via
    fs/internal.h for all the VFS code to be able to use.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With the inode LRUs moving to per-sb structures, there is no longer
    a need for a global inode_lru_lock. The locking can be made more
    fine-grained by moving to a per-sb LRU lock, isolating the LRU
    operations of different filesytsems completely from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The inode unused list is currently a global LRU. This does not match
    the other global filesystem cache - the dentry cache - which uses
    per-superblock LRU lists. Hence we have related filesystem object
    types using different LRU reclaimation schemes.

    To enable a per-superblock filesystem cache shrinker, both of these
    caches need to have per-sb unused object LRU lists. Hence this patch
    converts the global inode LRU to per-sb LRUs.

    The patch only does rudimentary per-sb propotioning in the shrinker
    infrastructure, as this gets removed when the per-sb shrinker
    callouts are introduced later on.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Call the given function for all superblocks of given type. Function
    gets a superblock (with s_umount locked shared) and (void *) argument
    supplied by caller of iterator.

    Signed-off-by: Al Viro

    Al Viro
     

12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Lets kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as merged.
    On network filesystems we might have had suid/caps set on another client,
    silently picked by this client on revalidate, all of that *without* clearing
    the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is
    * new superblock flag; unless set, S_NOSEC is not going to be set.
    * local block filesystems set it in their ->mount() (more accurately,
    mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
    local block ones clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
    it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
    inode attribute changes are picked from other clients.

    It's not an earth-shattering hole (anybody that can set suid on another client
    will almost certainly be able to write to the file before doing that anyway),
    but it's a bug that needs fixing.

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     

19 May, 2011

1 commit


25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

18 Mar, 2011

1 commit

  • new function: mount_fs(). Does all work done by vfs_kern_mount()
    except the allocation and filling of vfsmount; returns root dentry
    or ERR_PTR().

    vfs_kern_mount() switched to using it and taken to fs/namespace.c,
    along with its wrappers.

    alloc_vfsmnt()/free_vfsmnt() made static.

    functions in namespace.c slightly reordered.

    Signed-off-by: Al Viro

    Al Viro
     

17 Mar, 2011

2 commits


12 Feb, 2011

1 commit

  • In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
    inode instead of freeing the inode directly. It causes a crash when we
    rmmod immediately after we umount the volume[1].

    So we need to call rcu_barrier after we kill_sb so that the inode is
    freed before we do rmmod. The idea is inspired by Aneesh Kumar.
    rcu_barrier will wait for all callbacks to end before preceding. The
    original patch was done by Tao Ma, but synchronize_rcu() is not enough
    here.

    1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2

    Tested-by: Tao Ma
    Signed-off-by: Boaz Harrosh
    Cc: Nick Piggin
    Cc: Al Viro
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     

17 Jan, 2011

1 commit

  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

07 Jan, 2011

2 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • We can turn the dcache hash locking from a global dcache_hash_lock into
    per-bucket locking.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

13 Nov, 2010

2 commits

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions are with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     
  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open if for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if is @FMODE_EXCL specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo