14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

11 Jan, 2011

1 commit


10 Jan, 2011

10 commits


07 Jan, 2011

3 commits

  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (i.e. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
    0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
    we start protecting many other dentry members with d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
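
The d_lock-protected count described above can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: the `Dentry` class and its `get`, `put` and `try_reclaim` methods are hypothetical names, and `threading.Lock` stands in for the spinlock. The point it shows is that once the count is only touched with the lock held, a zero observed under the lock cannot change until the lock is dropped.

```python
import threading


class Dentry:
    """Toy stand-in for a dentry: a plain integer count guarded by a per-object lock."""

    def __init__(self):
        self.d_lock = threading.Lock()
        self.d_count = 0  # plain int, only read or written with d_lock held

    def get(self):
        """Take a reference."""
        with self.d_lock:
            self.d_count += 1

    def put(self):
        """Drop a reference; returns True when the count reached zero."""
        with self.d_lock:
            assert self.d_count > 0
            self.d_count -= 1
            return self.d_count == 0

    def try_reclaim(self):
        """Reclaim only if unused.

        While d_lock is held nobody can raise the count, so a zero
        observed here stays zero for the whole critical section.
        """
        with self.d_lock:
            if self.d_count != 0:
                return False
            # Teardown work would happen here, still under d_lock.
            return True
```

With an atomic counter and no lock, the check in `try_reclaim` and the teardown that follows it would need some other lock to keep a concurrent `get` out; here the per-object lock covers both.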
     

16 Dec, 2010

1 commit

  • In 2.6.37-rc1, the garbage collection ioctl of nilfs was broken by
    commit 263d90cefc7d82a0 ("nilfs2: remove own inode hash used for GC"),
    leading to filesystem corruption.

    That commit doesn't queue gc-inodes for the log writer if they are reused
    through the vfs inode cache. Here, a gc-inode is an inode that
    buffers blocks to be relocated during GC. That commit queues gc-inodes in
    the nilfs_init_gcinode() function, but this function is not called when
    they don't have the I_NEW flag. Thus, some live blocks are wrongly
    overwritten without being moved to new logs.

    This resolves the problem by moving the gc-inode queueing to an outer
    function to ensure it's done right.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     

24 Nov, 2010

1 commit


23 Nov, 2010

1 commit


13 Nov, 2010

2 commits

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions come with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     
  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are a combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access, as an
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open is for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if @FMODE_EXCL is specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
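
The claim-at-open idea in the message above can be sketched as a user-space model. This is not the kernel implementation: the `BlockDevice` class, its fields and the flag values are illustrative assumptions, and a Python lock stands in for whatever serialisation the block layer uses. It shows the check for an existing holder and the claim happening in one critical section, so no second opener can slip in between "open" and "claim".

```python
import threading

# Illustrative flag values for this model only.
FMODE_READ = 0x1
FMODE_EXCL = 0x80


class BlockDevice:
    """Toy model of a block device whose open path also handles exclusive claims."""

    def __init__(self):
        self._lock = threading.Lock()
        self.openers = 0
        self.holder = None   # current exclusive holder, if any
        self.holders = 0     # number of exclusive opens by that holder

    def get(self, mode, holder=None):
        """Open the device; with FMODE_EXCL, claim it for `holder` atomically."""
        with self._lock:
            if mode & FMODE_EXCL:
                if holder is None:
                    raise ValueError("FMODE_EXCL requires a holder")
                if self.holder is not None and self.holder is not holder:
                    return False  # already claimed by someone else
                self.holder = holder
                self.holders += 1
            self.openers += 1
            return True

    def put(self, mode):
        """Close the device; with FMODE_EXCL, also release one exclusive claim."""
        with self._lock:
            self.openers -= 1
            if mode & FMODE_EXCL:
                self.holders -= 1
                if self.holders == 0:
                    self.holder = None
```

In this model a non-exclusive open succeeds even while the device is claimed, and the same holder may claim more than once; both are assumptions of the sketch rather than statements about every kernel version.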
     

29 Oct, 2010

1 commit


27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • To help developers and applications gain visibility into writeback
    behaviour this patch adds two counters to /proc/vmstat.

    # grep nr_dirtied /proc/vmstat
    nr_dirtied 3747
    # grep nr_written /proc/vmstat
    nr_written 3618

    These entries allow user apps to understand writeback behaviour over time
    and learn how it is impacting their performance. Currently there is no
    way to inspect dirty and writeback speed over time. This is not possible
    with nr_dirty/nr_writeback.

    These entries are necessary to give visibility into writeback behaviour.
    We have /proc/diskstats which lets us understand the io in the block
    layer. We have blktrace for more in depth understanding. We have
    e2fsprogs and debugfs to give insight into the file system's behaviour,
    but we don't offer our users the ability to understand what writeback is
    doing. There is no way to know how active it is over the whole system, if
    it's falling behind or to quantify its efforts. With these values
    exported users can easily see how much data applications are sending
    through writeback and also at what rates writeback is processing this
    data. Comparing the rates of change between the two allows developers to
    see when writeback is not able to keep up with incoming traffic and the
    rate of dirty memory being sent to the IO back end. This allows folks to
    understand their io workloads and track kernel issues. Non kernel
    engineers at Google often use these counters to solve puzzling performance
    problems.

    Patch #4 adds a per-node vmstat file with nr_dirtied and nr_written

    Patch #5 adds writeback thresholds to /proc/vmstat

    Currently these values are in debugfs. But they should be promoted to
    /proc since they are useful for developers who are writing databases
    and file servers and are not debugging the kernel.

    The output is as below:

    # grep threshold /proc/vmstat
    nr_pages_dirty_threshold 409111
    nr_pages_dirty_background_threshold 818223

    This patch:

    This allows code outside of the mm core to safely manipulate page
    writeback state and not worry about the other accounting. Not using these
    routines means that some code will lose track of the accounting and we get
    bugs.

    Modify nilfs2 to use the interface.

    Signed-off-by: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Wu Fengguang
    Cc: KONISHI Ryusuke
    Cc: Jiro SEKIBA
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
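
The comparison of rates mentioned above can be sketched in user space. The helper names `parse_vmstat` and `writeback_rates` are hypothetical; the sketch assumes two snapshots of /proc/vmstat-style text taken a known number of seconds apart, with nr_dirtied and nr_written present as cumulative page counts.

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def writeback_rates(before, after, interval_s):
    """Return pages dirtied and written per second between two snapshots."""
    dirtied = (after["nr_dirtied"] - before["nr_dirtied"]) / interval_s
    written = (after["nr_written"] - before["nr_written"]) / interval_s
    return {
        "dirtied_per_s": dirtied,
        "written_per_s": written,
        # Writeback is not keeping up if pages are dirtied faster than written.
        "falling_behind": dirtied > written,
    }
```

On a system that exposes these counters, the two snapshots would come from reading /proc/vmstat twice with a sleep in between.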
     

26 Oct, 2010

1 commit


23 Oct, 2010

16 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (36 commits)
    nilfs2: eliminate sparse warning - "context imbalance"
    nilfs2: eliminate sparse warnings - "symbol not declared"
    nilfs2: get rid of bdi from nilfs object
    nilfs2: change license of exported header file
    nilfs2: add bdev freeze/thaw support
    nilfs2: accept 64-bit checkpoint numbers in cp mount option
    nilfs2: remove own inode allocator and destructor for metadata files
    nilfs2: get rid of back pointer to writable sb instance
    nilfs2: get rid of mi_nilfs back pointer to nilfs object
    nilfs2: see state of root dentry for mount check of snapshots
    nilfs2: use iget for all metadata files
    nilfs2: get rid of GCDAT inode
    nilfs2: add routines to redirect access to buffers of DAT file
    nilfs2: add routines to roll back state of DAT file
    nilfs2: add routines to save and restore bmap state
    nilfs2: do not allocate nilfs_mdt_info structure to gc-inodes
    nilfs2: allow nilfs_clear_inode to clear metadata file inodes
    nilfs2: get rid of snapshot mount flag
    nilfs2: simplify life cycle management of nilfs object
    nilfs2: do not allocate multiple super block instances for a device
    ...

    Linus Torvalds
     
  • Insert sparse annotations to fix the following sparse warning.

    fs/nilfs2/segment.c:2681:3: warning: context imbalance in 'nilfs_segctor_kill_thread' - unexpected unlock

    nilfs_segctor_kill_thread is only called with the sc_state_lock held.
    sparse doesn't detect this context and warns of an "unexpected unlock".
    __acquires/__releases pretend to lock/unlock the sc_state_lock for sparse.

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • Make nilfs_dat_commit_free and nilfs_inode_cachep static
    to fix the following warnings

    fs/nilfs2/super.c:72:19: warning: symbol 'nilfs_inode_cachep' was not declared. Should it be static?
    fs/nilfs2/dat.c:106:6: warning: symbol 'nilfs_dat_commit_free' was not declared. Should it be static?

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • Nilfs now can use sb->s_bdi to get backing_dev_info, so we use it
    instead of ns_bdi on the nilfs object and remove ns_bdi.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • Nilfs hasn't supported the freeze/thaw feature because it didn't work
    due to the peculiar design that multiple super block instances could
    be allocated for a device. This limitation was removed by the patch
    "nilfs2: do not allocate multiple super block instances for a device".

    So now this adds the freeze/thaw support to nilfs.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The current implementation doesn't mount snapshots with checkpoint
    numbers larger than INT_MAX since it uses match_int() for parsing
    "cp=" mount option.

    This uses simple_strtoull() for the conversion to resolve the issue.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
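
The range problem described above can be shown with a small model. This is a Python sketch, not the kernel parser: `parse_cp_as_int` and `parse_cp_as_u64` are hypothetical names, and the exact failure mode of match_int() on out-of-range input is not modelled, only the fact that a signed int cannot hold a checkpoint number above INT_MAX while an unsigned 64-bit value can.

```python
INT_MAX = 2**31 - 1
U64_MAX = 2**64 - 1


def parse_cp_as_int(value):
    """Model of int-based parsing: reject anything outside 0..INT_MAX."""
    n = int(value, 10)
    if n < 0 or n > INT_MAX:
        raise ValueError("checkpoint number does not fit in int")
    return n


def parse_cp_as_u64(value):
    """Model of 64-bit parsing: accept anything in 0..U64_MAX."""
    n = int(value, 10)
    if n < 0 or n > U64_MAX:
        raise ValueError("checkpoint number does not fit in 64 bits")
    return n
```

A checkpoint number such as 4294967296 is rejected by the first function and accepted by the second.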
     
  • This finally removes own inode allocator and destructor functions for
    metadata files. Several routines, nilfs_mdt_new(),
    nilfs_mdt_new_common(), nilfs_mdt_clear(), nilfs_mdt_destroy(), and
    nilfs_alloc_inode_common() will be gone.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • Nilfs object holds a back pointer to a writable super block instance
    in nilfs->ns_writer, and this became eliminable since sb is now made
    per device and all inodes have a valid pointer to it.

    This deletes the ns_writer pointer and a reader/writer semaphore
    protecting it.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This removes a back pointer to nilfs object from nilfs_mdt_info
    structure that is attached to metadata files.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • After applying the patch that unified sb instances, the root dentry of
    snapshots can be left in dcache even after their trees are unmounted.

    The orphan root dentry/inode keeps a root object, and this causes a
    false positive in the nilfs_checkpoint_is_mounted function.

    This resolves the issue by having nilfs_checkpoint_is_mounted test
    whether the root dentry is busy or not.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This makes use of iget5_locked to allocate or get an inode for metadata
    files, in order to stop using nilfs's own inode allocator.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This applies prepared rollback function and redirect function of
    metadata file to DAT file, and eliminates GCDAT inode.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • During garbage collection (GC), the DAT file, which converts virtual block
    numbers to real block numbers, may return a disk block number that is not
    yet written to the device.

    To avoid access to unwritten blocks, the current implementation stores
    changes in the caches of GCDAT during GC and atomically commits the
    changes into the DAT file after they are written to the device.

    This patch, instead, adds a function that makes a copy of specified
    buffer and stores it in nilfs_shadow_map, and a function to get the
    backup copy as needed (nilfs_mdt_freeze_buffer and
    nilfs_mdt_get_frozen_buffer respectively).

    Before DAT changes a block number in an entry block, it makes a copy and
    redirects access to the buffer so that the address conversion function
    (i.e. nilfs_dat_translate) refers to the old address saved in the
    copy.

    This patch provides the prerequisites for such redirection.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This adds an optional function to metadata files which makes a copy of
    the bmap, page caches, and b-tree node cache, and rolls back to the copy
    as needed.

    This enhancement is intended to replace the gcdat inode, which provides a
    similar function in a different way.

    In this patch, the nilfs_shadow_map structure is added to store a copy of
    the foregoing states. nilfs_mdt_setup_shadow_map relates this
    structure to a metadata file. nilfs_mdt_save_to_shadow_map() and
    nilfs_mdt_restore_from_shadow_map() provide the save and restore
    functions respectively. Finally, nilfs_mdt_clear_shadow_map() clears
    the states of nilfs_shadow_map.

    The copy of b-tree node cache and page cache is made by duplicating
    only dirty pages into corresponding caches in nilfs_shadow_map. Their
    restoration is done by clearing dirty pages from original caches and
    by copying dirty pages back from nilfs_shadow_map.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
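
The save and restore scheme described above (duplicate only dirty pages, then restore by dropping dirty pages and copying the saved ones back) can be sketched with dictionaries. This is a user-space illustration only: `Page` and `ShadowMap` are hypothetical names, a dict keyed by page index stands in for a page cache, and the bmap state is left out.

```python
class Page:
    """Toy page: some data plus a dirty flag."""

    def __init__(self, data, dirty=False):
        self.data = data
        self.dirty = dirty


class ShadowMap:
    """Holds copies of the dirty pages of a cache so they can be rolled back."""

    def __init__(self):
        self.frozen = {}

    def save(self, cache):
        """Duplicate only the dirty pages of `cache`."""
        self.frozen = {
            idx: Page(page.data, True)
            for idx, page in cache.items()
            if page.dirty
        }

    def restore(self, cache):
        """Drop every dirty page from `cache`, then copy the saved pages back."""
        for idx in [i for i, page in cache.items() if page.dirty]:
            del cache[idx]
        for idx, page in self.frozen.items():
            cache[idx] = Page(page.data, True)

    def clear(self):
        """Forget the saved state."""
        self.frozen = {}
```

Clean pages are never copied. A page that was clean at save time and dirtied afterwards is simply dropped on restore in this model, on the assumption that its contents can be read again from the device.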
     
  • This adds routines to save and restore the state of bmap structure.
    The bmap state is stored in a given nilfs_bmap_store object.

    These routines will be used to roll back the state of dat inode
    without using gcdat inode.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The GC-inode no longer needs the nilfs_mdt_info structure, and there is
    no reason to treat it as a kind of metadata file.

    This stops the allocation and makes gc-inodes independent of metadata
    file routines.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi