11 Jan, 2012

3 commits

  • Nothing requires that we lock the filesystem until the root inode is
    provided.

    Also, iget5_locked() triggers a warning because we are holding the
    filesystem lock while allocating the inode, which results in a lockdep
    suspicion that we have a lock inversion against the reclaim path:

    [ 1986.896979] =================================
    [ 1986.896990] [ INFO: inconsistent lock state ]
    [ 1986.896997] 3.1.1-main #8
    [ 1986.897001] ---------------------------------
    [ 1986.897007] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [ 1986.897016] kswapd0/16 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 1986.897023] (&REISERFS_SB(s)->lock){+.+.?.}, at: [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897044] {RECLAIM_FS-ON-W} state was registered at:
    [ 1986.897050] [] mark_held_locks+0xae/0xd0
    [ 1986.897060] [] lockdep_trace_alloc+0x7d/0x91
    [ 1986.897068] [] kmem_cache_alloc+0x1a/0x93
    [ 1986.897078] [] reiserfs_alloc_inode+0x13/0x3d
    [ 1986.897088] [] alloc_inode+0x14/0x5f
    [ 1986.897097] [] iget5_locked+0x62/0x13a
    [ 1986.897106] [] reiserfs_fill_super+0x410/0x8b9
    [ 1986.897114] [] mount_bdev+0x10b/0x159
    [ 1986.897123] [] get_super_block+0x10/0x12
    [ 1986.897131] [] mount_fs+0x59/0x12d
    [ 1986.897138] [] vfs_kern_mount+0x45/0x7a
    [ 1986.897147] [] do_kern_mount+0x2f/0xb0
    [ 1986.897155] [] do_mount+0x5c2/0x612
    [ 1986.897163] [] sys_mount+0x61/0x8f
    [ 1986.897170] [] sysenter_do_call+0x12/0x32
    [ 1986.897181] irq event stamp: 7509691
    [ 1986.897186] hardirqs last enabled at (7509691): [] kmem_cache_alloc+0x6e/0x93
    [ 1986.897197] hardirqs last disabled at (7509690): [] kmem_cache_alloc+0x24/0x93
    [ 1986.897209] softirqs last enabled at (7508896): [] __do_softirq+0xee/0xfd
    [ 1986.897222] softirqs last disabled at (7508859): [] do_softirq+0x50/0x9d
    [ 1986.897234]
    [ 1986.897235] other info that might help us debug this:
    [ 1986.897242] Possible unsafe locking scenario:
    [ 1986.897244]
    [ 1986.897250] CPU0
    [ 1986.897254] ----
    [ 1986.897257] lock(&REISERFS_SB(s)->lock);
    [ 1986.897265]
    [ 1986.897269] lock(&REISERFS_SB(s)->lock);
    [ 1986.897276]
    [ 1986.897277] *** DEADLOCK ***
    [ 1986.897278]
    [ 1986.897286] no locks held by kswapd0/16.
    [ 1986.897291]
    [ 1986.897292] stack backtrace:
    [ 1986.897299] Pid: 16, comm: kswapd0 Not tainted 3.1.1-main #8
    [ 1986.897306] Call Trace:
    [ 1986.897314] [] ? printk+0xf/0x11
    [ 1986.897324] [] print_usage_bug+0x20e/0x21a
    [ 1986.897332] [] ? print_irq_inversion_bug+0x172/0x172
    [ 1986.897341] [] mark_lock+0x27f/0x483
    [ 1986.897349] [] __lock_acquire+0x628/0x1472
    [ 1986.897358] [] lock_acquire+0x47/0x5e
    [ 1986.897366] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897384] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897397] [] mutex_lock_nested+0x35/0x26f
    [ 1986.897409] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897421] [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897433] [] map_block_for_writepage+0xc9/0x590
    [ 1986.897448] [] ? create_empty_buffers+0x33/0x8f
    [ 1986.897461] [] ? get_parent_ip+0xb/0x31
    [ 1986.897472] [] ? sub_preempt_count+0x81/0x8e
    [ 1986.897485] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897496] [] ? get_parent_ip+0xb/0x31
    [ 1986.897508] [] reiserfs_writepage+0x1b9/0x3e7
    [ 1986.897521] [] ? clear_page_dirty_for_io+0xcb/0xde
    [ 1986.897533] [] ? trace_hardirqs_on_caller+0x108/0x138
    [ 1986.897546] [] ? trace_hardirqs_on+0xb/0xd
    [ 1986.897559] [] shrink_page_list+0x34f/0x5e2
    [ 1986.897572] [] shrink_inactive_list+0x172/0x22c
    [ 1986.897585] [] shrink_zone+0x303/0x3b1
    [ 1986.897597] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897611] [] kswapd+0x3b7/0x5f2

    The deadlock shouldn't happen: since we do that allocation in the mount
    path, the filesystem is not yet available for any reclaim. Still, the
    warning is annoying.

    To solve this, acquire the lock later, only where we need it: right before
    calling reiserfs_read_locked_inode(), which needs the lock to walk the tree.
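
    A sketch of the resulting shape in reiserfs_fill_super(), with arguments
    and error handling abbreviated (not the verbatim diff):

        root_inode = iget5_locked(s, REISERFS_ROOT_OBJECTID, reiserfs_find_actor,
                                  reiserfs_init_locked_inode, &args);
        /* error handling elided */

        /* the write lock is taken only now, for the tree walk below */
        reiserfs_write_lock(s);
        if (root_inode->i_state & I_NEW) {
                reiserfs_read_locked_inode(root_inode, &args);
                unlock_new_inode(root_inode);
        }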

    Reported-by: Knut Petersen
    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • journal_init() doesn't need the lock since no operation on the filesystem
    is involved there. journal_read() and get_list_bitmap() have yet to be
    reviewed carefully before the lock can be removed there, though. Just keep
    the lock around these two calls for safety.
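
    Schematically, the lock now brackets only those two calls inside
    journal_init(), while the rest of the setup runs unlocked. A sketch of the
    intent only (journal, bitmap and err are illustrative names, not the exact
    code):

        /* inside journal_init(): the bulk of the setup runs without the lock */
        reiserfs_write_lock(sb);
        bitmap = get_list_bitmap(sb, journal->j_current_jl);
        reiserfs_write_unlock(sb);

        /* ... more lock-free initialization ... */

        reiserfs_write_lock(sb);
        err = journal_read(sb);
        reiserfs_write_unlock(sb);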

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • In the mount path, transactions that are made before journal
    initialization don't involve the filesystem. We can delay the reiserfs
    lock until we play with the journal.

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     

10 Jan, 2012

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext2/3/4: delete unneeded includes of module.h
    ext{3,4}: Fix potential race when setversion ioctl updates inode
    udf: Mark LVID buffer as uptodate before marking it dirty
    ext3: Don't warn from writepage when readonly inode is spotted after error
    jbd: Remove j_barrier mutex
    reiserfs: Force inode evictions before umount to avoid crash
    reiserfs: Fix quota mount option parsing
    udf: Treat symlink component of type 2 as /
    udf: Fix deadlock when converting file from in-ICB one to normal one
    udf: Cleanup calling convention of inode_getblk()
    ext2: Fix error handling on inode bitmap corruption
    ext3: Fix error handling on inode bitmap corruption
    ext3: replace ll_rw_block with other functions
    ext3: NULL dereference in ext3_evict_inode()
    jbd: clear revoked flag on buffers before a new transaction started
    ext3: call ext3_mark_recovery_complete() when recovery is really needed

    Linus Torvalds
     

09 Jan, 2012

2 commits

  • This patch fixes a crash in reiserfs_delete_xattrs during umount.

    When shrink_dcache_for_umount clears the dcache from
    generic_shutdown_super, delayed evictions are forced to disk. If an
    evicted inode has extended attributes associated with it, it will
    need to walk the xattr tree to locate and remove them.

    But since shrink_dcache_for_umount will BUG if it encounters active
    dentries, the xattr tree must be released before it's called or it will
    crash during every umount.

    This patch forces the evictions to occur before generic_shutdown_super
    by calling shrink_dcache_sb first. The additional evictions caused
    by the removal of each associated xattr file and dir will be automatically
    handled as they're added to the LRU list.
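
    A condensed sketch of the shape of the fix in reiserfs_kill_sb() (details
    abbreviated, not the verbatim patch):

        static void reiserfs_kill_sb(struct super_block *s)
        {
                if (REISERFS_SB(s)) {
                        /* force delayed evictions now, while the xattr root
                         * dentries are still alive */
                        shrink_dcache_sb(s);

                        dput(REISERFS_SB(s)->xattr_root);
                        REISERFS_SB(s)->xattr_root = NULL;
                        dput(REISERFS_SB(s)->priv_root);
                        REISERFS_SB(s)->priv_root = NULL;
                }
                kill_block_super(s);
        }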

    CC: reiserfs-devel@vger.kernel.org
    CC: stable@kernel.org
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara

    Jeff Mahoney
     
  • When the jqfmt mount option is not specified on remount, we mistakenly
    clear the s_jquota_fmt value stored in the superblock. Fix the problem.
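
    The intent of the fix, schematically (not the verbatim hunk; qfmt stands
    for the format value parsed from the mount options, 0 when jqfmt was
    absent):

        /* only overwrite the stored journaled-quota format when a jqfmt=
         * option was actually given on this (re)mount */
        if (qfmt)
                REISERFS_SB(s)->s_jquota_fmt = qfmt;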

    CC: stable@kernel.org
    CC: reiserfs-devel@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Jan, 2012

2 commits


04 Jan, 2012

1 commit

  • Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
    it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
    the cost of taking it into inode_init_always() will be negligible for pipes
    and sockets and negative for everything else. Not to mention the removal of
    boilerplate code from ->destroy_inode() instances...

    Signed-off-by: Al Viro

    Al Viro
     

21 Jul, 2011

1 commit


27 May, 2011

1 commit

  • Tell the filesystem whether we just updated the timestamps (I_DIRTY_SYNC)
    or anything else, so that the filesystem can track internally whether it
    needs to push out a transaction for fdatasync or not.

    This is just the prototype change with no user for it yet. I plan
    to push large XFS changes for the next merge window, and getting
    this trivial infrastructure in this window would help a lot to avoid
    tree interdependencies.

    Also remove the incorrect comments stating that ->dirty_inode can't block.
    That was changed a long time ago, and many implementations rely on it.
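
    The prototype change amounts to an extra flags argument on the superblock
    operation, roughly:

        /* in struct super_operations */
        void (*dirty_inode)(struct inode *inode, int flags);

        /* __mark_inode_dirty() passes its I_DIRTY_* flags down; a bare
         * I_DIRTY_SYNC means only timestamps changed */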

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

31 Mar, 2011

1 commit


13 Jan, 2011

1 commit

  • As Al Viro pointed out, path resolution during Q_QUOTAON quotactl calls
    is prone to deadlocks. We hold the s_umount semaphore for reading during
    the path resolution, and the resolution itself may need to acquire the
    semaphore for writing when e.g. an autofs mountpoint is passed.

    Solve the problem by performing the resolution before we get hold of the
    superblock (and thus the s_umount semaphore). The whole thing is complicated
    by the fact that some filesystems (OCFS2) ignore the path argument. So, to
    distinguish between filesystems which want the path and those which do not,
    we introduce a new .quota_on_meta callback which does not get the path.
    OCFS2 then uses this callback instead of the old .quota_on.
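
    After the change, struct quotactl_ops carries both variants; a sketch of
    the two relevant members as I read this description (abbreviated):

        /* path-based variant: the VFS resolves the path before taking
         * s_umount and passes it down */
        int (*quota_on)(struct super_block *sb, int type, int format_id,
                        struct path *path);

        /* path-less variant for filesystems (OCFS2) that ignore the path */
        int (*quota_on_meta)(struct super_block *sb, int type, int format_id);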

    CC: Al Viro
    CC: Christoph Hellwig
    CC: Ted Ts'o
    CC: Joel Becker
    Signed-off-by: Jan Kara

    Jan Kara
     

07 Jan, 2011

1 commit

  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.
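
    For reference, the freeing path after this change goes through call_rcu();
    a sketch close to the generic VFS side (filesystems with their own
    destroy_inode do the equivalent with their own inode cache):

        static void i_callback(struct rcu_head *head)
        {
                struct inode *inode = container_of(head, struct inode, i_rcu);
                kmem_cache_free(inode_cachep, inode);
        }

        static void destroy_inode(struct inode *inode)
        {
                /* ... usual teardown ... */
                call_rcu(&inode->i_rcu, i_callback);
        }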

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Nov, 2010

1 commit


29 Oct, 2010

1 commit


10 Aug, 2010

2 commits


27 May, 2010

1 commit

  • When quota has been suspended on remount-ro, finish_unfinished() will try
    to turn it on again (which fails) and will also turn the quotas off on exit.
    Fix the function to check whether quotas are already on at function entry
    and not turn them off in that case.

    CC: reiserfs-devel@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     

24 May, 2010

5 commits

  • Follow the dquot_* style used elsewhere in dquot.c.

    [Jan Kara: Fixed up missing conversion of ext2]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Remount handling has fully moved into the filesystem, so all of this is
    superfluous now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently the VFS calls into the quotactl interface for unmounting
    filesystems. This means filesystems with their own quota handling
    can't easily distinguish between a user-space-originated quotaoff
    and an unmount. Instead, move the responsibility for the unmount handling
    into the filesystem to be consistent with all other dquot handling.

    Note that we do call dquot_disable a lot later now, e.g. after
    a sync_filesystem. But this is fine as the quota code does all its
    writes via blockdev's mapping and that is synced even later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Instead of having wrappers in the VFS namespace export the dquot_suspend
    and dquot_resume helpers directly. Also rename vfs_quota_disable to
    dquot_disable while we're at it.

    [Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]
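
    The exported suspend helper ends up as a thin inline over dquot_disable(),
    roughly:

        /* include/linux/quotaops.h (sketch) */
        static inline int dquot_suspend(struct super_block *sb, int type)
        {
                return dquot_disable(sb, type, DQUOT_SUSPENDED);
        }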

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently do_remount_sb calls into the dquot code to tell it about going
    from rw to ro and ro to rw. Move this code into the filesystem to
    not depend on the dquot code in the VFS - note ocfs2 already ignores
    these calls and handles remount by itself. This gets rid of overloading
    the quotactl calls and allows us to unify the VFS and XFS codepaths in
    that area later.
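
    The per-filesystem pattern that replaces the VFS hook then looks roughly
    like this inside a ->remount_fs implementation (a sketch; err is just a
    local status variable):

        if (*flags & MS_RDONLY)
                err = dquot_suspend(sb, -1);    /* going rw -> ro */
        else
                err = dquot_resume(sb, -1);     /* going ro -> rw */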

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

05 Apr, 2010

1 commit


31 Mar, 2010

1 commit

  • Commit 8ebc423238341b52912c7295b045a32477b33f09 (reiserfs: kill-the-BKL)
    introduced a bug in the mount failure case.

    The error label releases the lock before calling journal_release_error,
    but journal_release_error requires that the lock be held. do_journal_release
    unlocks and retakes it; when it releases the lock without it being held, we
    trigger a BUG().

    The error_alloc label skips the unlock since the lock isn't held yet,
    but none of the other conditions that need cleaning up exist yet either.

    This patch returns immediately after the kzalloc failure and moves
    the reiserfs_write_unlock after the journal_release_error call.
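
    Schematically, the two changes in reiserfs_fill_super() (a sketch of the
    ordering only, not the verbatim diff):

        sbi = kzalloc(sizeof(struct reiserfs_sb_info), GFP_KERNEL);
        if (!sbi)
                return -ENOMEM;         /* nothing allocated, nothing locked */
        /* ... */
    error:
        journal_release_error(NULL, s); /* wants the write lock held ... */
        reiserfs_write_unlock(s);       /* ... so drop it only afterwards */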

    This was reported in https://bugzilla.novell.com/show_bug.cgi?id=591807

    Reported-by: Thomas Siedentopf
    Signed-off-by: Jeff Mahoney
    Cc: Thomas Siedentopf
    Cc: Andrew Morton
    Cc: 2.6.33.x
    Signed-off-by: Frederic Weisbecker

    Jeff Mahoney
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

05 Mar, 2010

6 commits

  • Get rid of the initialize dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_initialize helper to __dquot_initialize
    and vfs_dq_init to dquot_initialize to have a consistent namespace.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the drop dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_drop helper to __dquot_drop
    and vfs_dq_drop to dquot_drop to have a consistent namespace.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently clear_inode calls vfs_dq_drop directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the drop inside the ->clear_inode
    superblock operation.
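
    The per-filesystem pattern becomes a direct call from the superblock
    operation; a sketch using a hypothetical foofs (not code from the patch):

        static void foofs_clear_inode(struct inode *inode)
        {
                dquot_drop(inode);      /* formerly done by the VFS clear_inode() */
                /* ... filesystem-specific teardown ... */
        }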

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the transfer dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_transfer helper to __dquot_transfer
    and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
    and make the new dquot_transfer return a normal negative errno value
    which all callers expect.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the alloc_inode and free_inode dquot operations - they are
    always called from the filesystem and if a filesystem really needs
    their own (which none currently does) it can just call into its
    own routine directly.

    Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
    call the low-level dquot_alloc_inode / dquot_free_inode routines
    directly, which now lose the number argument which is always 1.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Get rid of the alloc_space, free_space, reserve_space, claim_space and
    release_rsv dquot operations - they are always called from the filesystem
    and if a filesystem really needs their own (which none currently does)
    it can just call into its own routine directly.

    Move shared logic into the common __dquot_alloc_space,
    dquot_claim_space_nodirty and __dquot_free_space low-level methods,
    and rationalize the wrappers around it to move as much code as possible
    into the common block for CONFIG_QUOTA vs not. Also rename
    all these helpers to be named dquot_* instead of vfs_dq_*.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

16 Dec, 2009

1 commit

  • /proc/fs/reiserfs/version is in the way of removing the ->read_proc
    interface. It's empty, however, so simply remove it instead of doing a dummy
    conversion. It's hard to see what information userspace could extract from
    an empty file.

    Signed-off-by: Alexey Dobriyan
    Cc: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 Dec, 2009

1 commit

  • …t/frederic/random-tracing

    * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: (31 commits)
    kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()
    kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()
    kill-the-bkl/reiserfs: definitely drop the bkl from reiserfs_ioctl()
    kill-the-bkl/reiserfs: always lock the ioctl path
    kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency
    kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency
    kill-the-bkl/reiserfs: panic in case of lock imbalance
    kill-the-bkl/reiserfs: fix recursive reiserfs write lock in reiserfs_commit_write()
    kill-the-bkl/reiserfs: fix recursive reiserfs lock in reiserfs_mkdir()
    kill-the-bkl/reiserfs: fix "reiserfs lock" / "inode mutex" lock inversion dependency
    kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock
    kill-the-bkl/reiserfs: acquire the inode mutex safely
    kill-the-bkl/reiserfs: unlock only when needed in search_by_key
    kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe
    kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end()
    kill-the-bkl/reiserfs: reduce number of contentions in search_by_key()
    kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup()
    kill-the-bkl/reiserfs: lock only once on reiserfs_get_block()
    kill-the-bkl/reiserfs: conditionaly release the write lock on fs_changed()
    kill-the-BKL/reiserfs: add reiserfs_cond_resched()
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits


14 Sep, 2009

2 commits

  • Impact: fix a deadlock

    reiserfs_dirty_inode() is the super_operations::dirty_inode() callback
    of reiserfs. It can be called from different contexts where the write
    lock may already be held.

    But this function also grabs the write lock (possibly recursively).
    A subsequent release of the lock before sleeping will not actually release
    it if the caller of mark_inode_dirty() (which in turn calls
    reiserfs_dirty_inode()) already owns the lock.

    A typical case:

    reiserfs_write_end() {
        acquire_write_lock()
        mark_inode_dirty() {
            reiserfs_dirty_inode() {
                reacquire_write_lock() {
                    journal_begin() {
                        do_journal_begin_r() {
                            /*
                             * fail to release, still
                             * one depth of lock
                             */
                            release_write_lock()
                            reiserfs_wait_on_write_block() {
                                wait_event()

    The event is usually provided by something which needs the write lock, but
    the lock hasn't been released.

    We use reiserfs_write_lock_once() here to ensure we take the write lock
    at only a single depth.
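
    A sketch of the pattern this introduces in reiserfs_dirty_inode() (the
    journal work in the middle is elided):

        int lock_depth;

        lock_depth = reiserfs_write_lock_once(inode->i_sb);
        /* ... journal_begin() / journal_end() work ... */
        reiserfs_write_unlock_once(inode->i_sb, lock_depth);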

    Signed-off-by: Frederic Weisbecker
    Cc: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    Cc: Chris Mason
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This patch is an attempt to remove the Bkl-based locking scheme from
    reiserfs.

    It is a bit inspired by an old attempt from Peter Zijlstra:

    http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html

    The bkl is heavily used in this filesystem to prevent concurrent write
    accesses on the filesystem.

    Reiserfs makes deep use of the specific properties of the Bkl:

    - It can be acquired recursively by the same task
    - It is released across schedule() calls and reacquired when schedule() returns

    The two properties above are a roadmap for the reiserfs write locking, so
    it's very hard to simply replace it with a common mutex.

    - We need recursion-capable locking unless we want to restructure several
    blocks of the code.
    - We need to identify the sites where the bkl was implicitly relaxed
    (schedule, wait, sync, etc...) so that we can in turn release and
    reacquire our new lock explicitly.
    Such implicit releases of the lock are often required to let other
    resource producers/consumers do their job, or we can suffer unexpected
    starvations or deadlocks.

    So the new lock that replaces the bkl here is a per-superblock mutex with a
    specific property: it can be acquired recursively by the same task, like the
    bkl.

    For this purpose, we add a lock owner and a lock depth field to the
    superblock information structure.
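
    A sketch of what the recursive wrapper looks like (simplified relative to
    the real code; field names are illustrative):

        void reiserfs_write_lock(struct super_block *s)
        {
                struct reiserfs_sb_info *sb_i = REISERFS_SB(s);

                if (sb_i->lock_owner != current) {
                        mutex_lock(&sb_i->lock);
                        sb_i->lock_owner = current;
                }
                /* only the lock owner ever touches lock_depth */
                sb_i->lock_depth++;
        }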

    The first axis of this patch is to turn the reiserfs_write_(un)lock()
    functions into wrappers that manage this mutex. Also some explicit calls to
    lock_kernel() have been converted to reiserfs_write_lock() helpers.

    The second axis is to find the important blocking sites (schedule...(),
    wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
    release of the write lock at these locations before blocking. Then we can
    safely wait for those who can give us resources or those who need some.
    Typically this is a fight between the current writer, the reiserfs workqueue
    (aka the async committer) and the pdflush threads.

    The third axis is a consequence of the second. The write lock is usually
    on top of a lock dependency chain which can include the journal lock, the
    flush lock or the commit lock. So it's dangerous to release and then try to
    reacquire the write lock while we still hold other locks.

    This is fine with the bkl:

        T1                              T2

        lock_kernel()
        mutex_lock(A)
        unlock_kernel()
        // do something
                                        lock_kernel()
                                        mutex_lock(A) -> already locked by T1
                                        schedule() (and then unlock_kernel())
        lock_kernel()
        mutex_unlock(A)
        ....

    This is not fine with a mutex:

        T1                              T2

        mutex_lock(write)
        mutex_lock(A)
        mutex_unlock(write)
        // do something
                                        mutex_lock(write)
                                        mutex_lock(A) -> already locked by T1
                                        schedule()

        mutex_lock(write) -> already locked by T2
        deadlock

    The solution in this patch is to provide a helper which releases the write
    lock and sleeps a bit if we can't lock a mutex that depends on it. It's
    another simulation of the bkl behaviour.
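
    A sketch of how such a helper can look (the exact in-tree body differs;
    reiserfs_mutex_lock_safe is the name used later in this series):

        static inline void
        reiserfs_mutex_lock_safe(struct mutex *m, struct super_block *s)
        {
                while (!mutex_trylock(m)) {
                        reiserfs_write_unlock(s);
                        schedule_timeout_uninterruptible(1);
                        reiserfs_write_lock(s);
                }
        }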

    The last axis is to locate the fs callbacks that are called with the bkl
    held, according to Documentation/filesystems/Locking.

    Those are:

    - reiserfs_remount
    - reiserfs_fill_super
    - reiserfs_put_super

    Reiserfs didn't need to lock explicitly in these callbacks before, because
    they were called with the bkl already held. But now we must take care of
    that with the new locking.

    After this patch, reiserfs suffers from a slight performance regression (for now).
    On UP, a high volume write with dd reports an average of 27 MB/s instead
    of 30 MB/s without the patch applied.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Ingo Molnar
    Cc: Jeff Mahoney
    Cc: Peter Zijlstra
    Cc: Bron Gondwana
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Alexander Viro
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

09 Jul, 2009

1 commit

  • Fix various silly problems wrt mnt_namespace.h:

    - exit_mnt_ns() isn't used, remove it
    - done that, sched.h and nsproxy.h inclusions aren't needed
    - mount.h inclusion was needed for vfsmount_lock, but no longer
    - remove mnt_namespace.h inclusion from files which don't use anything
    from mnt_namespace.h

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan