18 Dec, 2009

2 commits

  • It can happen that a write does not use all the blocks allocated in
    write_begin, either because of a filesystem error (like ENOSPC) or
    because the page with the data to write has been removed from memory.
    We truncate these blocks so that we don't leave dangling blocks
    beyond i_size.
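
    As a rough illustration of the cleanup pattern (the helper name is
    hypothetical; vmtruncate() was the era's interface for trimming
    blocks allocated past i_size):

        #include <linux/fs.h>
        #include <linux/mm.h>

        /* Sketch: write_begin may have allocated blocks up to 'to'; if
         * the copy fell short, trim everything beyond the current
         * i_size so no dangling blocks remain. */
        static void example_write_failed(struct inode *inode, loff_t to)
        {
                if (to > inode->i_size)
                        vmtruncate(inode, inode->i_size);
        }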

    Cc: Jeff Mahoney
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • This reverts commit e4c570c4cb7a95dbfafa3d016d2739bf3fdfe319, as
    requested by Alexey:

    "I think I gave a good enough arguments to not merge it.
    To iterate:
    * patch makes impossible to start using ext3 on EXT3_FS=n kernels
    without reboot.
    * this is done only for one pointer on task_struct"

    None of config options which define task_struct are tristate directly
    or effectively."

    Requested-by: Alexey Dobriyan
    Acked-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2009

2 commits

  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (38 commits)
    direct I/O fallback sync simplification
    ocfs: stop using do_sync_mapping_range
    cleanup blockdev_direct_IO locking
    make generic_acl slightly more generic
    sanitize xattr handler prototypes
    libfs: move EXPORT_SYMBOL for d_alloc_name
    vfs: force reval of target when following LAST_BIND symlinks (try #7)
    ima: limit imbalance msg
    Untangling ima mess, part 3: kill dead code in ima
    Untangling ima mess, part 2: deal with counters
    Untangling ima mess, part 1: alloc_file()
    O_TRUNC open shouldn't fail after file truncation
    ima: call ima_inode_free ima_inode_free
    IMA: clean up the IMA counts updating code
    ima: only insert at inode creation time
    ima: valid return code from ima_inode_alloc
    fs: move get_empty_filp() deffinition to internal.h
    Sanitize exec_permission_lite()
    Kill cached_lookup() and real_lookup()
    Kill path_lookup_open()
    ...

    Trivial conflicts in fs/direct-io.c

    Linus Torvalds
     
  • Add a flags argument to struct xattr_handler and pass it to all xattr
    handler methods. This allows using the same methods for multiple
    handlers, e.g. for the ACL methods which perform exactly the same action
    for the access and default ACLs, just using a different underlying
    attribute. With a little more groundwork it'll also allow sharing the
    methods for the regular user/trusted/secure handlers in extN, ocfs2 and
    jffs2 like it's already done for xfs in this patch.

    Also change the inode argument to the handlers to a dentry, to allow
    using the handler mechanism for filesystems that require it later,
    e.g. cifs.
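
    Roughly, the reworked handler interface looks like this after the
    patch (the comments are added here for illustration):

        struct xattr_handler {
                const char *prefix;
                int flags;      /* fs-private flags passed back to the methods */
                size_t (*list)(struct dentry *dentry, char *list,
                               size_t list_size, const char *name,
                               size_t name_len, int handler_flags);
                int (*get)(struct dentry *dentry, const char *name,
                           void *buffer, size_t size, int handler_flags);
                int (*set)(struct dentry *dentry, const char *name,
                           const void *buffer, size_t size, int flags,
                           int handler_flags);
        };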

    [with GFS2 bits updated by Steven Whitehouse ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Joel Becker
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Dec, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
    tree-wide: fix misspelling of "definition" in comments
    reiserfs: fix misspelling of "journaled"
    doc: Fix a typo in slub.txt.
    inotify: remove superfluous return code check
    hdlc: spelling fix in find_pvc() comment
    doc: fix regulator docs cut-and-pasteism
    mtd: Fix comment in Kconfig
    doc: Fix IRQ chip docs
    tree-wide: fix assorted typos all over the place
    drivers/ata/libata-sff.c: comment spelling fixes
    fix typos/grammos in Documentation/edac.txt
    sysctl: add missing comments
    fs/debugfs/inode.c: fix comment typos
    sgivwfb: Make use of ARRAY_SIZE.
    sky2: fix sky2_link_down copy/paste comment error
    tree-wide: fix typos "couter" -> "counter"
    tree-wide: fix typos "offest" -> "offset"
    fix kerneldoc for set_irq_msi()
    spidev: fix double "of of" in comment
    comment typo fix: sybsystem -> subsystem
    ...

    Linus Torvalds
     

09 Dec, 2009

1 commit

  • …t/frederic/random-tracing

    * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: (31 commits)
    kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()
    kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()
    kill-the-bkl/reiserfs: definitely drop the bkl from reiserfs_ioctl()
    kill-the-bkl/reiserfs: always lock the ioctl path
    kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency
    kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency
    kill-the-bkl/reiserfs: panic in case of lock imbalance
    kill-the-bkl/reiserfs: fix recursive reiserfs write lock in reiserfs_commit_write()
    kill-the-bkl/reiserfs: fix recursive reiserfs lock in reiserfs_mkdir()
    kill-the-bkl/reiserfs: fix "reiserfs lock" / "inode mutex" lock inversion dependency
    kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock
    kill-the-bkl/reiserfs: acquire the inode mutex safely
    kill-the-bkl/reiserfs: unlock only when needed in search_by_key
    kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe
    kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end()
    kill-the-bkl/reiserfs: reduce number of contentions in search_by_key()
    kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup()
    kill-the-bkl/reiserfs: lock only once on reiserfs_get_block()
    kill-the-bkl/reiserfs: conditionaly release the write lock on fs_changed()
    kill-the-BKL/reiserfs: add reiserfs_cond_resched()
    ...

    Linus Torvalds
     

15 Oct, 2009

3 commits

  • We had a watchdog in _get_block_create_0() that jumped to a fixup retry
    path in case the bkl got relaxed while calling kmap().
    This is not necessary anymore since we now have a reiserfs lock that is
    not implicitly relaxed while sleeping.

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • The reiserfs ioctl path doesn't need the big kernel lock anymore, now
    that the filesystem synchronizes through its own lock.

    We can then turn reiserfs_ioctl() into an unlocked_ioctl callback.
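
    Sketched, the wiring looks something like this (fops name and field
    list abbreviated; only the .unlocked_ioctl hookup is the point):

        #include <linux/fs.h>

        /* unlocked_ioctl prototype: the VFS takes no BKL around it */
        long reiserfs_ioctl(struct file *filp, unsigned int cmd,
                            unsigned long arg);

        static const struct file_operations example_reiserfs_fops = {
                .unlocked_ioctl = reiserfs_ioctl,
                /* ... read, write, mmap, etc. ... */
        };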

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Reiserfs uses the ioctl callback for its file operations, which means
    that its ioctl path is still locked by the bkl; this used to
    synchronize with the rest of the filesystem operations. We have
    changed that by locking it with the new reiserfs lock, but we do that
    only from the compat_ioctl callback.

    Fix that by locking reiserfs_ioctl() every time.

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

05 Oct, 2009

1 commit

  • While creating the reiserfs workqueue during journal initialization,
    we are holding the reiserfs lock, but create_workqueue() also takes
    the cpu_add_remove_lock, creating the following dependency:

    - reiserfs lock -> cpu_add_remove_lock

    But we also have the following existing dependencies:

    - mm->mmap_sem -> reiserfs lock
    - cpu_add_remove_lock -> cpu_hotplug.lock -> slub_lock -> sysfs_mutex

    The merged dependency chain then becomes:

    - mm->mmap_sem -> reiserfs lock -> cpu_add_remove_lock ->
    cpu_hotplug.lock -> slub_lock -> sysfs_mutex

    But when we fill a dir entry in sysfs_readir(), we are holding the
    sysfs_mutex and we also might fault while copying the directory entry
    to the user, leading to the following dependency:

    - sysfs_mutex -> mm->mmap_sem

    The end result is then a lock inversion between sysfs_mutex and
    mm->mmap_sem, as reported in the following lockdep warning:

    [ INFO: possible circular locking dependency detected ]
    2.6.31-07095-g25a3912 #4
    -------------------------------------------------------
    udevadm/790 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x72/0xc0

    but task is already holding lock:
    (sysfs_mutex){+.+.+.}, at: [] sysfs_readdir+0x7c/0x260

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (sysfs_mutex){+.+.+.}:
    [...]

    -> #4 (slub_lock){+++++.}:
    [...]

    -> #3 (cpu_hotplug.lock){+.+.+.}:
    [...]

    -> #2 (cpu_add_remove_lock){+.+.+.}:
    [...]

    -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
    [...]

    -> #0 (&mm->mmap_sem){++++++}:
    [...]

    This can be fixed by relaxing the reiserfs lock while creating the
    workqueue. It is fine to relax the lock here: we just keep it around
    to satisfy the reiserfs lock checks and for paranoid reasons.
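
    A sketch of the fix, assuming a journal-init-style context where the
    per-superblock write lock is held (function name hypothetical):

        #include <linux/fs.h>
        #include <linux/workqueue.h>
        #include <linux/reiserfs_fs.h>  /* reiserfs_write_(un)lock() */

        static struct workqueue_struct *example_init_commit_wq(struct super_block *sb)
        {
                struct workqueue_struct *wq;

                /* Never hold the reiserfs lock across create_workqueue(),
                 * which takes cpu_add_remove_lock. */
                reiserfs_write_unlock(sb);
                wq = create_workqueue("reiserfs");
                reiserfs_write_lock(sb);
                return wq;
        }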

    Reported-by: Alexander Beregalov
    Tested-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     

17 Sep, 2009

1 commit

  • Alexander Beregalov reported the following warning:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.31-03149-gdcc030a #1
    -------------------------------------------------------
    udevadm/716 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x4a/0xa0

    but task is already holding lock:
    (sysfs_mutex){+.+.+.}, at: [] sysfs_readdir+0x5a/0x200

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (sysfs_mutex){+.+.+.}:
    [...]

    -> #2 (&bdev->bd_mutex){+.+.+.}:
    [...]

    -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
    [...]

    -> #0 (&mm->mmap_sem){++++++}:
    [...]

    On the reiserfs mount path, we take the reiserfs lock and, while
    initializing the journal, we open the device, taking the
    bdev->bd_mutex. Then rescan_partition() may signal the change
    to sysfs.

    We then have the following dependency:

    reiserfs_lock -> bd_mutex -> sysfs_mutex

    Later, when entering reiserfs_readpage() after a pagefault in an
    mmapped reiserfs file, we are holding mm->mmap_sem, and we are going
    to take the reiserfs lock too.
    We then have the following dependency:

    mm->mmap_sem -> reiserfs_lock

    which, expanded with the previous dependency gives us:

    mm->mmap_sem -> reiserfs_lock -> bd_mutex -> sysfs_mutex

    Now when entering the sysfs readdir path, we are holding the
    sysfs_mutex. And when we copy a directory entry to the user buffer,
    we might fault and then take mm->mmap_sem, which leads to the
    circular locking dependency reported.

    We can fix that by relaxing the reiserfs lock during the call to
    journal_init_dev(), which is the place where we open the mounted
    device.

    It is fine to relax the lock here because we are at the beginning of
    the reiserfs mount path and there is nothing to protect at this
    time: the journal is not initialized yet.
    We just keep this lock around for paranoid reasons.

    Reported-by: Alexander Beregalov
    Tested-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     

14 Sep, 2009

22 commits

  • Until now, trying to unlock the reiserfs write lock when the current
    task doesn't hold it led to a simple warning.
    We should actually warn and panic in this case to prevent user data
    from reaching an unstable state.

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     
  • reiserfs_commit_write() is always called with the write lock held.
    Thus the current calls to reiserfs_write_lock() in this function
    acquire the lock recursively.
    We can safely drop them.

    This also removes later false assumptions that this lock is really
    released by a call to reiserfs_write_unlock().

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     
  • reiserfs_mkdir() acquires the reiserfs lock, assuming it has been
    called from the dir inode callbacks without the lock held.

    But it can also be called from other internal sites, such as
    reiserfs_xattr_init(), which already holds the lock. This recursive
    locking leads to further wrong assumptions. For example, later calls
    to reiserfs_mutex_lock_safe() won't actually release the reiserfs
    lock while acquiring a given mutex, creating unexpected lock
    inversions.

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     
  • reiserfs_xattr_init() is called with the reiserfs write lock held,
    but if the ".reiserfs_priv" entry is not yet created, we take the
    superblock root directory inode mutex until .reiserfs_priv is
    created.

    This creates a lock dependency inversion against other sites, such
    as reiserfs_file_release(), which takes an inode mutex first and the
    reiserfs lock after.

    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     
  • When do_balance() balances the tree, a trick is performed to
    provide the ability for other tree writers/readers to check whether
    do_balance() is executing concurrently (requires CONFIG_REISERFS_CHECK).

    This is done to protect concurrent accesses to the tree. The trick
    is the following:

    When do_balance is called, a unique global variable called cur_tb
    takes a pointer to the current tree to be rebalanced.
    Once do_balance finishes its work, cur_tb takes the NULL value.

    Concurrent tree readers/writers then just have to check the value
    of cur_tb to ensure do_balance() isn't executing concurrently.
    If it is, this proves that schedule() occurred inside do_balance(),
    which then relaxed the bkl that protected the tree.

    Now that the bkl has been turned into a mutex, this check is still
    fine even though do_balance() becomes preemptible: the write lock
    will not be automatically released on schedule(), so the tree is
    still protected.

    But this is only fine if we have a single reiserfs mount point.
    Indeed, because the bkl is a global lock, it didn't allow
    concurrent executions between a tree reader/writer in one mount
    point and a do_balance() on another tree from another mount point.

    So, assuming none of these readers/writers was supposed to be
    reentrant, the check now sometimes detects false positives with
    the current per-superblock mutex, which allows this reentrancy.

    This patch keeps the concurrent tree access check but makes it
    per-superblock, so that only trees from the same mount point are
    checked for concurrent access.

    [ Impact: fix spurious panic while running several reiserfs mount-points ]
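
    A minimal sketch of what the relocated debug check looks like
    (CONFIG_REISERFS_CHECK only; the helper name, field placement in
    reiserfs_sb_info and the panic message are illustrative):

        #ifdef CONFIG_REISERFS_CHECK
        /* do_balance() publishes its tree_balance pointer in the
         * superblock info; any concurrent access to the same tree must
         * find it NULL, otherwise that tree is being rebalanced under
         * us. */
        static inline void check_tree_not_balancing(struct super_block *sb)
        {
                if (REISERFS_SB(sb)->cur_tb)
                        reiserfs_panic(sb, "PAP-example",
                                       "tree is being balanced concurrently");
        }
        #endif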

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • While searching a pathname, an inode mutex can be acquired
    in do_lookup() which calls reiserfs_lookup() which in turn
    acquires the write lock.

    On the other side reiserfs_fill_super() can acquire the write_lock
    and then call reiserfs_lookup_privroot() which can acquire an
    inode mutex (the root of the mount point).

    So we theoretically risk an AB-BA lock inversion that could lead
    to a deadlock.

    As with other lock dependencies found since the bkl-to-mutex
    conversion, the fix is to use reiserfs_mutex_lock_safe(), which
    drops the dependency on the write lock.

    [ Impact: fix a possible deadlock with reiserfs ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • search_by_key() is the site that most requires the lock.
    This is mostly because it is a very central function and also
    because it releases/reacquires the write lock at least once each
    time it is called.

    Such a release/reacquire creates a lot of contention here and
    also widens the window during which another thread can change the
    tree. When that happens, the current path searching over the tree
    must be retried from the beginning (the root), which is a wasteful
    and time-consuming recovery.

    This patch factorizes two release/reacquire sequences:

    - reading leaf node blocks
    - reading the current block

    The latter immediately follows the former.

    The whole sequence is safe as a single unlocked section because
    we check just afterwards whether the tree has changed during these
    operations.
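
    A sketch of the pattern (simplified; get_generation(), fs_changed()
    and the lock helpers are the real reiserfs ones, the function around
    them is illustrative):

        #include <linux/buffer_head.h>

        static struct buffer_head *example_read_tree_block(struct super_block *sb,
                                                           b_blocknr_t block)
        {
                struct buffer_head *bh;
                int fs_gen;

        research:
                fs_gen = get_generation(sb);

                /* single unlocked section covering the block read */
                reiserfs_write_unlock(sb);
                bh = sb_bread(sb, block);       /* may sleep on I/O */
                reiserfs_write_lock(sb);

                /* safe because we verify the tree right afterwards */
                if (bh && fs_changed(fs_gen, sb)) {
                        brelse(bh);
                        goto research;  /* the real code retries from the root */
                }
                return bh;
        }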

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • reiserfs_mutex_lock_safe() is a hack to avoid any dependency between
    an internal reiserfs mutex and the write lock; it was proposed
    to follow the old bkl logic.

    The code does the following:

        while (!mutex_trylock(m)) {
                reiserfs_write_unlock(s);
                schedule();
                reiserfs_write_lock(s);
        }

    It thereby imitates the implicit behaviour of the lock when it was
    the Bkl and didn't have such a dependency:

        mutex_lock(m) {
                if (fastpath)
                        let's go
                else {
                        wait_for_mutex() {
                                schedule() {
                                        unlock_kernel()
                                        reacquire_lock_kernel()
                                }
                        }
                }
        }

    The problem is that by using such an explicit schedule(), we don't
    benefit from the adaptive spinning on the mutex owner.

    The logic in use now is:

        reiserfs_write_unlock(s);
        mutex_lock(m);  /* -> possible adaptive spinning */
        reiserfs_write_lock(s);

    [ Impact: restore the use of adaptive spinning mutexes in reiserfs ]
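
    Wrapped up as a helper, this is essentially the following (a sketch,
    modulo details of the real fs/reiserfs code):

        static inline void
        reiserfs_mutex_lock_safe(struct mutex *m, struct super_block *s)
        {
                reiserfs_write_unlock(s);
                mutex_lock(m);  /* eligible for adaptive owner-spinning */
                reiserfs_write_lock(s);
        }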

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • reiserfs_write_end() is a hot path in reiserfs.
    It contains two wasteful write lock acquire/release sequences that
    can be merged without changing the code logic.

    This patch factorizes them into a single protected section, reducing
    the number of contentions inside.

    [ Impact: reduce lock contention in a reiserfs hotpath ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • search_by_key() is a central function in reiserfs which searches
    the path in the fs tree from the root to a node, given its key.

    It is the function that most contends for the write lock,
    because it is a very frequently used path.

    Also, we forgot to release the lock while reading the next tree
    node, making us hold the lock in a wasteful way.

    We now release the lock while reading the current node and its
    children, all in one. This should be safe because we hold a
    reference to these blocks, and even if we read a block that is
    concurrently changed, the fs_changed check done later will make us
    retry the path from the root.

    [ Impact: release the write lock while it is unused in a hot path ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • The write lock can be acquired recursively in reiserfs_lookup(). But we may
    want to *really* release the lock before possible rescheduling from a
    reiserfs_lookup() callee.

    Hence we want to acquire the lock only once (i.e. not recursively).

    [ Impact: prevent the write lock from being wrongly left held while sleeping ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • reiserfs_get_block() is one of these sites where the write lock might
    be acquired recursively.

    It's a particular problem because this function is called very
    often. It's a hot spot which needs to reschedule periodically while
    converting direct items to indirect ones, because that can take some
    time.

    So if we apply the write lock release/reacquire pattern on schedule()
    here, it may not produce the desired effect, since we may have taken
    the lock at more than one depth.

    The solution is to use reiserfs_write_lock_once() which won't try
    to reacquire the lock recursively. Then the lock will be *really*
    released before schedule().

    Also, we only release the lock if TIF_NEED_RESCHED is set, to avoid
    creating numerous wasteful contentions.

    [ Impact: fix a case where the lock was held too long in reiserfs_get_block() ]
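
    A sketch of the pattern, assuming a reiserfs_get_block()-like body
    (the *_once helpers are the ones introduced in this series; the
    function around them is illustrative):

        static int example_get_block(struct inode *inode)
        {
                struct super_block *sb = inode->i_sb;
                int depth;

                /* never recursive: returns -1 if we locked here */
                depth = reiserfs_write_lock_once(sb);

                /* ... long direct-to-indirect conversion loop ... */
                if (need_resched()) {
                        /* really drop the lock: exactly one depth held */
                        reiserfs_write_unlock_once(sb, depth);
                        schedule();
                        depth = reiserfs_write_lock_once(sb);
                }

                reiserfs_write_unlock_once(sb, depth);
                return 0;
        }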

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • flush_commit_list() uses ll_rw_block() to commit the pending log
    blocks. ll_rw_block() might sleep, and the bkl used to be released
    at this point. We can then also relax the write lock here.

    [ Impact: release the reiserfs write lock when it is not needed ]
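
    The recurring shape of these changes, sketched (a minimal example;
    the surrounding function is illustrative):

        #include <linux/buffer_head.h>

        static void example_submit_log_block(struct super_block *s,
                                             struct buffer_head *bh)
        {
                /* drop the write lock across the sleeping call, as the
                 * bkl would have been dropped implicitly */
                reiserfs_write_unlock(s);
                ll_rw_block(WRITE, 1, &bh);
                reiserfs_write_lock(s);
        }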

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap
    block. This helper might sleep.

    When the bkl was used, it was released at this point. We can then
    relax the write lock here too.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • get_neighbors() is used to get the left and/or right blocks
    adjacent to a given one in order to balance a tree.

    sb_bread() is used to read the buffers of these neighbor blocks,
    and while it waits for this operation it might sleep.

    The bkl was released at this point, so we can also release
    the write lock before calling sb_bread().

    This is safe because, if the filesystem is changed after this
    lock release, the function returns REPEAT_SEARCH (aka
    SCHEDULE_OCCURRED in the function's header comments) in order to
    repeat the neighbor search.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • prepare_for_delete_or_cut() can process several types of items,
    including indirect items, i.e. items which contain no file data but
    pointers to the unformatted nodes holding the file's data.

    In this case it has to zero out these pointers to the block numbers
    of unformatted nodes and release those blocks in the bitmap.

    This can take some time, so a reschedule is performed between each
    block processed. We can safely release the write lock while
    rescheduling, like the bkl did, because the code checks just
    afterwards whether the item has moved during the sleep.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • When do_journal_end() copies data to the journal block buffers in
    memory, it reschedules if needed between each block copied and
    dirtied.

    We can also release the write lock at this rescheduling point,
    as the bkl did implicitly.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Impact: fix a deadlock

    reiserfs_dirty_inode() is the super_operations::dirty_inode()
    callback of reiserfs. It can be called from different contexts where
    the write lock may already be held.

    But this function also grabs the write lock (possibly recursively).
    A subsequent release of the lock before sleeping will then not
    actually release it if the caller of mark_inode_dirty() (which in
    turn calls reiserfs_dirty_inode()) already owns the lock.

    A typical case:

        reiserfs_write_end() {
           acquire_write_lock()
           mark_inode_dirty() {
              reiserfs_dirty_inode() {
                 reacquire_write_lock() {
                    journal_begin() {
                       do_journal_begin_r() {
                          /*
                           * fail to release, still
                           * one depth of lock
                           */
                          release_write_lock()
                          reiserfs_wait_on_write_block() {
                             wait_event()

    The event is usually provided by something which itself needs the
    write lock, but that lock hasn't been released.

    We use reiserfs_write_lock_once() here to ensure we only grab the
    write lock at a single depth.

    Signed-off-by: Frederic Weisbecker
    Cc: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    Cc: Chris Mason
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Impact: fix a deadlock

    reiserfs_truncate_file() can be called from multiple contexts where
    the write lock may or may not already be held.

    This function also acquires (possibly recursively) the write
    lock. Subsequent releases before sleeping will then not actually
    release the lock because we may be at more than one lock depth.

    A typical case is:

        reiserfs_file_release() {
           acquire_the_lock()
           reiserfs_truncate_file() {
              reacquire_the_lock()
              journal_begin() {
                 do_journal_begin_r() {
                    reiserfs_wait_on_write_block() {
                       /*
                        * Not released because still one
                        * depth owned
                        */
                       release_lock()
                       wait_for_event()

    At this stage the event never happens, because whatever provides
    it needs the write lock.

    We use reiserfs_write_lock_once() here to ensure that we don't acquire the
    write lock recursively.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    Cc: Alexander Beregalov
    Cc: Chris Mason
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Sometimes we don't want to recursively hold the per superblock write
    lock because we want to be sure it is actually released when we come
    to sleep.

    This patch introduces the necessary tools for that.

    reiserfs_write_lock_once() does the same job as reiserfs_write_lock(),
    except that it won't try to acquire the lock recursively if the
    current task already owns it. It also returns the lock depth as it
    was before the call.

    reiserfs_write_unlock_once() unlocks only if reiserfs_write_lock_once()
    returned a depth equal to -1, i.e. only if it actually locked.
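
    A usage sketch (the helpers are the ones this patch introduces; the
    function around them is illustrative):

        static void example_locked_section(struct super_block *sb)
        {
                int lock_depth;

                /* returns the pre-call depth: -1 means we locked here */
                lock_depth = reiserfs_write_lock_once(sb);

                /* ... work that must be able to *really* release the
                 * lock before sleeping ... */

                /* releases only if this frame took the lock (depth == -1) */
                reiserfs_write_unlock_once(sb, lock_depth);
        }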

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    Cc: Alexander Beregalov
    Cc: Chris Mason
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Impact: fix a deadlock

    The j_flush_mutex is acquired safely in journal.c:
    if we can't take it, we release the reiserfs per-superblock lock
    and wait a bit.

    But there is a remaining place in kupdate_transactions() where
    j_flush_mutex is still acquired traditionally. Thus the following
    scenario (warned about by lockdep) can happen:

        A                               B

        mutex_lock(&write_lock)
                                        mutex_lock(&write_lock)
        mutex_lock(&j_flush_mutex)
                                        mutex_lock(&j_flush_mutex) //block
        mutex_unlock(&write_lock)
        sleep...
        mutex_lock(&write_lock) //deadlock

    Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This patch is an attempt to remove the Bkl-based locking scheme from
    reiserfs.

    It is somewhat inspired by an old attempt from Peter Zijlstra:

    http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html

    The bkl is heavily used in this filesystem to prevent concurrent
    write accesses on the filesystem.

    Reiserfs makes deep use of the specific properties of the Bkl:

    - It can be acquired recursively by the same task
    - It is released on schedule() calls and reacquired when schedule()
    returns

    These two properties are a roadmap of the reiserfs write locking, so
    it's very hard to simply replace the Bkl with a common mutex.

    - We need a lock that can be acquired recursively, unless we want to
    restructure several blocks of the code.
    - We need to identify the sites where the bkl was implicitly relaxed
    (schedule, wait, sync, etc...) so that we can in turn release and
    reacquire our new lock explicitly.
    Such implicit releases of the lock are often required to let other
    resource producers/consumers do their job, or we can suffer
    unexpected starvations or deadlocks.

    So the new lock that replaces the bkl here is a per-superblock mutex
    with a specific property: it can be acquired recursively by the same
    task, like the bkl.

    For this purpose, we add a lock owner and a lock depth field to the
    superblock information structure.
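
    In simplified form, the recursive wrapper described here looks like
    this (a sketch of the lock/unlock pair, modulo details of the real
    fs/reiserfs code):

        void reiserfs_write_lock(struct super_block *s)
        {
                struct reiserfs_sb_info *sb_i = REISERFS_SB(s);

                if (sb_i->lock_owner != current) {
                        mutex_lock(&sb_i->lock);
                        sb_i->lock_owner = current;
                }
                /* only the owner touches the depth, no protection needed */
                sb_i->lock_depth++;
        }

        void reiserfs_write_unlock(struct super_block *s)
        {
                struct reiserfs_sb_info *sb_i = REISERFS_SB(s);

                /* drop the mutex only when the outermost level unlocks */
                if (--sb_i->lock_depth == -1) {
                        sb_i->lock_owner = NULL;
                        mutex_unlock(&sb_i->lock);
                }
        }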

    The first axis of this patch is to turn the reiserfs_write_(un)lock()
    functions into wrappers that manage this mutex. Also, some explicit
    calls to lock_kernel() have been converted to reiserfs_write_lock()
    helpers.

    The second axis is to find the important blocking sites (schedule(),
    wait_on_buffer(), sync_dirty_buffer(), etc...) and apply an explicit
    release of the write lock at these locations before blocking. Then we
    can safely wait for those who can give us resources or those who need
    some. Typically this is a fight between the current writer, the
    reiserfs workqueue (aka the async committer) and the pdflush threads.

    The third axis is a consequence of the second. The write lock is
    usually at the top of a lock dependency chain which can include the
    journal lock, the flush lock or the commit lock, so it's dangerous to
    release the write lock and try to reacquire it while we still hold
    other locks.

    This is fine with the bkl:

        T1                              T2

        lock_kernel()
        mutex_lock(A)
        unlock_kernel()
        // do something
                                        lock_kernel()
                                        mutex_lock(A) -> already locked by T1
                                        schedule() (and then unlock_kernel())
        lock_kernel()
        mutex_unlock(A)
        ....

    This is not fine with a mutex:

        T1                              T2

        mutex_lock(write)
        mutex_lock(A)
        mutex_unlock(write)
        // do something
                                        mutex_lock(write)
                                        mutex_lock(A) -> already locked by T1
                                        schedule()
        mutex_lock(write) -> already locked by T2
        deadlock

    The solution in this patch is to provide a helper which releases the
    write lock and sleeps a bit if we can't lock a mutex that depends on
    it. It's another simulation of the bkl behaviour.

    The last axis is to locate the fs callbacks that are called with the
    bkl held, according to Documentation/filesystems/Locking.

    Those are:

    - reiserfs_remount
    - reiserfs_fill_super
    - reiserfs_put_super

    Reiserfs didn't need to lock explicitly there because of the context
    of these callbacks. But now we must take care of that with the new
    locking.

    After this patch, reiserfs suffers from a slight performance
    regression (for now). On UP, a high-volume write with dd reports an
    average of 27 MB/s instead of 30 MB/s without the patch applied.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Ingo Molnar
    Cc: Jeff Mahoney
    Cc: Peter Zijlstra
    Cc: Bron Gondwana
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Alexander Viro
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker