11 Jan, 2012

2 commits


15 Sep, 2011

1 commit


12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Let's kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

31 Mar, 2011

1 commit


01 Feb, 2011

1 commit


14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

18 Nov, 2010

1 commit


13 Nov, 2010

2 commits

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions come with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     
  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open is for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if @FMODE_EXCL is specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
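
    The atomic open-plus-claim behaviour described in the commit above can
    be modelled in userspace. This is a minimal sketch under stated
    assumptions, not the kernel implementation: struct bdev_model,
    model_init(), model_get() and model_put() are hypothetical names, a
    pthread mutex stands in for the block layer's internal locking, and the
    rule that the same holder may claim more than once is an assumption of
    the model.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a block device; not the kernel struct. */
struct bdev_model {
    pthread_mutex_t lock;   /* serializes open, claim and release */
    const void *holder;     /* current exclusive holder, NULL if none */
    int holders;            /* exclusive opens held by that holder */
    int openers;            /* all opens, exclusive or not */
};

void model_init(struct bdev_model *b)
{
    pthread_mutex_init(&b->lock, NULL);
    b->holder = NULL;
    b->holders = 0;
    b->openers = 0;
}

/* Open and, if excl is set, claim in one critical section.
 * Returns 0 on success or -EBUSY if another holder owns the device. */
int model_get(struct bdev_model *b, bool excl, const void *holder)
{
    int ret = 0;

    pthread_mutex_lock(&b->lock);
    if (excl) {
        if (b->holder != NULL && b->holder != holder) {
            ret = -EBUSY;           /* claimed by someone else */
        } else {
            b->holder = holder;     /* first claim, or same holder again */
            b->holders++;
        }
    }
    if (ret == 0)
        b->openers++;
    pthread_mutex_unlock(&b->lock);
    return ret;
}

/* Close; with excl set, also drop one exclusive claim. */
void model_put(struct bdev_model *b, bool excl)
{
    pthread_mutex_lock(&b->lock);
    if (excl && --b->holders == 0)
        b->holder = NULL;           /* last claim released */
    b->openers--;
    pthread_mutex_unlock(&b->lock);
}
```

    Because the claim check and the open happen under one lock, a second
    exclusive opener cannot slip in between "open" and "claim", which is the
    window the old decoupled interface left open.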
     

10 Sep, 2010

1 commit

  • Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
    detection for barriers. Note that reiserfs had a fairly different code
    path for barriers before as it was the only filesystem actually making use
    of them. The new code always uses the old non-barrier codepath and just
    sets the WRITE_FLUSH_FUA explicitly for the journal commits.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jan Kara
    Acked-by: Chris Mason
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Aug, 2010

1 commit

  • These flags aren't real I/O types, but tell ll_rw_block to always
    lock the buffer instead of giving up on a failed trylock.

    Instead add a new write_dirty_buffer helper that implements this semantic
    and use it from the existing SWRITE* callers. Note that the ll_rw_block
    code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
    this patch fixes.

    In the ufs code clean up the helper that used to call ll_rw_block
    to mirror sync_dirty_buffer, which is the function it implements for
    compound buffers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
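
    The difference between "give up on a failed trylock" and "always lock
    the buffer" can be illustrated with a small userspace model. This is a
    sketch only: struct buf, buf_init(), write_if_unlocked() and
    write_dirty() are hypothetical names, and unlike the kernel (where the
    buffer stays locked until the I/O completes) the model drops the lock as
    soon as the write is counted.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer head. */
struct buf {
    pthread_mutex_t lock;
    bool dirty;
    int writes;             /* how many times the buffer was "submitted" */
};

void buf_init(struct buf *b, bool dirty)
{
    pthread_mutex_init(&b->lock, NULL);
    b->dirty = dirty;
    b->writes = 0;
}

/* Trylock flavour: skip the buffer if someone else holds its lock.
 * Returns true only if the buffer was written. */
bool write_if_unlocked(struct buf *b)
{
    bool wrote;

    if (pthread_mutex_trylock(&b->lock) != 0)
        return false;       /* locked elsewhere: give up */
    wrote = b->dirty;
    if (wrote) {
        b->dirty = false;
        b->writes++;
    }
    pthread_mutex_unlock(&b->lock);
    return wrote;
}

/* Always-lock flavour: wait for the lock, then write if still dirty. */
bool write_dirty(struct buf *b)
{
    bool wrote;

    pthread_mutex_lock(&b->lock);
    wrote = b->dirty;
    if (wrote) {
        b->dirty = false;
        b->writes++;
    }
    pthread_mutex_unlock(&b->lock);
    return wrote;
}
```

    With the trylock flavour a dirty buffer can be silently left unwritten
    when it happens to be locked; the always-lock flavour only skips buffers
    that are already clean.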
     

11 Aug, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit

  • The reiserfs journal behaves inconsistently when determining whether to
    allow a mount of a read-only device.

    This is due to the use of the continue_replay variable to short circuit
    the journal scanning. If it's set, it's assumed that there are
    transactions to replay, but there may not be. If it's unset, it's assumed
    that there aren't any, and that may not be the case either.

    I've observed two failure cases:
    1) Where a clean file system on a read-only device refuses to mount
    2) Where a clean file system on a read-only device passes the
    optimization and then tries writing the journal header to update
    the latest mount id.

    The former is easily observable by using a freshly created file system on
    a read-only loopback device.

    This patch moves the check into journal_read_transaction, where it can
    bail out before it's about to replay a transaction. That way it can go
    through and skip transactions where appropriate, yet still refuse to mount
    a file system with outstanding transactions.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

28 Jan, 2010

1 commit

  • Vmalloc is called to allocate journal->j_cnode_free_list but
    we hold the reiserfs lock at this time, which raises a
    {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} lock inversion.

    Just drop the reiserfs lock at this time, as it's not even
    needed but kept for paranoid reasons.

    This fixes:

    [ INFO: inconsistent lock state ]
    2.6.33-rc5 #1
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/313 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&REISERFS_SB(s)->lock){+.+.?.}, at: []
    reiserfs_write_lock_once+0x28/0x50
    {RECLAIM_FS-ON-W} state was registered at:
    [] mark_held_locks+0x62/0x90
    [] lockdep_trace_alloc+0x9a/0xc0
    [] kmem_cache_alloc+0x26/0xf0
    [] __get_vm_area_node+0x6c/0xf0
    [] __vmalloc_node+0x7e/0xa0
    [] vmalloc+0x2b/0x30
    [] journal_init+0x6cb/0xa10
    [] reiserfs_fill_super+0x342/0xb80
    [] get_sb_bdev+0x145/0x180
    [] get_super_block+0x21/0x30
    [] vfs_kern_mount+0x40/0xd0
    [] do_kern_mount+0x39/0xd0
    [] do_mount+0x2c7/0x6d0
    [] sys_mount+0x66/0xa0
    [] mount_block_root+0xc4/0x245
    [] mount_root+0x59/0x5f
    [] prepare_namespace+0x111/0x14b
    [] kernel_init+0xcf/0xdb
    [] kernel_thread_helper+0x6/0x1c
    irq event stamp: 63236801
    hardirqs last enabled at (63236801): []
    __mutex_unlock_slowpath+0x9a/0x120
    hardirqs last disabled at (63236800): []
    __mutex_unlock_slowpath+0x39/0x120
    softirqs last enabled at (63218800): [] __do_softirq+0xc1/0x110
    softirqs last disabled at (63218789): [] do_softirq+0x4d/0x60

    other info that might help us debug this:
    2 locks held by kswapd0/313:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x24/0x170
    #1: (&type->s_umount_key#19){++++..}, at: []
    shrink_dcache_memory+0xfd/0x1a0

    stack backtrace:
    Pid: 313, comm: kswapd0 Not tainted 2.6.33-rc5 #1
    Call Trace:
    [] ? printk+0x18/0x1c
    [] print_usage_bug+0x15f/0x1a0
    [] mark_lock+0x39f/0x5a0
    [] ? trace_hardirqs_off+0xb/0x10
    [] ? check_usage_forwards+0x0/0xf0
    [] __lock_acquire+0x214/0xa70
    [] ? sched_clock_cpu+0x95/0x110
    [] lock_acquire+0x7a/0xa0
    [] ? reiserfs_write_lock_once+0x28/0x50
    [] mutex_lock_nested+0x5f/0x2b0
    [] ? reiserfs_write_lock_once+0x28/0x50
    [] ? reiserfs_write_lock_once+0x28/0x50
    [] reiserfs_write_lock_once+0x28/0x50
    [] reiserfs_delete_inode+0x50/0x140
    [] ? generic_delete_inode+0x5f/0x150
    [] ? reiserfs_delete_inode+0x0/0x140
    [] generic_delete_inode+0x9c/0x150
    [] generic_drop_inode+0x3d/0x60
    [] iput+0x47/0x50
    [] dentry_iput+0x6f/0xf0
    [] d_kill+0x24/0x50
    [] __shrink_dcache_sb+0x21d/0x2b0
    [] shrink_dcache_memory+0x12f/0x1a0
    [] shrink_slab+0x10e/0x170
    [] kswapd+0x477/0x6a0
    [] ? isolate_pages_global+0x0/0x1b0
    [] ? autoremove_wake_function+0x0/0x40
    [] ? kswapd+0x0/0x6a0
    [] kthread+0x6c/0x80
    [] ? kthread+0x0/0x80
    [] kernel_thread_helper+0x6/0x1c

    Reported-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Christian Kujau
    Cc: Chris Mason

    Frederic Weisbecker
     

02 Jan, 2010

1 commit

  • Keeping the reiserfs lock while freeing the journal on
    umount path triggers a lock inversion between bdev->bd_mutex
    and the reiserfs lock.

    We don't need the reiserfs lock at this stage. The filesystem
    is not usable anymore, and there are no more pending commits,
    everything got flushed (this operation was even done in parallel
    and didn't require the reiserfs lock from the current process).

    This fixes the following lockdep report:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.32-atom #172
    -------------------------------------------------------
    umount/3904 is trying to acquire lock:
    (&bdev->bd_mutex){+.+.+.}, at: [] __blkdev_put+0x22/0x160

    but task is already holding lock:
    (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x29/0x40

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&REISERFS_SB(s)->lock){+.+.+.}:
    [] __lock_acquire+0x11ff/0x19e0
    [] lock_acquire+0x68/0x90
    [] mutex_lock_nested+0x5b/0x340
    [] reiserfs_write_lock_once+0x29/0x50
    [] reiserfs_get_block+0x85/0x1620
    [] do_mpage_readpage+0x1f0/0x6d0
    [] mpage_readpages+0xc0/0x100
    [] reiserfs_readpages+0x19/0x20
    [] __do_page_cache_readahead+0x1bc/0x260
    [] ra_submit+0x28/0x40
    [] filemap_fault+0x40e/0x420
    [] __do_fault+0x3d/0x430
    [] handle_mm_fault+0x12e/0x790
    [] do_page_fault+0x135/0x330
    [] error_code+0x6b/0x70
    [] load_elf_binary+0x82a/0x1a10
    [] search_binary_handler+0x90/0x1d0
    [] do_execve+0x1df/0x250
    [] sys_execve+0x46/0x70
    [] syscall_call+0x7/0xb

    -> #2 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0x11ff/0x19e0
    [] lock_acquire+0x68/0x90
    [] might_fault+0x8b/0xb0
    [] copy_to_user+0x32/0x70
    [] filldir64+0xa4/0xf0
    [] sysfs_readdir+0x116/0x210
    [] vfs_readdir+0x8d/0xb0
    [] sys_getdents64+0x69/0xb0
    [] sysenter_do_call+0x12/0x32

    -> #1 (sysfs_mutex){+.+.+.}:
    [] __lock_acquire+0x11ff/0x19e0
    [] lock_acquire+0x68/0x90
    [] mutex_lock_nested+0x5b/0x340
    [] sysfs_addrm_start+0x2c/0xb0
    [] create_dir+0x40/0x90
    [] sysfs_create_dir+0x2b/0x50
    [] kobject_add_internal+0xc2/0x1b0
    [] kobject_add_varg+0x31/0x50
    [] kobject_add+0x2c/0x60
    [] device_add+0x94/0x560
    [] add_partition+0x18a/0x2a0
    [] rescan_partitions+0x33a/0x450
    [] __blkdev_get+0x12f/0x2d0
    [] blkdev_get+0xa/0x10
    [] register_disk+0x108/0x130
    [] add_disk+0xd9/0x130
    [] sd_probe_async+0x105/0x1d0
    [] async_thread+0xcf/0x230
    [] kthread+0x74/0x80
    [] kernel_thread_helper+0x7/0x3c

    -> #0 (&bdev->bd_mutex){+.+.+.}:
    [] __lock_acquire+0x18f6/0x19e0
    [] lock_acquire+0x68/0x90
    [] mutex_lock_nested+0x5b/0x340
    [] __blkdev_put+0x22/0x160
    [] blkdev_put+0xa/0x10
    [] free_journal_ram+0xd2/0x130
    [] do_journal_release+0x98/0x190
    [] journal_release+0xa/0x10
    [] reiserfs_put_super+0x36/0x130
    [] generic_shutdown_super+0x4f/0xe0
    [] kill_block_super+0x25/0x40
    [] reiserfs_kill_sb+0x7f/0x90
    [] deactivate_super+0x7a/0x90
    [] mntput_no_expire+0x98/0xd0
    [] sys_umount+0x4c/0x310
    [] sys_oldumount+0x19/0x20
    [] sysenter_do_call+0x12/0x32

    other info that might help us debug this:

    2 locks held by umount/3904:
    #0: (&type->s_umount_key#30){+++++.}, at: [] deactivate_super+0x75/0x90
    #1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x29/0x40

    stack backtrace:
    Pid: 3904, comm: umount Not tainted 2.6.32-atom #172
    Call Trace:
    [] ? printk+0x18/0x1a
    [] print_circular_bug+0xca/0xd0
    [] __lock_acquire+0x18f6/0x19e0
    [] ? free_pcppages_bulk+0x1f/0x250
    [] lock_acquire+0x68/0x90
    [] ? __blkdev_put+0x22/0x160
    [] ? __blkdev_put+0x22/0x160
    [] mutex_lock_nested+0x5b/0x340
    [] ? __blkdev_put+0x22/0x160
    [] ? mark_held_locks+0x62/0x80
    [] ? kfree+0x92/0xd0
    [] __blkdev_put+0x22/0x160
    [] ? trace_hardirqs_on+0xb/0x10
    [] blkdev_put+0xa/0x10
    [] free_journal_ram+0xd2/0x130
    [] do_journal_release+0x98/0x190
    [] journal_release+0xa/0x10
    [] reiserfs_put_super+0x36/0x130
    [] ? up_write+0x16/0x30
    [] generic_shutdown_super+0x4f/0xe0
    [] kill_block_super+0x25/0x40
    [] ? vfs_quota_off+0x0/0x20
    [] reiserfs_kill_sb+0x7f/0x90
    [] deactivate_super+0x7a/0x90
    [] mntput_no_expire+0x98/0xd0
    [] sys_umount+0x4c/0x310
    [] sys_oldumount+0x19/0x20
    [] sysenter_do_call+0x12/0x32

    Signed-off-by: Frederic Weisbecker
    Cc: Alexander Beregalov
    Cc: Chris Mason
    Cc: Ingo Molnar

    Frederic Weisbecker
     

30 Dec, 2009

1 commit

  • Commit 500f5a0bf5f0624dae34307010e240ec090e4cde
    (reiserfs: Fix possible recursive lock) fixed a vmalloc under reiserfs
    lock that triggered a lockdep warning because of a
    an IN-FS-RECLAIM / RECLAIM-FS-ON locking dependency inversion.

    But that patch omitted another vmalloc call in the same path
    that allocates the journal. Relax the lock for this one too.

    Reported-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Mason
    Cc: Ingo Molnar

    Frederic Weisbecker
     

05 Oct, 2009

1 commit

  • While creating the reiserfs workqueue during the journal
    initialization, we are holding the reiserfs lock, but
    create_workqueue() also holds the cpu_add_remove_lock, creating
    then the following dependency:

    - reiserfs lock -> cpu_add_remove_lock

    But we also have the following existing dependencies:

    - mm->mmap_sem -> reiserfs lock
    - cpu_add_remove_lock -> cpu_hotplug.lock -> slub_lock -> sysfs_mutex

    The merged dependency chain then becomes:

    - mm->mmap_sem -> reiserfs lock -> cpu_add_remove_lock ->
    cpu_hotplug.lock -> slub_lock -> sysfs_mutex

    But when we fill a dir entry in sysfs_readdir(), we are holding the
    sysfs_mutex and we also might fault while copying the directory entry
    to the user, leading to the following dependency:

    - sysfs_mutex -> mm->mmap_sem

    The end result is then a lock inversion between sysfs_mutex and
    mm->mmap_sem, as reported in the following lockdep warning:

    [ INFO: possible circular locking dependency detected ]
    2.6.31-07095-g25a3912 #4
    -------------------------------------------------------
    udevadm/790 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x72/0xc0

    but task is already holding lock:
    (sysfs_mutex){+.+.+.}, at: [] sysfs_readdir+0x7c/0x260

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (sysfs_mutex){+.+.+.}:
    [...]

    -> #4 (slub_lock){+++++.}:
    [...]

    -> #3 (cpu_hotplug.lock){+.+.+.}:
    [...]

    -> #2 (cpu_add_remove_lock){+.+.+.}:
    [...]

    -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
    [...]

    -> #0 (&mm->mmap_sem){++++++}:
    [...]

    This can be fixed by relaxing the reiserfs lock while creating the
    workqueue.
    It is fine to relax the lock here: we just keep it around to pass
    through reiserfs lock checks and for paranoid reasons.

    Reported-by: Alexander Beregalov
    Tested-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
     

17 Sep, 2009

1 commit

  • Alexander Beregalov reported the following warning:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.31-03149-gdcc030a #1
    -------------------------------------------------------
    udevadm/716 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x4a/0xa0

    but task is already holding lock:
    (sysfs_mutex){+.+.+.}, at: [] sysfs_readdir+0x5a/0x200

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (sysfs_mutex){+.+.+.}:
    [...]

    -> #2 (&bdev->bd_mutex){+.+.+.}:
    [...]

    -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
    [...]

    -> #0 (&mm->mmap_sem){++++++}:
    [...]

    On reiserfs mount path, we take the reiserfs lock and while
    initializing the journal, we open the device, taking the
    bdev->bd_mutex. Then rescan_partitions() may signal the change
    to sysfs.

    We have then the following dependency:

    reiserfs_lock -> bd_mutex -> sysfs_mutex

    Later, while entering reiserfs_readpage() after a pagefault in an
    mmaped reiserfs file, we are holding the mm->mmap_sem, and we are going
    to take the reiserfs lock too.
    We have then the following dependency:

    mm->mmap_sem -> reiserfs_lock

    which, expanded with the previous dependency gives us:

    mm->mmap_sem -> reiserfs_lock -> bd_mutex -> sysfs_mutex

    Now while entering the sysfs readdir path, we are holding the
    sysfs_mutex. And when we copy a directory entry to the user buffer, we
    might fault and then take the mm->mmap_sem lock. Which leads to the
    circular locking dependency reported.

    We can fix that by relaxing the reiserfs lock during the call to
    journal_init_dev(), which is the place where we open the mounted
    device.

    It is fine to relax the lock here because we are at the beginning of
    the reiserfs mount path and there is nothing to protect at this time:
    the journal is not initialized.
    We just keep this lock around for paranoid reasons.

    Reported-by: Alexander Beregalov
    Tested-by: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker
    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Cc: Laurent Riffard

    Frederic Weisbecker
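
    The dependency-chain reasoning in the commit above amounts to cycle
    detection in a directed graph of "lock B was taken while holding lock A"
    edges. The following is a minimal sketch of that idea, not lockdep
    itself: struct lock_graph, graph_init(), graph_add() and
    graph_would_cycle() are hypothetical names, and only the four locks
    named in the report are modelled.

```c
#include <stdbool.h>
#include <string.h>

/* The four locks from the report; NLOCKS is the count. */
enum { MMAP_SEM, REISERFS_LOCK, BD_MUTEX, SYSFS_MUTEX, NLOCKS };

struct lock_graph {
    bool edge[NLOCKS][NLOCKS];  /* edge[a][b]: b was taken while holding a */
};

void graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

void graph_add(struct lock_graph *g, int held, int taken)
{
    g->edge[held][taken] = true;
}

/* Depth-first search: can we get from 'from' to 'to' along recorded edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NLOCKS; next++)
        if (g->edge[from][next] && !seen[next] && reaches(g, next, to, seen))
            return true;
    return false;
}

/* Taking 'taken' while holding 'held' closes a cycle if 'taken' already
 * leads back to 'held' through the existing dependency chain. */
bool graph_would_cycle(const struct lock_graph *g, int held, int taken)
{
    bool seen[NLOCKS] = { false };

    return reaches(g, taken, held, seen);
}
```

    Recording mmap_sem -> reiserfs lock -> bd_mutex -> sysfs_mutex and then
    asking about sysfs_mutex -> mmap_sem reports a cycle. Dropping the
    reiserfs lock -> bd_mutex edge, which is what relaxing the lock around
    journal_init_dev() does, makes the same query come back clean.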
     

14 Sep, 2009

6 commits

  • While searching a pathname, an inode mutex can be acquired
    in do_lookup() which calls reiserfs_lookup() which in turn
    acquires the write lock.

    On the other side reiserfs_fill_super() can acquire the write_lock
    and then call reiserfs_lookup_privroot() which can acquire an
    inode mutex (the root of the mount point).

    So we theoretically risk an AB - BA lock inversion that could lead
    to a deadlock.

    As for other lock dependencies found since the bkl to mutex
    conversion, the fix is to use reiserfs_mutex_lock_safe() which
    drops the lock dependency to the write lock.

    [ Impact: fix a possible deadlock with reiserfs ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • reiserfs_mutex_lock_safe() is a hack to avoid any dependency between
    an internal reiserfs mutex and the write lock, it has been proposed
    to follow the old bkl logic.

    The code does the following:

    while (!mutex_trylock(m)) {
        reiserfs_write_unlock(s);
        schedule();
        reiserfs_write_lock(s);
    }

    It then imitates the implicit behaviour of the lock when it was
    a Bkl and had no such dependency:

    mutex_lock(m) {
        if (fastpath)
            let's go
        else {
            wait_for_mutex() {
                schedule() {
                    unlock_kernel()
                    reacquire_lock_kernel()
                }
            }
        }
    }

    The problem is that by using such explicit schedule(), we don't
    benefit from the adaptive mutex spinning on owner.

    The logic in use now is:

    reiserfs_write_unlock(s);
    mutex_lock(m); // -> possible adaptive spinning
    reiserfs_write_lock(s);

    [ Impact: restore the use of adaptive spinning mutexes in reiserfs ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Ingo Molnar
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
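
    The two locking patterns contrasted in the commit above can be written
    out with pthread mutexes. This is a userspace sketch under stated
    assumptions: lock_inner_polling() and lock_inner_safe() are hypothetical
    names, "outer" plays the role of the reiserfs write lock, "inner" plays
    the role of the internal mutex, sched_yield() stands in for schedule(),
    and pthread mutexes do not necessarily spin adaptively the way kernel
    mutexes do. Both helpers must be called with outer held and return with
    both locks held.

```c
#include <pthread.h>
#include <sched.h>

/* Old pattern: poll the inner mutex, dropping the outer lock between
 * attempts so that the current holder of inner can make progress. */
void lock_inner_polling(pthread_mutex_t *outer, pthread_mutex_t *inner)
{
    while (pthread_mutex_trylock(inner) != 0) {
        pthread_mutex_unlock(outer);
        sched_yield();
        pthread_mutex_lock(outer);
    }
}

/* New pattern: drop the outer lock, block on the inner mutex with a
 * plain lock call, then take the outer lock back. */
void lock_inner_safe(pthread_mutex_t *outer, pthread_mutex_t *inner)
{
    pthread_mutex_unlock(outer);
    pthread_mutex_lock(inner);
    pthread_mutex_lock(outer);
}
```

    In both variants the inner mutex is never waited for while the outer
    lock is held, so no outer -> inner dependency is created. The second
    variant simply lets the mutex implementation decide how to wait.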
     
  • flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
    ll_rw_block() might sleep, and the bkl was released at this point. Then
    we can also relax the write lock at this point.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • When do_journal_end() copies data to the journal blocks buffers in memory,
    it reschedules if needed between each block copied and dirtied.

    We can also release the write lock at this rescheduling stage,
    as the bkl did implicitly.

    [ Impact: release the reiserfs write lock when it is not needed ]

    Cc: Jeff Mahoney
    Cc: Chris Mason
    Cc: Alexander Beregalov
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Impact: fix a deadlock

    The j_flush_mutex is acquired safely in journal.c:
    if we can't take it, we free the reiserfs per superblock lock
    and wait a bit.

    But we have a remaining place in kupdate_transactions() where
    j_flush_mutex is still acquired traditionally. Thus the following
    scenario (warned by lockdep) can happen:

    A                               B

    mutex_lock(&write_lock)         mutex_lock(&write_lock)
    mutex_lock(&j_flush_mutex)      mutex_lock(&j_flush_mutex) //block
    mutex_unlock(&write_lock)
    sleep...
    mutex_lock(&write_lock) //deadlock

    Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Jeff Mahoney
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • This patch is an attempt to remove the Bkl based locking scheme from
    reiserfs and is intended.

    It is a bit inspired from an old attempt by Peter Zijlstra:

    http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html

    The bkl is heavily used in this filesystem to prevent
    concurrent write accesses to the filesystem.

    Reiserfs makes a deep use of the specific properties of the Bkl:

    - It can be acquired recursively by the same task
    - It is released on the schedule() calls and reacquired when schedule() returns

    The two properties above are a roadmap for the reiserfs write locking so it's
    very hard to simply replace it with a common mutex.

    - We need a recursive-able locking unless we want to restructure several blocks
    of the code.
    - We need to identify the sites where the bkl was implicitly relaxed
    (schedule, wait, sync, etc...) so that we can in turn release and
    reacquire our new lock explicitly.
    Such implicit releases of the lock are often required to let other
    resources producer/consumer do their job or we can suffer unexpected
    starvations or deadlocks.

    So the new lock that replaces the bkl here is a per superblock mutex with a
    specific property: it can be acquired recursively by the same task, like the
    bkl.

    For such purpose, we integrate a lock owner and a lock depth field on the
    superblock information structure.

    The first axis on this patch is to turn reiserfs_write_(un)lock() function
    into a wrapper to manage this mutex. Also some explicit calls to
    lock_kernel() have been converted to reiserfs_write_lock() helpers.

    The second axis is to find the important blocking sites (schedule...(),
    wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
    release of the write lock on these locations before blocking. Then we can
    safely wait for those who can give us resources or those who need some.
    Typically this is a fight between the current writer, the reiserfs workqueue
    (aka the async committer) and the pdflush threads.

    The third axis is a consequence of the second. The write lock is usually
    on top of a lock dependency chain which can include the journal lock, the
    flush lock or the commit lock. So it's dangerous to release and try to
    reacquire the write lock while we still hold other locks.

    This is fine with the bkl:

    T1                          T2

    lock_kernel()
    mutex_lock(A)
    unlock_kernel()
    // do something
                                lock_kernel()
                                mutex_lock(A) -> already locked by T1
                                schedule() (and then unlock_kernel())
    lock_kernel()
    mutex_unlock(A)
    ....

    This is not fine with a mutex:

    T1                          T2

    mutex_lock(write)
    mutex_lock(A)
    mutex_unlock(write)
    // do something
                                mutex_lock(write)
                                mutex_lock(A) -> already locked by T1
                                schedule()

    mutex_lock(write) -> already locked by T2
    deadlock

    The solution in this patch is to provide a helper which releases the write
    lock and sleeps a bit if we can't lock a mutex that depends on it. It's
    another simulation of the bkl behaviour.
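
    A minimal user-space sketch of such a helper, following the description
    above rather than the kernel code: lock_dependent() is a hypothetical
    name, plain pthread mutexes stand in for the write lock and the
    dependent lock, and sched_yield() stands in for "sleep a bit".

```c
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical helper. The caller holds `write` and wants `dep`.
 * Instead of blocking on `dep` while holding `write` (the deadlock
 * shown above), drop `write` whenever `dep` is busy so that the
 * holder of `dep` can take `write`, finish, and release `dep`.
 * Returns how many times the write lock had to be dropped.
 */
static int lock_dependent(pthread_mutex_t *write, pthread_mutex_t *dep)
{
    int retries = 0;

    while (pthread_mutex_trylock(dep) != 0) {
        pthread_mutex_unlock(write);  /* let the other task progress */
        sched_yield();                /* stand-in for a short sleep */
        pthread_mutex_lock(write);    /* take the write lock back */
        retries++;
    }
    /* Here both `write` and `dep` are held by the caller. */
    return retries;
}
```

    Because the dependent lock is only ever tried, never waited on, while
    the write lock is held, the T1/T2 interleaving above can no longer end
    with each task sleeping on the lock the other one holds.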

    The last axis is to locate the fs callbacks that are called with the bkl held,
    according to Documentation/filesystems/Locking.

    Those are:

    - reiserfs_remount
    - reiserfs_fill_super
    - reiserfs_put_super

    Reiserfs didn't need to explicitly lock because of the context of these callbacks.
    But now we must take care of that with the new locking.

    After this patch, reiserfs suffers from a slight performance regression (for now).
    On UP, a high volume write with dd reports an average of 27 MB/s instead
    of 30 MB/s without the patch applied.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Ingo Molnar
    Cc: Jeff Mahoney
    Cc: Peter Zijlstra
    Cc: Bron Gondwana
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Alexander Viro
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

11 Jul, 2009

1 commit


31 Mar, 2009

6 commits

  • This patch is a simple s/p_s_sb/sb/g to the reiserfs code. This is the
    first in a series of patches to rip out some of the awful variable
    naming in reiserfs.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • This patch strips trailing whitespace from the reiserfs code.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • This patch kills off reiserfs_journal_abort as it is never called, and
    combines __reiserfs_journal_abort_{soft,hard} into one function called
    reiserfs_abort_journal, which performs the same work. It is silent
    as opposed to the old version, since the message was always issued
    after a regular 'abort' message.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • ReiserFS panics can be somewhat inconsistent.
    In some cases:
    * a unique identifier may be associated with it
    * the function name may be included
    * the device may be printed separately

    This patch aims to make warnings more consistent. reiserfs_warning() prints
    the device name, so printing it a second time is not required. The function
    name for a warning is always helpful in debugging, so it is now automatically
    inserted into the output. Hans has stated that every warning should have
    a unique identifier. Some cases lack them, others really shouldn't have them.
    reiserfs_warning() now expects an id associated with each message. In the
    rare case where one isn't needed, "" will suffice.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • ReiserFS warnings can be somewhat inconsistent.
    In some cases:
    * a unique identifier may be associated with it
    * the function name may be included
    * the device may be printed separately

    This patch aims to make warnings more consistent. reiserfs_warning() prints
    the device name, so printing it a second time is not required. The function
    name for a warning is always helpful in debugging, so it is now automatically
    inserted into the output. Hans has stated that every warning should have
    a unique identifier. Some cases lack them, others really shouldn't have them.
    reiserfs_warning() now expects an id associated with each message. In the
    rare case where one isn't needed, "" will suffice.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     
  • This patch fixes up the reiserfs code such that transaction ids are
    always unsigned ints. In places they can currently be signed ints or
    unsigned longs.

    The former just causes an annoying clm-2200 warning and may join a
    transaction when it should wait.

    The latter is just for correctness since the disk format uses a 32-bit
    transaction id. There aren't any runtime problems that result from it
    not wrapping at the correct location since the value is truncated
    correctly even on big endian systems. The 0 value might make it to
    disk, but the mount-time checks will bump it to 10 itself.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

21 Oct, 2008

4 commits


05 Aug, 2008

2 commits

  • Like the page lock change, this also requires a name change, so convert the
    raw test_and_set bitop to a trylock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.
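
    The polarity change behind the rename can be sketched with C11 atomics.
    This is only an illustration: struct fake_page, fake_trylock() and
    fake_unlock() are hypothetical names, not the kernel's page flag
    helpers. A test-and-set returns the old bit (true means already
    locked), while a trylock returns true on success, hence the negation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a single lock bit. */
struct fake_page {
    atomic_flag locked;
};

/* Returns true if the lock was taken, false if it was already held. */
static bool fake_trylock(struct fake_page *p)
{
    /* test_and_set yields the previous value, so invert it. */
    return !atomic_flag_test_and_set_explicit(&p->locked,
                                              memory_order_acquire);
}

static void fake_unlock(struct fake_page *p)
{
    atomic_flag_clear_explicit(&p->locked, memory_order_release);
}
```

    The acquire ordering on the lock side and the release ordering on the
    unlock side reflect the general idea of lock-style bit operations.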

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

1 commit

  • j_commit_lock is a semaphore but is used as if it were a mutex. This patch
    converts it to a mutex.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jeff Mahoney
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Edward Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney