08 Jun, 2016

2 commits


05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Jan, 2016

1 commit


20 Nov, 2014

1 commit


14 Oct, 2014

1 commit


12 Aug, 2014

1 commit

  • If do_journal_release() races with do_journal_end() which requeues
    delayed works for transaction flushing, we can leave work items for
    flushing outstanding transactions queued while freeing them. That
    results in use after free and possible crash in run_timers_softirq().

    Fix the problem by not requeueing works if superblock is being shut down
    (MS_ACTIVE not set) and using cancel_delayed_work_sync() in
    do_journal_release().

    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     

07 May, 2014

8 commits


24 Sep, 2013

2 commits

  • There are two locks involved in managing the journal lists. The general
    reiserfs_write_lock and the journal->j_flush_mutex.

    While flush_journal_list is sleeping to acquire the j_flush_mutex or to
    submit a block for write, it will drop the write lock. This allows
    another thread to acquire the write lock and ultimately call
    flush_used_journal_lists to traverse the list of journal lists and
    select one for flushing. It can select the journal_list that has just
    had flush_journal_list called on it in the original thread and call it
    again with the same journal_list.

    The second thread then drops the write lock to acquire j_flush_mutex and
    the first thread reacquires it and continues execution and eventually
    clears and frees the journal list before dropping j_flush_mutex and
    returning.

    The second thread acquires j_flush_mutex and ends up operating on a
    journal_list that has already been released. If the memory hasn't
    been reused, we'll soon after hit a BUG_ON because the transaction id
    has already been cleared. If it's been reused, we'll crash in other
    fun ways.

    Since flush_journal_list will synchronize on j_flush_mutex, we can fix
    the race by taking a proper reference in flush_used_journal_lists
    and checking to see if it's still valid after the mutex is taken. It's
    safe to iterate the list of journal lists and pick a list with
    just the write lock as long as a reference is taken on the journal list
    before we drop the lock. We already have code to handle whether a
    transaction has been flushed already so we can use that to handle the
    race and get rid of the trans_id BUG_ON.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara

    Jeff Mahoney
     
  • Commit a3172027 introduced test_transaction as a requirement for
    flushing old lists -- but it can never return 1 unless the transaction
    has already been flushed.

    As a result, we have a routine that iterates the j_realblocks list but
    doesn't actually do anything. Since it's been this way since 2006 and
    the latency numbers were what Chris expected, let's just rip it out.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara

    Jeff Mahoney
     

09 Aug, 2013

1 commit

  • The reiserfs write lock replaced the BKL and uses similar semantics.

    Frederic's locking code makes a distinction between when the lock is nested
    and when it's being acquired/released, but I don't think that's the right
    distinction to make.

    The right distinction is between the lock being released at end-of-use and
    the lock being released for a schedule. The unlock should return the depth
    and the lock should restore it, rather than the other way around as it is now.

    This patch implements that and adds a number of places where the lock
    should be dropped.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     

07 May, 2013

1 commit


01 Jun, 2012

3 commits

  • This patch stops reiserfs using the VFS 'write_super()' method along with the
    s_dirt flag, because they are on their way out.

    The whole "superblock write-out" VFS infrastructure is served by the
    'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds and
    writes out all dirty superblock using the '->write_super()' call-back. But the
    problem with this thread is that it wastes power by waking up the system every
    5 seconds, even if there are no diry superblocks, or there are no client
    file-systems which would need this (e.g., btrfs does not use
    '->write_super()'). So we want to kill it completely and thus, we need to make
    file-systems to stop using the '->write_super()' VFS service, and then remove
    it together with the kernel thread.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     
  • The 'journal_mark_dirty()' function currently first marks the superblock as
    dirty by setting 's_dirt' to 1, then does various sanity checks and returns,
    then actuall does all the magic with the journal.

    This is not an ideal order, though. It makes more sense to first do all the
    checks, then do all the internal stuff, and at the end notify the VFS that the
    superblock is now dirty.

    This patch moves the 's_dirt = 1' assignment from the very beginning of this
    function to the very end.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     
  • Turn 'reiserfs_flush_old_commits()' into a void function because the callers
    do not cares about what it returns anyway.

    We are going to remove the 'sb->s_dirt' field completely and this patch is a
    small step towards this direction.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

29 Mar, 2012

1 commit


21 Mar, 2012

1 commit


11 Jan, 2012

2 commits


15 Sep, 2011

1 commit


12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Lets kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

31 Mar, 2011

1 commit


01 Feb, 2011

1 commit


14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

18 Nov, 2010

1 commit


13 Nov, 2010

2 commits

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions are with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     
  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open if for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if is @FMODE_EXCL specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
     

10 Sep, 2010

1 commit

  • Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
    detection for barriers. Note that reiserfs had a fairly different code
    path for barriers before as it wa the only filesystem actually making use
    of them. The new code always uses the old non-barrier codepath and just
    sets the WRITE_FLUSH_FUA explicitly for the journal commits.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jan Kara
    Acked-by: Chris Mason
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Aug, 2010

1 commit

  • These flags aren't real I/O types, but tell ll_rw_block to always
    lock the buffer instead of giving up on a failed trylock.

    Instead add a new write_dirty_buffer helper that implements this semantic
    and use it from the existing SWRITE* callers. Note that the ll_rw_block
    code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
    this patch fixes.

    In the ufs code clean up the helper that used to call ll_rw_block
    to mirror sync_dirty_buffer, which is the function it implements for
    compound buffers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

11 Aug, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit

  • The reiserfs journal behaves inconsistently when determining whether to
    allow a mount of a read-only device.

    This is due to the use of the continue_replay variable to short circuit
    the journal scanning. If it's set, it's assumed that there are
    transactions to replay, but there may not be. If it's unset, it's assumed
    that there aren't any, and that may not be the case either.

    I've observed two failure cases:
    1) Where a clean file system on a read-only device refuses to mount
    2) Where a clean file system on a read-only device passes the
    optimization and then tries writing the journal header to update
    the latest mount id.

    The former is easily observable by using a freshly created file system on
    a read-only loopback device.

    This patch moves the check into journal_read_transaction, where it can
    bail out before it's about to replay a transaction. That way it can go
    through and skip transactions where appropriate, yet still refuse to mount
    a file system with outstanding transactions.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney