21 Feb, 2017

1 commit

  • Pull GFS2 updates from Robert Peterson:
    "We've got eight GFS2 patches for this merge window:

    - Andy Price submitted a patch to make gfs2_write_full_page a static
    function.

    - Dan Carpenter submitted a patch to fix an ERR_PTR thinko.

    Three patches fix bugs related to deleting very large files, which
    cause GFS2 to run out of journal space:

    - The first one prevents the GFS2 delete operation from requesting too
    much journal space.

    - The second one fixes a problem whereby GFS2 can hang because it
    wasn't taking journal space demand into its calculations.

    - The third one wakes up IO waiters when a flush is done to restart
    processes stuck waiting for journal space to become available.

    The final three patches are a performance improvement related to
    spin_lock contention between multiple writers:

    - The "tr_touched" variable was switched to a flag to be more atomic
    and eliminate the possibility of some races.

    - Function meta_lo_add was moved inline with its only caller to make
    the code more readable and efficient.

    - Contention on the gfs2_log_lock spinlock was greatly reduced by
    avoiding the lock altogether in cases where we don't really need
    it: buffers that already appear in the appropriate metadata list
    for the journal. Many thanks to Steve Whitehouse for the ideas and
    principles behind these patches"

    * tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    gfs2: Make gfs2_write_full_page static
    GFS2: Reduce contention on gfs2_log_lock
    GFS2: Inline function meta_lo_add
    GFS2: Switch tr_touched to flag in transaction
    GFS2: Wake up io waiters whenever a flush is done
    GFS2: Made logd daemon take into account log demand
    GFS2: Limit number of transaction blocks requested for truncates
    GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next

    Linus Torvalds
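
The switch of "tr_touched" from a plain variable to a flag can be illustrated with a small userspace sketch. This is not the GFS2 code: the struct, the flag name TR_FLAG_TOUCHED and both helpers are hypothetical, and C11 atomics stand in for the kernel's atomic bit operations. The point is only that an atomic read-modify-write lets exactly one writer observe the transition from "untouched" to "touched".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit; the name is illustrative, not taken from GFS2. */
#define TR_FLAG_TOUCHED 0x1u

struct trans {
    atomic_uint flags;   /* bit flags, updated only with atomic operations */
};

/*
 * Mark the transaction as touched. Returns true only for the caller that
 * actually flipped the bit, so first-touch work is done by a single writer.
 */
static bool trans_mark_touched(struct trans *tr)
{
    unsigned int old = atomic_fetch_or(&tr->flags, TR_FLAG_TOUCHED);

    return (old & TR_FLAG_TOUCHED) == 0;
}

/* Report whether any writer has touched the transaction so far. */
static bool trans_is_touched(struct trans *tr)
{
    return (atomic_load(&tr->flags) & TR_FLAG_TOUCHED) != 0;
}
```

With a plain integer, two writers could both read 0 and both conclude they were first; the fetch-or closes that window without taking a lock.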
     

27 Jan, 2017

1 commit


07 Jan, 2017

1 commit

  • Before this patch, if a process called function gfs2_log_reserve to
    reserve some journal blocks, but not enough journal blocks were
    free, it would call io_schedule. However, the log flush daemon
    woke up the waiters only if a gfs2_ail_flush was no longer
    required. This resulted in situations where processes would wait
    forever because the number of blocks required was so high that it
    pushed the journal into a perpetual state of flush being required.

    This patch changes the logd daemon so that it wakes up io waiters
    every time the log is actually flushed.

    Signed-off-by: Bob Peterson

    Bob Peterson
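
The change in wake-up policy described above can be sketched as two predicates. The function and parameter names are hypothetical, and the sketch assumes the old condition is kept alongside the new one; it models only the decision, not the daemon loop or the wait queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Old rule as described: wake waiters only when no AIL flush is required. */
static bool wake_waiters_old(bool ail_flush_reqd)
{
    return !ail_flush_reqd;
}

/*
 * New rule as described: also wake waiters whenever this pass of the
 * daemon actually flushed the log (did_flush is an assumed local flag).
 */
static bool wake_waiters_new(bool ail_flush_reqd, bool did_flush)
{
    return !ail_flush_reqd || did_flush;
}
```

In the stuck case from the commit message, an AIL flush is perpetually required, so the old predicate never fires, while the new one fires after every real flush and lets waiters retry their reservation.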
     

06 Jan, 2017

1 commit

  • Before this patch, the logd daemon only tried to flush things when
    the log blocks pinned exceeded a certain threshold. But when we're
    deleting very large files, it may require a huge number of journal
    blocks, and that, in turn, may exceed the threshold. This patch
    takes that demand into account.

    Signed-off-by: Bob Peterson

    Bob Peterson
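
A minimal sketch of the idea, assuming the flush decision is a simple comparison against a threshold. The struct and field names are hypothetical and the counters are plain integers rather than the kernel's atomics.

```c
#include <assert.h>
#include <stdbool.h>

struct log_counters {
    unsigned int pinned;   /* journal blocks currently pinned */
    unsigned int demand;   /* blocks that pending reservations are asking for */
    unsigned int thresh;   /* flush threshold */
};

/* Before: only pinned blocks are compared with the threshold. */
static bool flush_reqd_old(const struct log_counters *c)
{
    return c->pinned >= c->thresh;
}

/* After: outstanding demand counts toward the threshold as well. */
static bool flush_reqd_new(const struct log_counters *c)
{
    return c->pinned + c->demand >= c->thresh;
}
```

With few blocks pinned but a very large reservation pending, the old test stays false and nothing prompts a flush, while the new test becomes true.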
     

01 Nov, 2016

1 commit


08 Jun, 2016

1 commit

  • Separate the op from the rq_flag_bits and have gfs2
    set/get the bio using bio_set_op_attrs/bio_op.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

15 Dec, 2015

1 commit

  • When gfs2 was unmounting filesystems or changing them to read-only it
    was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This
    caused a race. If an inode glock got demoted in the gap between
    clearing the bit and the shutdown flush, it would be unable to reserve
    log space to clear out the active items list in inode_go_sync, causing an
    error in inode_go_inval because the glock was still dirty.

    To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
    shutdown log flush. This means that, because of the locking on the log
    blocks, either inode_go_sync will be able to reserve space to clean the
    glock before the shutdown flush, or the shutdown flush will clean the
    glock itself, before inode_go_sync fails to reserve the space. Either
    way, the glock will be clean before inode_go_inval.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on to a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

12 Mar, 2014

1 commit

  • Upstream commit 34cc178 changed a line of code from calling function
    log_flush_commit to calling log_write_header. This had the effect of
    eliminating a call to function log_flush_wait. That causes the journal
    to skip over log headers, which results in multiple wrap points,
    which itself leads to infinite loops in journal replay, both in the
    kernel code and fsck.gfs2 code. This patch re-adds that call.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

25 Feb, 2014

3 commits

  • By reordering some of the assignments in gfs2_log_flush() it
    is possible to remove one of the "if" statements as it can be
    merged with one higher up the function.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Now we have a master transaction into which other transactions
    are merged, the accounting can be done using this master
    transaction. We no longer require the superblock fields which
    were being used for this function.

    In addition, this allows for a clean up in calc_reserved()
    making it rather easier to understand. Also, by reducing the
    number of variables used to track the buffers being added
    and removed from the journal, a number of error checks are
    now no longer required.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Over time, we hope to be able to improve the concurrency available
    in the log code. This is one small step towards that, by moving
    the buffer lists from the super block, and into the transaction
    structure, so that each transaction builds its own buffer lists.

    At transaction commit time, the buffer lists are merged into
    the currently accumulating transaction. That transaction then
    is passed into the before and after commit functions at journal
    flush time. Thus there should be no change in overall behaviour
    yet.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
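
The per-transaction buffer lists and their merge at commit time can be sketched with a tail-pointer list. All names here are hypothetical and the kernel's own list primitives are not used; the sketch only shows that each transaction can build its list privately and hand it over in constant time.

```c
#include <assert.h>
#include <stddef.h>

struct bnode {
    struct bnode *next;
    int blk;                 /* stand-in for a journaled buffer */
};

struct blist {
    struct bnode *head;
    struct bnode **tail;     /* points at the link to fill in next */
};

static void blist_init(struct blist *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

/* A transaction appends to its own private list; no shared state is touched. */
static void blist_add_tail(struct blist *l, struct bnode *n)
{
    n->next = NULL;
    *l->tail = n;
    l->tail = &n->next;
}

/* At commit, move everything from src onto the end of dst in O(1). */
static void blist_splice_tail(struct blist *dst, struct blist *src)
{
    if (src->head == NULL)
        return;
    *dst->tail = src->head;
    dst->tail = src->tail;
    blist_init(src);
}
```

Only the splice would need to be serialised against other committers; building the list happens without contention.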
     

03 Feb, 2014

1 commit

  • When we do a flush of the AIL list, we are writing out what is
    likely to be a lot of small I/Os, which are possibly in an order
    which is not ideal performance-wise. Since this is done by calling
    filemap_fdatawrite for each individual inode's address space there
    is no overall plugging going on.

    In addition to that, we do not always wait for AIL i/o when we flush
    it, so that it is possible for things to get left behind on the queue.
    By adding explicit plugging here, we reduce the chances of this
    being an issue. A quick test using the AIL flush tracepoint shows a
    small, but measurable improvement.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Dec, 2013

1 commit

  • Function gfs2_remove_from_ail drops the reference on the bh via
    brelse. This patch fixes a race condition whereby bh is dereferenced
    after the brelse when setting bd->bd_blkno = bh->b_blocknr;
    Under certain rare circumstances, bh might be gone or reused,
    and bd->bd_blkno is set to whatever that memory happens to be,
    which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
    the test "bd->bd_blkno >= blkno" which causes it to never be freed.
    The end result is that the bd is never freed from the bufdata cache,
    which results in this error:
    slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

19 Jun, 2013

1 commit

  • This patch looks at all the outstanding blocks in all the transactions
    on the log, and moves the completed ones to the ail2 list. Then it
    issues revokes for these blocks. This will hopefully speed things up
    in situations where there is a lot of contention for glocks, especially
    if they are acquired serially.

    revoke_lo_before_commit will issue at most one log block's worth of these
    preemptive revokes. The amount of reserved log space that
    gfs2_log_reserve() ignores has been incremented to allow for this extra
    block.

    This patch also consolidates the common revoke instructions into one
    function, gfs2_add_revoke().

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

08 Apr, 2013

1 commit

  • In order to allow transactions and log flushes to happen at the same
    time, gfs2 needs to move the transaction accounting and active items
    list code into the gfs2_trans structure. As a first step toward this,
    this patch removes the gfs2_ail structure, and handles the active items
    list in the gfs2_trans structure. This keeps gfs2 from allocating an ail
    structure on log flushes, and gives us a structure that can later be used
    to store the transaction accounting outside of the gfs2 superblock
    structure.

    With this patch, at the end of a transaction, gfs2 will add the
    gfs2_trans structure to the superblock if there is not one already.
    This structure now has the active items fields that were previously in
    gfs2_ail. This is not necessary in the case where the transaction was
    simply used to add revokes, since these are never written outside of the
    journal, and thus, don't need an active items list.

    Also, in order to make sure that the transaction structure is not
    removed while it's still in use by gfs2_trans_end, unlocking the
    sd_log_flush_lock has to happen slightly later in ending the
    transaction.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

29 Jan, 2013

1 commit

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code used to conflict in its assumptions
    about how to write out the disk blocks with mpage_writepages()
    so that with this updated version we can also use mpage_writepages()
    for GFS2's ordered write, writepages implementation. So we will
    also send larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

02 May, 2012

1 commit

  • This patch eliminates the gfs2_log_element data structure and
    rolls its two components into the gfs2_bufdata. This makes the code
    easier to understand and makes it easier to migrate to a rbtree
    to keep the list sorted.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

24 Apr, 2012

4 commits

  • This patch removes a log lock from around an atomic operation where
    it is not needed, removes an unused variable, and also changes
    a void pointer used incorrectly to a struct page pointer.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This is another clean up in the logging code. This per-transaction
    list was largely unused. Its main function was to ensure that the
    number of buffers in a transaction was correct, however that counter
    was only used to check the number of buffers in the bd_list_tr, plus
    an assert at the end of each transaction. With the assert now changed
    to use the calculated buffer counts, we can remove both bd_list_tr and
    its associated counter.

    This should make the code easier to understand as well as shrinking
    a couple of structures.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Prior to this patch, we have two ways of sending i/o to the log.
    One of those is used when we need to allocate both the data
    to be written itself and also a buffer head to submit it. This
    is done via sb_getblk and friends. This is used mostly for writing
    log headers.

    The other method is used when writing blocks which have some
    in-place counterpart. This is the case for all the metadata
    blocks which are journalled, and when journaled data is in use,
    for unescaped journalled data blocks.

    This patch replaces both of those two methods, and about half
    a dozen separate i/o submission points with a single i/o
    submission function. We also go direct to bio rather than
    using buffer heads, since this allows us to build i/o
    requests of the maximum size for the block device in
    question. It also reduces the memory required for flushing
    the log, which can be very useful in low memory situations.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The "pull" argument to log_write_header() is only used
    for debug purposes and it is not really needed any more. There
    are other tests for this particular problem, so I think we can
    dispose of it in order to simplify the code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

09 Mar, 2012

1 commit

  • We already send both a pre and post flush to the block device
    when writing a journal header. There is no need to wait for
    the previous I/O specifically when we do this, unless we've
    turned "barriers" off.

    As a side effect, this also cleans up the code path for flushing
    the journal and makes it more readable.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

29 Feb, 2012

3 commits


09 Jan, 2012

1 commit

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     

22 Nov, 2011

1 commit

  • There is no reason to export two functions for entering the
    refrigerator. Calling refrigerator() instead of try_to_freeze()
    doesn't save anything noticeable or remove any race condition.

    * Rename refrigerator() to __refrigerator() and make it return bool
    indicating whether it scheduled out for freezing.

    * Update try_to_freeze() to return bool and relay the return value of
    __refrigerator() if freezing().

    * Convert all refrigerator() users to try_to_freeze().

    * Update documentation accordingly.

    * While at it, add might_sleep() to try_to_freeze().

    Signed-off-by: Tejun Heo
    Cc: Samuel Ortiz
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Steven Whitehouse
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: Christoph Hellwig

    Tejun Heo
     

08 Nov, 2011

1 commit

  • Christoph has split up REQ_PRIO from REQ_META. That means that
    we can drop REQ_PRIO from places where it is not needed. I'm
    not at all sure that the combination WRITE_FLUSH_FUA | REQ_PRIO
    makes any kind of sense, anyway.

    In addition, I've added REQ_META to one place in the code where
    it was missing. REQ_PRIO has been left for read/writes triggered
    by glock acquisition and writeback only. We can adjust it again
    if required, but these are the most important points from a
    performance perspective.

    Signed-off-by: Steven Whitehouse
    Cc: Christoph Hellwig

    Steven Whitehouse
     

23 Aug, 2011

1 commit

  • Add a new REQ_PRIO to let requests preempt others in the cfq I/O scheduler,
    and leave REQ_META purely for marking requests as metadata in blktrace.

    All existing callers of REQ_META except for XFS are updated to also
    set REQ_PRIO for now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Namhyung Kim
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Jul, 2011

1 commit

  • This patch contains a few misc fixes which resolve a recently
    reported issue. This patch has been a real team effort and has
    received a lot of testing.

    The first issue is that the ail lock needs to be held over a few
    more operations. The lock that's added into gfs2_releasepage() may
    possibly be a candidate for replacing with RCU at some future
    point, but at this stage we've gone for the obvious fix.

    The second issue is that gfs2_write_inode() can end up calling
    a glock recursively when called from gfs2_evict_inode() via the
    syncing code, so it needs a guard added.

    The third issue is that we either need to not truncate the metadata
    pages of inodes which have zero link count, but which we cannot
    deallocate due to them still being in use by other nodes, or we need
    to ensure that those pages have all made it through the journal and
    ail lists first. This patch takes the former approach, but the
    latter has also been tested and there is nothing to choose between
    them performance-wise. So again, we could revise that decision
    in the future.

    Also, the inode eviction process is now better documented.

    Signed-off-by: Steven Whitehouse
    Tested-by: Bob Peterson
    Tested-by: Abhijith Das
    Reported-by: Barry J. Marson
    Reported-by: David Teigland

    Steven Whitehouse
     

22 May, 2011

1 commit

  • The ail flush code has always relied upon log flushing to prevent
    it from spinning needlessly. This fixes it to wait on the last
    I/O request submitted (we don't need to wait for all of it)
    instead of either spinning with io_schedule or sleeping.

    As a result cpu usage of gfs2_logd is much reduced with certain
    workloads.

    Reported-by: Abhijith Das
    Tested-by: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

03 May, 2011

1 commit

  • In the recent patches to update the AIL list code, I managed to
    forget that the ail list lock got dropped, even though I
    added a comment specifically to remind myself :(

    Reported-by: Barry Marson
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

20 Apr, 2011

3 commits


25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

14 Mar, 2011

1 commit


11 Mar, 2011

1 commit

  • The log lock is currently used to protect the AIL lists and
    the movements of buffers into and out of them. The lists
    are self contained and no log specific items outside the
    lists are accessed when starting or emptying the AIL lists.

    Hence the operation of the AIL does not require the protection
    of the log lock so split them out into a new AIL specific lock
    to reduce the amount of traffic on the log lock. This will
    also reduce the amount of serialisation that occurs when
    the gfs2_logd pushes on the AIL to move it forward.

    This reduces the impact of log pushing on sequential write
    throughput.

    Signed-off-by: Dave Chinner
    Signed-off-by: Steven Whitehouse

    Dave Chinner