25 Dec, 2015
1 commit
-
When gfs2 releases the glock of an inode, it must invalidate all
information cached for that inode, including the page cache and acls.
Use the new security_inode_invalidate_secctx hook to also invalidate
security labels in that case. These items will be reread from disk
when needed after reacquiring the glock.Signed-off-by: Andreas Gruenbacher
Acked-by: Bob Peterson
Acked-by: Steven Whitehouse
Cc: cluster-devel@redhat.com
[PM: fixed spelling errors and description line lengths]
Signed-off-by: Paul Moore
30 Oct, 2015
1 commit
-
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
gl_lockref). Remove that define to make the references to gl_lockref.lock more
obvious.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Bob Peterson
04 Sep, 2015
1 commit
-
What uniquely identifies a glock in the glock hash table is not
gl_name, but gl_name and its superblock pointer. This patch makes
the gl_name field correspond to a unique glock identifier. That will
allow us to simplify hashing with a future patch, since the hash
algorithm can then take the gl_name and hash its components in one
operation.Signed-off-by: Bob Peterson
Signed-off-by: Andreas Gruenbacher
Acked-by: Steven Whitehouse
19 Jun, 2015
2 commits
-
This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.Signed-off-by: Bob Peterson
Reviewed-by: Steven Whitehouse
Reviewed-by: Benjamin Marzinski -
The glocks used for resource groups often come and go hundreds of
thousands of times per second. Adding them to the lru list just
adds unnecessary contention for the lru_lock spin_lock, especially
considering we're almost certainly going to re-use the glock and
take it back off the lru microseconds later. We never want the
glock shrinker to cull them anyway. This patch adds a new bit in
the glops that determines which glock types get put onto the lru
list and which ones don't.Signed-off-by: Bob Peterson
Acked-by: Steven Whitehouse
17 Nov, 2014
1 commit
-
The current gfs2 freezing code is considerably more complicated than it
should be because it doesn't use the vfs freezing code on any node except
the one that begins the freeze. This is because it needs to acquire a
cluster glock before calling the vfs code to prevent a deadlock, and
without the new freeze_super and thaw_super hooks, that was impossible. To
deal with the issue, gfs2 had to do some hacky locking tricks to make sure
that a frozen node couldn't be holding on a lock it needed to do the
unfreeze ioctl.This patch makes use of the new hooks to simply the gfs2 locking code. Now,
all the nodes in the cluster freeze and thaw in exactly the same way. Every
node in the cluster caches the freeze glock in the shared state. The new
freeze_super hook allows the freezing node to grab this freeze glock in
the exclusive state without first calling the vfs freeze_super function.
All the nodes in the cluster see this lock change, and call the vfs
freeze_super function. The vfs locking code guarantees that the nodes can't
get stuck holding the glocks necessary to unfreeze the system. To
unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
glock. Again, all the nodes notice this, reacquire the glock in shared mode
and call the vfs thaw_super function.Signed-off-by: Benjamin Marzinski
Signed-off-by: Steven Whitehouse
08 Oct, 2014
1 commit
-
use macro definition
Signed-off-by: Fabian Frederick
Signed-off-by: Steven Whitehouse
18 Jul, 2014
1 commit
-
Signed-off-by: Geert Uytterhoeven
Cc: cluster-devel@redhat.com
Signed-off-by: Steven Whitehouse
04 Jun, 2014
1 commit
-
…teve/gfs2-3.0-nmw into next
Pull gfs2 updates from Steven Whitehouse:
"This must be about the smallest merge window patch set ever for GFS2.
It is probably also the first one without a single patch from me.
That is down to a combination of factors, and I have some things in
the works that are not quite ready yet, that I hope to put in next
time around.Returning to what is here this time... we have 3 patches which fix
various warnings. Two are bug fixes (for quotas and also a rare
recovery race condition). The final patch, from Ben Marzinski, is an
important change in the freeze code which has been in progress for
some time. This removes the need to take and drop the transaction
lock for every single transaction, when the only time it was used, was
at file system freeze time. Ben's patch integrates the freeze
operation into the journal flush code as an alternative with lower
overheads and also lands up resolving some difficult to fix races at
the same time"* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Prevent recovery before the local journal is set
GFS2: fs/gfs2/file.c: kernel-doc warning fixes
GFS2: fs/gfs2/bmap.c: kernel-doc warning fixes
GFS2: remove transaction glock
GFS2: lops.c: replace 0 by NULL for pointers
GFS2: quotas not being refreshed in gfs2_adjust_quota
14 May, 2014
1 commit
-
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In it's place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
the and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesytem simply involes dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.Signed-off-by: Benjamin Marzinski
Signed-off-by: Steven Whitehouse
18 Apr, 2014
1 commit
-
Mostly scripted conversion of the smp_mb__* barriers.
Signed-off-by: Peter Zijlstra
Acked-by: Paul E. McKenney
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar
25 Feb, 2014
1 commit
-
Over time, we hope to be able to improve the concurrency available
in the log code. This is one small step towards that, by moving
the buffer lists from the super block, and into the transaction
structure, so that each transaction builds its own buffer lists.At transaction commit time, the buffer lists are merged into
the currently accumulating transaction. That transaction then
is passed into the before and after commit functions at journal
flush time. Thus there should be no change in overall behaviour
yet.Signed-off-by: Steven Whitehouse
16 Jan, 2014
1 commit
-
Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?Signed-off-by: Steven Whitehouse
Cc: Al Viro
03 Jan, 2014
2 commits
-
Prior to this patch, GFS2 had one address space for each rgrp,
stored in the glock. This patch changes them to use a single
address space in the super block. This therefore saves
(sizeof(struct address_space) * nr_of_rgrps) bytes of memory
and for large filesystems, that can be significant.It would be nice to be able to do something similar and merge
the inode metadata address space into the same global
address space. However, that is rather more complicated as the
on-disk location doesn't have a 1:1 mapping with the inodes in
general. So while it could be done, it will be a more complicated
operation as it requires changing a lot more code paths.Signed-off-by: Steven Whitehouse
-
Each rgrp header is represented as a single extent on disk, so we
can calculate the position within the address space, since we are
using address spaces mapped 1:1 to the disk. This means that it
is possible to use the range based versions of filemap_fdatawrite/wait
and for invalidating the page cache.Our eventual intent is to then be able to merge the address spaces
used for rgrps into a single address space, rather than to have
one for each glock, saving memory and reducing complexity.Since during umount, the rgrp structures are disposed of before
the glocks, we need to store the extent information in the glock
so that is is available for a final invalidation. This patch uses
a field which is otherwise unused in rgrp glocks to do that, so
that we do not have to expand the size of a glock.Signed-off-by: Steven Whitehouse
20 Dec, 2013
1 commit
-
We need to wait for any outstanding DIO to complete in a couple
of situations. Firstly, in case we are changing out of deferred
mode (in inode_go_sync) where GLF_DIRTY will not be set. That
call could be prefixed with a test for gl_state == LM_ST_DEFERRED
but it doesn't seem worth it bearing in mind that the test for
outstanding DIO is very quick anyway, in the usual case that there
is none.The second case is in inode_go_lock which will catch the cases
where we have a cached EX lock, but where we grant deferred locks
against it so that there is no glock state transistion. We only
need to wait if the state is not deferred, since DIO is valid
anyway in that state.Signed-off-by: Steven Whitehouse
15 Oct, 2013
1 commit
-
Currently glocks have an atomic reference count and also a spinlock
which covers various internal fields, such as the state. This intent of
this patch is to replace the spinlock and the atomic reference count
with a lockref structure. This contains a spinlock which we can continue
to use as before, and a reference counter which is used in conjuction
with the spinlock to replace the previous atomic counter.As a result of this there are some new rules for reference counting on
glocks. We need to distinguish between reference count changes under
gl_spin (which are now just increment or decrement of the new counter,
provided the count cannot hit zero) and those which are outside of
gl_spin, but which now take gl_spin internally.The conversion is relatively straight forward. There is probably some
further clean up which can be done, but the priority at this stage is to
make the change in as simple a manner as possible.A consequence of this change is that the reference count is being
decoupled from the lru list processing. This should allow future
adoption of the lru_list code with glocks in due course.The reason for using the "dead" state and not just relying on 0 being
the "invalid state" is so that in due course 0 ref counts can be
allowable. The intent is to eventually be able to remove the ref count
changes which are currently hidden away in state_change().Signed-off-by: Steven Whitehouse
19 Aug, 2013
1 commit
-
When run during fsync, a gfs2_log_flush could happen between the
time when gfs2_ail_flush checked the number of blocks to revoke,
and when it actually started the transaction to do those revokes.
This occassionally caused it to need more revokes than it reserved,
causing gfs2 to crash.Instead of just reserving enough revokes to handle the blocks that
currently need them, this patch makes gfs2_ail_flush reserve the
maximum number of revokes it can, without increasing the total number
of reserved log blocks. This patch also passes the number of reserved
revokes to __gfs2_ail_flush() so that it doesn't go over its limit
and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
will still cause a BUG() necessary revokes are skipped.Signed-off-by: Benjamin Marzinski
Signed-off-by: Steven Whitehouse
19 Jun, 2013
1 commit
-
This patch looks at all the outstanding blocks in all the transactions
on the log, and moves the completed ones to the ail2 list. Then it
issues revokes for these blocks. This will hopefully speed things up
in situations where there is a lot of contention for glocks, especially
if they are acquired serially.revoke_lo_before_commit will issue at most one log block's full of these
preemptive revokes. The amount of reserved log space that
gfs2_log_reserve() ignores has been incremented to allow for this extra
block.This patch also consolidates the common revoke instructions into one
function, gfs2_add_revoke().Signed-off-by: Benjamin Marzinski
Signed-off-by: Steven Whitehouse
10 Apr, 2013
1 commit
-
This patch adds a bool indicating whether the demote
request was originated locally or remotely. This is then
used by the iopen ->go_callback() to make 100% sure that
it will only respond to remote callbacks.Since ->evict_inode() uses GL_NOCACHE when it attempts to
get an exclusive lock on the iopen lock, this may result
in extra scheduling of the workqueue in case that the
exclusive promotion request failed. This patch prevents
that from happening.Signed-off-by: Steven Whitehouse
13 Feb, 2013
1 commit
-
When reading dinodes from the disk convert uids and gids
into kuids and kgids to store in vfs data structures.When writing to dinodes to the disk convert kuids and kgids
in the in memory structures into plain uids and gids.For now all on disk data structures are assumed to be
stored in the initial user namespace.Cc: Steven Whitehouse
Signed-off-by: "Eric W. Biederman"
15 Nov, 2012
1 commit
-
Save the effort of allocating, reading and writing
the lvb for most glocks that do not use it.Signed-off-by: David Teigland
Signed-off-by: Steven Whitehouse
07 Nov, 2012
2 commits
-
[Editorial: This is a nit, but has been a minor irritation for a long time:]
This patch renames glops structure item for go_xmote_th to go_sync.
The functionality is unchanged; it's just for readability.Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse -
Two of the bug traps here could really be warnings. The others are
converted from BUG() to GLOCK_BUG_ON() since we'll most likely
need to know the glock state in order to debug any issues which
arise. As a result of this, __dump_glock has to be renamed and
is no longer static.Signed-off-by: Steven Whitehouse
24 Sep, 2012
1 commit
-
gfs2_ail_empty_gl() contains an "inline version" of gfs2_trans_begin(),
so it needs an explicit sb_start_intwrite() as well, to balance the
sb_end_intwrite() which will be called by gfs2_trans_end().With this, xfstest 068 passes on lock_nolock local gfs2.
Without it, we reach a writer count of -1 and get stuck.Signed-off-by: Eric Sandeen
Signed-off-by: Steven Whitehouse
08 May, 2012
1 commit
-
This patch removes a redundant metadata block check. See description below.
Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse
24 Apr, 2012
1 commit
-
This is another clean up in the logging code. This per-transaction
list was largely unused. Its main function was to ensure that the
number of buffers in a transaction was correct, however that counter
was only used to check the number of buffers in the bd_list_tr, plus
an assert at the end of each transaction. With the assert now changed
to use the calculated buffer counts, we can remove both bd_list_tr and
its associated counter.This should make the code easier to understand as well as shrinking
a couple of structures.Signed-off-by: Steven Whitehouse
02 Nov, 2011
1 commit
-
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.Signed-off-by: Miklos Szeredi
Tested-by: Toshiyuki Okajima
Signed-off-by: Christoph Hellwig
21 Oct, 2011
5 commits
-
Unfortunately, it is not enough to just ignore locked buffers during
the AIL flush from fsync. We need to be able to ignore all buffers
which are locked, dirty or pinned at this stage as they might have
been added subsequent to the log flush earlier in the fsync function.In addition, this means that we no longer need to rely on i_mutex to
keep out writes during fsync, so we can, as a side-effect, remove
that protection too.Signed-off-by: Steven Whitehouse
Tested-By: Abhijith Das -
Since we have ruled out supporting online filesystem shrink,
it is possible to make the resource group list append only
during the life of a super block. This gives several benefits:Firstly, we only need to read new rindex elements as they are added
rather than needing to reread the whole rindex file each time one
element is added.Secondly, the rindex glock can be held for much shorter periods of
time, and is completely removed from the fast path for allocations.
The lock is taken in shared mode only when updating the resource
groups when the first allocation occurs, and after a grow has
taken place.Thirdly, this results in a reduction in code size, and everything
gets a lot simpler to understand in this area.Signed-off-by: Steven Whitehouse
-
Here is an update of Bob's original rbtree patch which, in addition, also
resolves the rather strange ref counting that was being done relating to
the bitmap blocks.Originally we had a dual system for journaling resource groups. The metadata
blocks were journaled and also the rgrp itself was added to a list. The reason
for adding the rgrp to the list in the journal was so that the "repolish
clones" code could be run to update the free space, and potentially send any
discard requests when the log was flushed. This was done by comparing the
"cloned" bitmap with what had been written back on disk during the transaction
commit.Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
until the journal had been flushed. For that reason, there was a rather
complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
count on the buffers.However, the journal maintains a reference count on the buffers anyway, since
they are being journaled as metadata buffers. So by moving the code which deals
with the post-journal accounting for bitmap blocks to the metadata journaling
code, we can entirely dispense with the rather strange buffer ref counting
scheme and also the requirement to journal the rgrps.The net result of all this is that the ->sd_rindex_spin is left to do exactly
one job, and that is to look after the rbtree or rgrps.This patch is designed to be a stepping stone towards using RCU for the rbtree
of resource groups, however the reduction in the number of uses of the
->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
anyway.The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also
be removed in future in favour of calling the functions directly where required
in the code. That will allow locking of resource groups without needing to
actually read them in - something that could be useful in speeding up statfs.In the mean time though it is valid to dereference ->bi_bh only when the rgrp
is locked. This is basically the same rule as before, modulo the references not
being valid until the following journal flush.Signed-off-by: Steven Whitehouse
Signed-off-by: Bob Peterson
Cc: Benjamin Marzinski -
Journaled data requires that a complete flush of all dirty data for
the file is done, in order that the ail flush which comes after
will succeed.Also the recently enhanced bug trap can trigger falsely in case
an ail flush from fsync races with a page read. This updates the
bug trap such that it will ignore buffers which are locked and
only trigger on dirty and/or pinned buffers when the ail flush
is run from fsync. The original bug trap is retained when ail
flush is run from ->go_sync()Signed-off-by: Steven Whitehouse
-
The assert was being tested under the wrong lock, a
legacy of the original code. Also, if it does trigger,
the resulting information was not always a lot of help.This moves the patch under the correct lock and also
prints out more useful information in tacking down the
source of the problem.Signed-off-by: Steven Whitehouse
15 Jul, 2011
3 commits
-
This adds S_NOSEC support to GFS2. We set/reset the flag either when
a user calls setattr or when we have just regained the glock
from another node. The flag is only set if there are no xattrs
on the inode and there is no suid bit set.Signed-off-by: Steven Whitehouse
Reviewed-by: Andi Kleen
Cc: Al Viro -
This patch is a performance improvement for GFS2 in a clustered
environment. It makes the glock hold time self-adjusting.Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse -
This patch adds a cache for the hash table to the directory code
in order to help simplify the way in which the hash table is
accessed. This is intended to be a first step towards introducing
some performance improvements in the directory code.There are two follow ups that I'm hoping to see fairly shortly. One
is to simplify the hash table reading code now that we always read the
complete hash table, whether we want one entry or all of them. The
other is to introduce readahead on the heads of the hash chains
which are referred to from the table.The hash table is a maximum of 128k in size, so it is not worth trying
to read it in small chunks.Signed-off-by: Steven Whitehouse
14 Jul, 2011
1 commit
-
This patch contains a few misc fixes which resolve a recently
reported issue. This patch has been a real team effort and has
received a lot of testing.The first issue is that the ail lock needs to be held over a few
more operations. The lock thats added into gfs2_releasepage() may
possibly be a candidate for replacing with RCU at some future
point, but at this stage we've gone for the obvious fix.The second issue is that gfs2_write_inode() can end up calling
a glock recursively when called from gfs2_evict_inode() via the
syncing code, so it needs a guard added.The third issue is that we either need to not truncate the metadata
pages of inodes which have zero link count, but which we cannot
deallocate due to them still being in use by other nodes, or we need
to ensure that those pages have all made it through the journal and
ail lists first. This patch takes the former approach, but the
latter has also been tested and there is nothing to choose between
them performance-wise. So again, we could revise that decision
in the future.Also, the inode eviction process is now better documented.
Signed-off-by: Steven Whitehouse
Tested-by: Bob Peterson
Tested-by: Abhijith Das
Reported-by: Barry J. Marson
Reported-by: David Teigland
12 Jul, 2011
1 commit
-
Right now, there is nothing that forces the log to get flushed when a node
drops its rindex glock so that another node can grow the filesystem. If the
log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the
following way.A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is
dropped so the other node can grow the filesystem. When the node reacquires the
rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being
removed from the list by gfs2_log_flush().This code simply forces a log flush when the rindex glock is invalidated,
solving the problem.Signed-off-by: Benjamin Marzinski
Signed-off-by: Steven Whitehouse
09 May, 2011
1 commit
-
Eventually there will only be a single caller of this code, so lets
move it where it can be made static at some future date.Signed-off-by: Steven Whitehouse
20 Apr, 2011
1 commit
-
This patch is designed to clean up GFS2's fsync
implementation and ensure that it really does get everything on
disk. Since ->write_inode() has been updated, we can call that
via the vfs library function sync_inode_metadata() and the only
remaining thing that has to be done is to ensure that we get
any revoke records in the log after the inode has been written back.Signed-off-by: Steven Whitehouse