21 Jul, 2011
1 commit
-
Remove the second parameter to xfs_sb_count() since all callers of
the function set them.Also, fix the header comment regarding it being called periodically.
Signed-off-by: Chandra Seetharaman
Signed-off-by: Alex Elder
25 May, 2011
1 commit
-
Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard. Note that we don't bother
supporting discard for the non-delaylog mode.Signed-off-by: Christoph Hellwig
Signed-off-by: Alex Elder
08 Apr, 2011
3 commits
-
Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.Background inode reclaim is kicked by the act of marking inodes for
reclaim. When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is
Reviewed-by: Christoph Hellwig
Reviewed-by: Alex Elder -
On of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. Thi can result in hundreds of queued flushes, most of
which simply burn CPU scanned and do no real work. This simply slows
down allocation at ENOSPC.We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Alex Elder -
All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can al be
managed by a global workqueue now they manage concurrency
effectively.Introduce a new gglobal xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Alex Elder
04 Jan, 2011
1 commit
-
Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088and for 16 concurrent 16GB file writes:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
> 6)This should reduce the amount of space held in speculative
preallocation for such cases.The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.Signed-off-by: Dave Chinner
19 Oct, 2010
4 commits
-
We're not actually passing around credentials inside XFS for a while
now, so remove all xfs_cred.h with it's cred_t typedef and all
instances of it.Signed-off-by: Christoph Hellwig
Signed-off-by: Alex Elder -
Export xfs_icsb_modify_counters and always use it for modifying
the per-cpu counters. Remove support for per-cpu counters from
xfs_mod_incore_sb to simplify it.Signed-off-by: Christoph Hellwig
Signed-off-by: Alex Elder -
Fail the mount if we can't allocate memory for the per-CPU counters.
This is consistent with how we handle everything else in the mount
path and makes the superblock counter modification a lot simpler.Signed-off-by: Christoph Hellwig
Signed-off-by: Alex Elder -
The reclaim walk requires different locking and has a slightly
different walk algorithm, so separate it out so that it can be
optimised separately.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Alex Elder
27 Jul, 2010
2 commits
-
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op. Warn the users about this and remove
the mount flag for it.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner -
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
20 Jul, 2010
1 commit
-
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes. This significantly reduces
scanning overhead when lots of filesystems are present.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
24 May, 2010
1 commit
-
The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion. This includes adding a log ticket for tracking space
log required/used by the context checkpoint.To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out. This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Alex Elder
30 Apr, 2010
1 commit
-
On low memory boxes or those with highmem, kernel can OOM before the
background reclaims inodes via xfssyncd. Add a shrinker to run inode
reclaim so that it inode reclaim is expedited when memory is low.This is more complex than it needs to be because the VM folk don't
want a context added to the shrinker infrastructure. Hence we need
to add a global list of XFS mount structures so the shrinker can
traverse them.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
06 Mar, 2010
1 commit
03 Mar, 2010
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: add __percpu sparse annotations to what's left
percpu: add __percpu sparse annotations to fs
percpu: add __percpu sparse annotations to core kernel subsystems
local_t: Remove leftover local.h
this_cpu: Remove pageset_notifier
this_cpu: Page allocator conversion
percpu, x86: Generic inc / dec percpu instructions
local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
module: Use this_cpu_xx to dynamically allocate counters
local_t: Remove cpu_local_xx macros
percpu: refactor the code in pcpu_[de]populate_chunk()
percpu: remove compile warnings caused by __verify_pcpu_ptr()
percpu: make accessors check for percpu pointer in sparse
percpu: add __percpu for sparse.
percpu: make access macros universal
percpu: remove per_cpu__ prefix.
02 Mar, 2010
1 commit
-
Move the two declarations to better fitting headers now that
xfs_lrw.c is gone.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Signed-off-by: Alex Elder
17 Feb, 2010
1 commit
-
Add __percpu sparse annotations to fs.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Trond Myklebust
Cc: Alex Elder
Cc: Christoph Hellwig
Cc: Alexander Viro
09 Feb, 2010
1 commit
-
This mangles the reserved blocks counts a little more.
1) add a helper function for the default reserved count
2) add helper functions to save/restore counts on ro/rw
3) save/restore reserved blocks on freeze/thaw
4) disallow changing reserved count while readonlyV2: changed field name to match Dave's changes
Signed-off-by: Eric Sandeen
Signed-off-by: Alex Elder
26 Jan, 2010
1 commit
-
If we hold onto reserved blocks when doing a remount,ro we end
up writing the blocks used count to disk that includes the reserved
blocks. Reserved blocks are not actually used, so this results in
the values in the superblock being incorrect.Hence if we run xfs_check or xfs_repair -n while the filesystem is
mounted remount,ro we end up with an inconsistent filesystem being
reported. Also, running xfs_copy on the remount,ro filesystem will
result in an inconsistent image being generated.To fix this, unreserve the blocks when doing the remount,ro, and
reserved them again on remount,rw. This way a remount,ro filesystem
will appear consistent on disk to all utilities.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
20 Jan, 2010
1 commit
-
dmops uses a signed char for it's namespace event. To be consistent
with the rest of the code, convert them to unsigned char for the
namespace string.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
16 Jan, 2010
4 commits
-
Uninline xfs_perag_{get,put} so that tracepoints can be inserted
into them to speed debugging of reference count problems.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Alex Elder -
Reference count the per-ag structures to ensure that we keep get/put
pairs balanced. Assert that the reference counts are zero at unmount
time to catch leaks. In future, reference counts will enable us to
safely remove perag structures by allowing us to detect when they
are no longer in use.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Alex Elder -
The use of an array for the per-ag structures requires reallocation
of the array when growing the filesystem. This requires locking
access to the array to avoid use after free situations, and the
locking is difficult to get right. To avoid needing to reallocate an
array, change the per-ag structures to an allocated object per ag
and index them using a tree structure.The AGs are always densely indexed (hence the use of an array), but
the number supported is 2^32 and lookups tend to be random and hence
indexing needs to scale. A simple choice is a radix tree - it works
well with this sort of index. This change also removes another
large contiguous allocation from the mount/growfs path in XFS.The growing process now needs to change to only initialise the new
AGs required for the extra space, and as such only needs to
exclusively lock the tree for inserts. The rest of the code only
needs to lock the tree while doing lookups, and hence this will
remove all the deadlocks that currently occur on the m_perag_lock as
it is now an innermost lock. The lock is also changed to a spinlock
from a read/write lock as the hold time is now extremely short.To complete the picture, the per-ag structures will need to be
reference counted to ensure that we don't free/modify them while
they are still in use. This will be done in subsequent patch.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Alex Elder -
xfs_get_perag is really getting the perag that an inode belongs to
based on it's inode number. Convert the use of this function to just
get the perag from a provided ag number. Use this new function to
obtain the per-ag structure when traversing the per AG inode trees
for sync and reclaim.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Alex Elder
12 Dec, 2009
2 commits
-
Stop the flag saving as we never mangle those in the unmount path, and
hide all the weird arguents to the dmapi code inside the
XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Signed-off-by: Alex Elder -
Remove our own STATIC_INLINE macro. For small function inside
implementation files just use STATIC and let gcc inline it, and for
those in headers do the normal static inline - they are all small
enough to be inlined for debug builds, too.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Signed-off-by: Alex Elder
01 Sep, 2009
1 commit
-
A lot more functions could be made static, but they need
forward declarations; this does some easy ones, and also
found a few unused functions in the process.Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Felix Blyakher
08 Jun, 2009
1 commit
-
Kill the quota ops function vector and replace it with direct calls or
stubs in the CONFIG_XFS_QUOTA=n case.Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove
the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
otherwise.This brings us back closer to the way this code worked in IRIX and earlier
Linux versions, but we keep a lot of the more useful factoring of common
code.Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
patch.Reduces the size of the source code by about 250 lines and the size of
XFS module by about 1.5 kilobytes with quotas enabled:text data bss dec hex filename
615957 2960 3848 622765 980ad fs/xfs/xfs.o
617231 3152 3848 624231 98667 fs/xfs/xfs.o.oldFallout:
- xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
the inode locked and xfs_qm_dqattach which does the locking around it,
thus removing XFS_QMOPT_ILOCKED.Signed-off-by: Christoph Hellwig
Reviewed-by: Eric Sandeen
07 Apr, 2009
1 commit
-
Currently xfs_device_flush calls sync_blockdev() which is
a no-op for XFS as all it's metadata is held in a different
address to the one sync_blockdev() works on.Call xfs_sync_inodes() instead to flush all the delayed
allocation blocks out. To do this as efficiently as possible,
do it via two passes - one to do an async flush of all the
dirty blocks and a second to wait for all the IO to complete.
This requires some modification to the xfs-sync_inodes_ag()
flush code to do efficiently.Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
30 Mar, 2009
1 commit
-
With the upcoming v3 inodes the default attroffset needs to be calculated
for each specific inode, so we can't cache it in the superblock anymore.Also replace the assert for wrong inode sizes with a proper error check
also included in non-debug builds. Note that the ENOSYS return for
that might seem odd, but that error is returned by xfs_mount_validate_sb
for all theoretically valid but not supported filesystem geometries.Signed-off-by: Christoph Hellwig
Reviewed-by: Josef 'Jeff' Sipek
29 Mar, 2009
3 commits
-
Signed-off-by: Malcolm Parsons
Reviewed-by: Christoph Hellwig -
With the upcoming v3 inodes the inode data/attr area size needs to be
calculated for each specific inode, so we can't cache it in the superblock
anymore.Signed-off-by: Christoph Hellwig
Reviewed-by: Eric Sandeen
Reviewed-by: Felix Blyakher -
The ino64 mount option adds a fixed offset to 32bit inode numbers
to bring them into the 64bit range. There's no need for this kind
of debug tool given that it's easy to produce real 64bit inode numbers
for testing.Signed-off-by: Christoph Hellwig
Reviewed-by: Eric Sandeen
Reviewed-by: Felix Blyakher
09 Feb, 2009
2 commits
-
Currently we call from the nicely abstracted linux quotaops into a ugly
multiplexer just to split the calls out at the same boundary again.
Rewrite the quota ops handling to remove that obfucation.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner -
xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code
but don't provide any abstraction benefits. This patches removes those
and cleans up the reamaining defintions up a little.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
04 Feb, 2009
1 commit
-
These aren't only unused but also reference a lock that doesn't exist anymore.
Signed-off-by: Christoph Hellwig
Reviewed-by: Felix Blyakher
19 Jan, 2009
1 commit
-
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.Reported-by: Arkadiusz Miskiewicz
Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
16 Jan, 2009
1 commit
-
Remove the last of the macros-defined-to-static-functions.
Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Lachlan McIlroy