12 Oct, 2011

2 commits


07 Mar, 2011

1 commit


22 Feb, 2011

1 commit

  • The FSGEOMETRY_V1 ioctl (and its compat equivalent) calls out to
    xfs_fs_geometry() with a version number of 3. This code path does not
    fill in the logsunit member of the passed xfs_fsop_geom_t, leading to
    the leaking of four bytes of uninitialized stack data to potentially
    unprivileged callers.

    v2 switches to memset() to avoid future issues if structure members
    change, on suggestion of Dave Chinner.

    Signed-off-by: Dan Rosenberg
    Reviewed-by: Eugene Teo
    Signed-off-by: Alex Elder

    Dan Rosenberg
     

12 Jan, 2011

1 commit

  • To ensure the log is covered and the filesystem idles correctly, we
    need to ensure that dummy transactions hit the disk and do not stay
    pinned in memory. If the superblock is pinned in memory, it can't
    be flushed so the log covering cannot make progress. The result is
    dependent on timing - more oftent han not we continue to issues a
    log covering transaction every 36s rather than idling after ~90s.

    Fix this by making the log covering transaction synchronous. To
    avoid additional log force from xfssyncd, make the log covering
    transaction take the place of the existing log force in the xfssyncd
    background sync process.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

04 Jan, 2011

1 commit

  • Currently the size of the speculative preallocation during delayed
    allocation is fixed by either the allocsize mount option of a
    default size. We are seeing a lot of cases where we need to
    recommend using the allocsize mount option to prevent fragmentation
    when buffered writes land in the same AG.

    Rather than using a fixed preallocation size by default (up to 64k),
    make it dynamic by basing it on the current inode size. That way the
    EOF preallocation will increase as the file size increases. Hence
    for streaming writes we are much more likely to get large
    preallocations exactly when we need it to reduce fragementation.

    For default settings, the size of the initial extents is determined
    by the number of parallel writers and the amount of memory in the
    machine. For 4GB RAM and 4 concurrent 32GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
    1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
    2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
    3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
    4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
    5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
    6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
    7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088

    and for 16 concurrent 16GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
    1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
    2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
    3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
    4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
    5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
    6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
    7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208

    Because it is hard to take back specualtive preallocation, cases
    where there are large slow growing log files on a nearly full
    filesystem may cause premature ENOSPC. Hence as the filesystem nears
    full, the maximum dynamic prealloc size іs reduced according to this
    table (based on 4k block size):

    freespace max prealloc size
    >5% full extent (8GB)
    4-5% 2GB (8GB >> 2)
    3-4% 1GB (8GB >> 3)
    2-3% 512MB (8GB >> 4)
    1-2% 256MB (8GB >> 5)
    > 6)

    This should reduce the amount of space held in speculative
    preallocation for such cases.

    The allocsize mount option turns off the dynamic behaviour and fixes
    the prealloc size to whatever the mount option specifies. i.e. the
    behaviour is unchanged.

    Signed-off-by: Dave Chinner

    Dave Chinner
     

19 Oct, 2010

2 commits

  • Export xfs_icsb_modify_counters and always use it for modifying
    the per-cpu counters. Remove support for per-cpu counters from
    xfs_mod_incore_sb to simplify it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • When we are checking we can access the last block of each device, we
    do not need to use cached buffers as they will be tossed away
    immediately. Use uncached buffers for size checks so that all IO
    prior to full in-memory structure initialisation does not use the
    buffer cache.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

24 Aug, 2010

1 commit

  • When we need to cover the log, we issue dummy transactions to ensure
    the current log tail is on disk. Unfortunately we currently use the
    root inode in the dummy transaction, and the act of committing the
    transaction dirties the inode at the VFS level.

    As a result, the VFS writeback of the dirty inode will prevent the
    filesystem from idling long enough for the log covering state
    machine to complete. The state machine gets stuck in a loop issuing
    new dummy transactions to cover the log and never makes progress.

    To avoid this problem, the dummy transactions should not cause
    externally visible state changes. To ensure this occurs, make sure
    that dummy transactions log an unchanging field in the superblock as
    it's state is never propagated outside the filesystem. This allows
    the log covering state machine to complete successfully and the
    filesystem now correctly enters a fully idle state about 90s after
    the last modification was made.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

27 Jul, 2010

3 commits

  • Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
    joining it to a transaction via xfs_trans_ijoin.

    This patches instead makes xfs_trans_ijoin usable on it's own by doing
    an implicity xfs_trans_ihold, which also allows us to drop the third
    argument. For the case where we want to hold a reference on the inode
    a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
    the inode for needing an xfs_iput. In addition to the cleaner interface
    to the caller this also simplifies the implementation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast pathes of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
    support in mainline at least the namespace events can be done much saner
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

16 Jan, 2010

1 commit

  • The use of an array for the per-ag structures requires reallocation
    of the array when growing the filesystem. This requires locking
    access to the array to avoid use after free situations, and the
    locking is difficult to get right. To avoid needing to reallocate an
    array, change the per-ag structures to an allocated object per ag
    and index them using a tree structure.

    The AGs are always densely indexed (hence the use of an array), but
    the number supported is 2^32 and lookups tend to be random and hence
    indexing needs to scale. A simple choice is a radix tree - it works
    well with this sort of index. This change also removes another
    large contiguous allocation from the mount/growfs path in XFS.

    The growing process now needs to change to only initialise the new
    AGs required for the extra space, and as such only needs to
    exclusively lock the tree for inserts. The rest of the code only
    needs to lock the tree while doing lookups, and hence this will
    remove all the deadlocks that currently occur on the m_perag_lock as
    it is now an innermost lock. The lock is also changed to a spinlock
    from a read/write lock as the hold time is now extremely short.

    To complete the picture, the per-ag structures will need to be
    reference counted to ensure that we don't free/modify them while
    they are still in use. This will be done in subsequent patch.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. In Documentation/trace/events.txt
    all this is desctribed in more detail. To reads the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows a very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads git various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also the btree tracing isn't included at all yet, as it will require
    additional core tracing features not in mainline yet, I plan to
    deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

12 Dec, 2009

2 commits

  • Currently the low-level buffer cache interfaces are highly confusing
    as we have a _flags variant of each that does actually respect the
    flags, and one without _flags which has a flags argument that gets
    ignored and overriden with a default set. Given that very few places
    use the default arguments get rid of the duplication and convert all
    callers to pass the flags explicitly. Also remove the now confusing
    _flags postfix.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • When completing I/O requests we must not allow the memory allocator to
    recurse into the filesystem, as we might deadlock on waiting for the
    I/O completion otherwise. The only thing currently allocating normal
    GFP_KERNEL memory is the allocation of the transaction structure for
    the unwritten extent conversion. Add a memflags argument to
    _xfs_trans_alloc to allow controlling the allocator behaviour.

    Signed-off-by: Christoph Hellwig
    Reported-by: Thomas Neumann
    Tested-by: Thomas Neumann
    Reviewed-by: Alex Elder
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

12 Aug, 2009

1 commit


02 Jun, 2009

1 commit

  • In the case where growing a filesystem would leave the last AG
    too small, the fixup code has an overflow in the calculation
    of the new size with one fewer ag, because "nagcount" is a 32
    bit number. If the new filesystem has > 2^32 blocks in it
    this causes a problem resulting in an EINVAL return from growfs:

    # xfs_io -f -c "truncate 19998630180864" fsfile
    # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
    # mount -o loop fsfile /mnt
    # xfs_growfs /mnt

    meta-data=/dev/loop0 isize=256 agcount=52,
    agsize=76288719 blks
    = sectsz=512 attr=2
    data = bsize=4096 blocks=3905982455, imaxpct=5
    = sunit=0 swidth=0 blks
    naming =version 2 bsize=4096 ascii-ci=0
    log =internal bsize=4096 blocks=32768, version=2
    = sectsz=512 sunit=0 blks, lazy-count=0
    realtime =none extsz=4096 blocks=0, rtextents=0
    xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

    Reported-by: richard.ems@cape-horn-eng.com
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Eric Sandeen
     

29 Mar, 2009

1 commit


10 Jan, 2009

1 commit

  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many case, a commercial filesystem (e.g. VxFS) has the freeze feature
    and it would be used to get the consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented the ioctls of the freeze feature.
    I think we can take the consistent backup with the following steps.
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Rename write_super_lockfs and unlockfs of the super block operation
    freeze_fs and unfreeze_fs to avoid a confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep a current behavior.

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     

02 Dec, 2008

1 commit

  • Moving the copy_from_user out of some of the ioctl helpers will
    make it easier for the compat ioctl switch to copy in the right
    struct, then just pass to the underlying helper.

    Also, move common access checks into the helpers themselves,
    and out of the native ioctl switch code, to reduce code
    duplication between native & compat ioctl callers.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    sandeen@sandeen.net
     

30 Oct, 2008

2 commits

  • structures.

    Always use the generic xfs_btree_block type instead of the short / long
    structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
    the length of a short / long form block. The rationale for this is that we
    will grow more btree block header variants to support CRCs and other RAS
    information, and always accessing them through the same datatype with
    unions for the short / long form pointers makes implementing this much
    easier.

    SGI-PV: 988146

    SGI-Modid: xfs-linux-melb:xfs-kern:32300a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Donald Douwsma
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     
  • Replace the generic record / key / ptr addressing macros that use cpp
    token pasting with simpler macros that do the job for just one given btree
    type. The new macros lose the cur argument and thus can be used outside
    the core btree code, but also gain an xfs_mount * argument to allow for
    checking the CRC flag in the near future. Note that many of these macros
    aren't actually used in the kernel code, but only in userspace (mostly in
    xfs_repair).

    SGI-PV: 988146

    SGI-Modid: xfs-linux-melb:xfs-kern:32295a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Donald Douwsma
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     

28 Jul, 2008

1 commit

  • Implement ASCII case-insensitive support. It's primary purpose is for
    supporting existing filesystems that already use this case-insensitive
    mode migrated from IRIX. But, if you only need ASCII-only case-insensitive
    support (ie. English only) and will never use another language, then this
    mode is perfectly adequate.

    ASCII-CI is implemented by generating hashes based on lower-case letters
    and doing lower-case compares. It implements a new xfs_nameops vector for
    doing the hashes and comparisons for all filename operations.

    To create a filesystem with this CI mode, use: # mkfs.xfs -n version=ci

    SGI-PV: 981516
    SGI-Modid: xfs-linux-melb:xfs-kern:31209a

    Signed-off-by: Barry Naujok
    Signed-off-by: Christoph Hellwig

    Barry Naujok
     

29 Apr, 2008

2 commits

  • On uniprocessor machines, the incore superblock is used for all in memory
    accounting of free blocks. in this situation, changes to the reserved
    block count are accounted twice; once directly and once via
    xfs_mod_incore_sb(). Seeing as the modification on SMP is done via
    xfs_mod_incore_sb(), make this the only update mechanism that UP uses as
    well.

    SGI-PV: 980654
    SGI-Modid: xfs-linux-melb:xfs-kern:30997a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    David Chinner
     
  • Add a new xfs_icsb_sync_counters_locked for the case where m_sb_lock
    is already taken and add a flags argument to xfs_icsb_sync_counters so
    that xfs_icsb_sync_counters_flags is not needed.

    SGI-PV: 976035
    SGI-Modid: xfs-linux-melb:xfs-kern:30917a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     

10 Apr, 2008

1 commit


14 Feb, 2008

1 commit


07 Feb, 2008

1 commit

  • Un-obfuscate XFS_SB_LOCK, remove XFS_SB_LOCK->mutex_lock->spin_lock
    macros, call spin_lock directly, remove extraneous cookie holdover from
    old xfs code, and change lock type to spinlock_t.

    SGI-PV: 970382
    SGI-Modid: xfs-linux-melb:xfs-kern:29746a

    Signed-off-by: Eric Sandeen
    Signed-off-by: Donald Douwsma
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     

16 Oct, 2007

2 commits

  • m_growlock only needs plain binary mutex semantics, so use a struct mutex
    instead of a semaphore for it.

    SGI-PV: 968563
    SGI-Modid: xfs-linux-melb:xfs-kern:29512a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     
  • Now that struct bhv_vfs doesn't have any members left we can kill it and
    go directly from the super_block to the xfs_mount everywhere.

    SGI-PV: 969608
    SGI-Modid: xfs-linux-melb:xfs-kern:29509a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     

15 Oct, 2007

1 commit

  • Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
    incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
    incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
    direction and doesn't need the slightly hacky table-driven approach
    because we only ever read the full sb from disk.

    The handling of shared r/o filesystems has been buggy on little endian
    system and fixing this required shuffling around of some code in that
    area.

    SGI-PV: 968563
    SGI-Modid: xfs-linux-melb:xfs-kern:29477a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     

14 Jul, 2007

5 commits

  • In media spaces, video is often stored in a frame-per-file format. When
    dealing with uncompressed realtime HD video streams in this format, it is
    crucial that files do not get fragmented and that multiple files a placed
    contiguously on disk.

    When multiple streams are being ingested and played out at the same time,
    it is critical that the filesystem does not cross the streams and
    interleave them together as this creates seek and readahead cache miss
    latency and prevents both ingest and playout from meeting frame rate
    targets.

    This patch set creates a "stream of files" concept into the allocator to
    place all the data from a single stream contiguously on disk so that RAID
    array readahead can be used effectively. Each additional stream gets
    placed in different allocation groups within the filesystem, thereby
    ensuring that we don't cross any streams. When an AG fills up, we select a
    new AG for the stream that is not in use.

    The core of the functionality is the stream tracking - each inode that we
    create in a directory needs to be associated with the directories' stream.
    Hence every time we create a file, we look up the directories' stream
    object and associate the new file with that object.

    Once we have a stream object for a file, we use the AG that the stream
    object point to for allocations. If we can't allocate in that AG (e.g. it
    is full) we move the entire stream to another AG. Other inodes in the same
    stream are moved to the new AG on their next allocation (i.e. lazy
    update).

    Stream objects are kept in a cache and hold a reference on the inode.
    Hence the inode cannot be reclaimed while there is an outstanding stream
    reference. This means that on unlink we need to remove the stream
    association and we also need to flush all the associations on certain
    events that want to reclaim all unreferenced inodes (e.g. filesystem
    freeze).

    SGI-PV: 964469
    SGI-Modid: xfs-linux-melb:xfs-kern:29096a

    Signed-off-by: David Chinner
    Signed-off-by: Barry Naujok
    Signed-off-by: Donald Douwsma
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin
    Signed-off-by: Vlad Apostolov

    David Chinner
     
  • During delayed allocation extent conversion or unwritten extent
    conversion, we need to reserve some blocks for transactions reservations.
    We need to reserve these blocks in case a btree split occurs and we need
    to allocate some blocks.

    Unfortunately, we've only ever reserved the number of data blocks we are
    allocating, so in both the unwritten and delalloc case we can get ENOSPC
    to the transaction reservation. This is bad because in both cases we
    cannot report the failure to the writing application.

    The fix is two-fold:

    1 - leverage the reserved block infrastructure XFS already
    has to reserve a small pool of blocks by default to allow
    specially marked transactions to dip into when we are at
    ENOSPC.
    Default setting is min(5%, 1024 blocks).

    2 - convert critical transaction reservations to be allowed
    to dip into this pool. Spots changed are delalloc
    conversion, unwritten extent conversion and growing a
    filesystem at ENOSPC.
    This also allows growing the filesytsem to succeed at ENOSPC.

    SGI-PV: 964468
    SGI-Modid: xfs-linux-melb:xfs-kern:28865a

    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    David Chinner
     
  • SGI-PV: 963528
    SGI-Modid: xfs-linux-melb:xfs-kern:28856a

    Signed-off-by: Tim Shimmin
    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig

    Tim Shimmin
     
  • When we have a couple of hundred transactions on the fly at once, they all
    typically modify the on disk superblock in some way.
    create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
    free block counts.

    When these counts are modified in a transaction, they must eventually lock
    the superblock buffer and apply the mods. The buffer then remains locked
    until the transaction is committed into the incore log buffer. The result
    of this is that with enough transactions on the fly the incore superblock
    buffer becomes a bottleneck.

    The result of contention on the incore superblock buffer is that
    transaction rates fall - the more pressure that is put on the superblock
    buffer, the slower things go.

    The key to removing the contention is to not require the superblock fields
    in question to be locked. We do that by not marking the superblock dirty
    in the transaction. IOWs, we modify the incore superblock but do not
    modify the cached superblock buffer. In short, we do not log superblock
    modifications to critical fields in the superblock on every transaction.
    In fact we only do it just before we write the superblock to disk every
    sync period or just before unmount.

    This creates an interesting problem - if we don't log or write out the
    fields in every transaction, then how do the values get recovered after a
    crash? the answer is simple - we keep enough duplicate, logged information
    in other structures that we can reconstruct the correct count after log
    recovery has been performed.

    It is the AGF and AGI structures that contain the duplicate information;
    after recovery, we walk every AGI and AGF and sum their individual
    counters to get the correct value, and we do a transaction into the log to
    correct them. An optimisation of this is that if we have a clean unmount
    record, we know the value in the superblock is correct, so we can avoid
    the summation walk under normal conditions and so mount/recovery times do
    not change under normal operation.

    One wrinkle that was discovered during development was that the blocks
    used in the freespace btrees are never accounted for in the AGF counters.
    This was once a valid optimisation to make; when the filesystem is full,
    the free space btrees are empty and consume no space. Hence when it
    matters, the "accounting" is correct. But that means the when we do the
    AGF summations, we would not have a correct count and xfs_check would
    complain. Hence a new counter was added to track the number of blocks used
    by the free space btrees. This is an *on-disk format change*.

    As a result of this, lazy superblock counters are a mkfs option and at the
    moment on linux there is no way to convert an old filesystem. This is
    possible - xfs_db can be used to twiddle the right bits and then
    xfs_repair will do the format conversion for you. Similarly, you can
    convert backwards as well. At some point we'll add functionality to
    xfs_admin to do the bit twiddling easily....

    SGI-PV: 964999
    SGI-Modid: xfs-linux-melb:xfs-kern:28652a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
     
  • When growing a filesystem we don't check to see if the new size overflows
    the page cache index range, so we can do silly things like grow a
    filesystem page 16TB on a 32bit. Check new filesystem sizes against the
    limits the kernel can support.

    SGI-PV: 957886
    SGI-Modid: xfs-linux-melb:xfs-kern:28563a

    Signed-Off-By: Nathan Scott
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Nathan Scott
     

08 May, 2007

1 commit


10 Feb, 2007

2 commits

  • It makes it incrementally clearer to read the code when the top of a macro
    spaghetti-pile only receives the 3 arguments it uses, rather than 2 extra
    ones which are not used. Also when you start pulling this thread out of
    the sweater (i.e. remove unused args from XFS_BTREE_*_ADDR), a couple
    other third arms etc fall off too. If they're not used in the macro, then
    they sometimes don't need to be passed to the function calling the macro
    either, etc....

    Patch provided by Eric Sandeen (sandeen@sandeen.net).

    SGI-PV: 960197
    SGI-Modid: xfs-linux-melb:xfs-kern:28037a

    Signed-off-by: Eric Sandeen
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     
  • The block reservation mechanism has been broken since the per-cpu
    superblock counters were introduced. Make the block reservation code work
    with the per-cpu counters by syncing the counters, snapshotting the amount
    of available space and then doing a modifcation of the counter state
    according to the result. Continue in a loop until we either have no space
    available or we reserve some space.

    SGI-PV: 956323
    SGI-Modid: xfs-linux-melb:xfs-kern:27895a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner