10 Jan, 2009

1 commit

  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many cases, a commercial filesystem (e.g. VxFS) has the freeze feature,
    and it is used to get a consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented the ioctls of the freeze feature.
    I think we can take a consistent backup with the following steps.
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.
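
The steps above could be driven from userspace roughly as follows. This is only a sketch: it assumes the freeze and unfreeze ioctls are exposed as FIFREEZE and FITHAW in <linux/fs.h> (the message does not name them), and snapshot_fn is a hypothetical callback standing in for the storage device's snapshot or split operation. Step 4, taking the backup, happens afterwards against the snapshot.

```c
#include <linux/fs.h>    /* FIFREEZE, FITHAW (assumed ioctl names) */
#include <sys/ioctl.h>

/* Hypothetical callback: triggers the storage-level snapshot or split. */
typedef int (*snapshot_fn)(void *ctx);

/*
 * Freeze the filesystem open on mount_fd, run the snapshot step, then thaw.
 * Returns 0 on success, -1 if the freeze, the snapshot, or the thaw failed.
 */
int snapshot_frozen_fs(int mount_fd, snapshot_fn take_snapshot, void *ctx)
{
	int ret;

	if (ioctl(mount_fd, FIFREEZE, 0) != 0)
		return -1;                      /* step 1 failed: nothing to undo */

	ret = take_snapshot(ctx);               /* step 2 */

	if (ioctl(mount_fd, FITHAW, 0) != 0)    /* step 3: always attempted */
		return -1;

	return ret == 0 ? 0 : -1;
}
```

The thaw is attempted even when the snapshot callback fails, so that an error in step 2 does not leave the filesystem frozen.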

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Renamed write_super_lockfs and unlockfs of the super block operations
    to freeze_fs and unfreeze_fs to avoid confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep the current behavior.
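
As an illustration of the type change, here is a reduced, userspace-only sketch of an operations table whose freeze hooks return int. Only freeze_fs and unfreeze_fs come from the message; freeze_super_sketch, always_ok_freeze and failing_freeze are hypothetical names, and the real super_operations table has many more members.

```c
#include <errno.h>
#include <stddef.h>

struct super_block;                     /* opaque in this sketch */

/* Only the two operations discussed here. */
struct super_operations {
	int (*freeze_fs)(struct super_block *sb);
	int (*unfreeze_fs)(struct super_block *sb);
};

/* Hypothetical caller: passes a filesystem's freeze error upwards. */
int freeze_super_sketch(const struct super_operations *ops,
			struct super_block *sb)
{
	if (ops->freeze_fs == NULL)
		return 0;                       /* nothing to do */
	return ops->freeze_fs(sb);              /* 0 or a negative errno */
}

/* Implementation that always succeeds, as described for reiserfs. */
int always_ok_freeze(struct super_block *sb)
{
	(void)sb;
	return 0;
}

/* Implementation whose freeze can fail (the error value is illustrative). */
int failing_freeze(struct super_block *sb)
{
	(void)sb;
	return -EIO;
}
```

With a void return type, the caller in this sketch would have no way to learn that failing_freeze did not actually quiesce the filesystem.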

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     

02 Dec, 2008

1 commit

  • Moving the copy_from_user out of some of the ioctl helpers will
    make it easier for the compat ioctl switch to copy in the right
    struct, then just pass to the underlying helper.

    Also, move common access checks into the helpers themselves,
    and out of the native ioctl switch code, to reduce code
    duplication between native & compat ioctl callers.
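
The pattern described above can be sketched in plain C. Everything here is hypothetical (demo_args, demo_helper and friends are not XFS names), and memcpy stands in for copy_from_user so that the sketch builds in userspace. The point is that each entry point copies in its own struct layout, while the access check lives once in the shared helper.

```c
#include <stdint.h>
#include <string.h>

/* Native layout of a hypothetical ioctl argument. */
struct demo_args {
	uint64_t offset;
	uint64_t length;
};

/* 32-bit layout of the same argument, as a compat caller would pass it. */
struct demo_args32 {
	uint32_t offset;
	uint32_t length;
};

/* Shared helper: works on an already copied-in struct and does the
 * access check itself, so neither caller has to repeat it. */
int demo_helper(int may_write, const struct demo_args *args, uint64_t *end)
{
	if (!may_write)
		return -1;                      /* stands in for an errno */
	*end = args->offset + args->length;
	return 0;
}

/* Native entry point: copy in the native struct, then call the helper. */
int demo_ioctl(int may_write, const void *user_arg, uint64_t *end)
{
	struct demo_args args;

	memcpy(&args, user_arg, sizeof(args));  /* copy_from_user() stand-in */
	return demo_helper(may_write, &args, end);
}

/* Compat entry point: copy in the 32-bit struct, widen it, call the helper. */
int demo_compat_ioctl(int may_write, const void *user_arg, uint64_t *end)
{
	struct demo_args32 args32;
	struct demo_args args;

	memcpy(&args32, user_arg, sizeof(args32));
	args.offset = args32.offset;
	args.length = args32.length;
	return demo_helper(may_write, &args, end);
}
```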

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    sandeen@sandeen.net
     

30 Oct, 2008

2 commits

  • structures.

    Always use the generic xfs_btree_block type instead of the short / long
    structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
    the length of a short / long form block. The rationale for this is that we
    will grow more btree block header variants to support CRCs and other RAS
    information, and always accessing them through the same datatype with
    unions for the short / long form pointers makes implementing this much
    easier.
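
A rough sketch of what a single block type with a union for the sibling pointers might look like. The struct and macro names are made up for illustration, host-endian integer types are used instead of the on-disk big-endian types, and the field layout is an assumption rather than a copy of the XFS header.

```c
#include <stddef.h>
#include <stdint.h>

/* One header type for both forms; only the sibling pointers differ. */
struct btree_block_sketch {
	uint32_t magic;
	uint16_t level;
	uint16_t numrecs;
	union {
		struct { uint32_t leftsib, rightsib; } s;   /* short form */
		struct { uint64_t leftsib, rightsib; } l;   /* long form */
	} u;
};

/* Header length of each form: common fields plus that form's pointers. */
#define SBLOCK_LEN_SKETCH \
	(offsetof(struct btree_block_sketch, u) + \
	 sizeof(((struct btree_block_sketch *)0)->u.s))
#define LBLOCK_LEN_SKETCH \
	(offsetof(struct btree_block_sketch, u) + \
	 sizeof(((struct btree_block_sketch *)0)->u.l))

/* Header length for a btree that uses long (1) or short (0) pointers. */
static inline size_t block_hdr_len(int long_ptrs)
{
	return long_ptrs ? LBLOCK_LEN_SKETCH : SBLOCK_LEN_SKETCH;
}
```

Because sizeof the whole struct always reflects the larger long form, code that needs the real header size has to use the per-form length, which is why separate length defines are useful.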

    SGI-PV: 988146

    SGI-Modid: xfs-linux-melb:xfs-kern:32300a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Donald Douwsma
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     
  • Replace the generic record / key / ptr addressing macros that use cpp
    token pasting with simpler macros that do the job for just one given btree
    type. The new macros lose the cur argument and thus can be used outside
    the core btree code, but also gain an xfs_mount * argument to allow for
    checking the CRC flag in the near future. Note that many of these macros
    aren't actually used in the kernel code, but only in userspace (mostly in
    xfs_repair).

    SGI-PV: 988146

    SGI-Modid: xfs-linux-melb:xfs-kern:32295a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Donald Douwsma
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     

28 Jul, 2008

1 commit

  • Implement ASCII case-insensitive support. Its primary purpose is to
    support existing filesystems, migrated from IRIX, that already use this
    case-insensitive mode. But if you only need ASCII-only case-insensitive
    support (i.e. English only) and will never use another language, then this
    mode is perfectly adequate.

    ASCII-CI is implemented by generating hashes based on lower-case letters
    and doing lower-case compares. It implements a new xfs_nameops vector for
    doing the hashes and comparisons for all filename operations.
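
A minimal sketch of such a name-operations vector, assuming a simple rotate-and-xor hash. The struct, the function names and the hash function are illustrative, not the XFS definitions. What matters is that both operations fold A-Z to lower case first, so names that differ only in ASCII case hash and compare the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Lower-case only the ASCII range A-Z; every other byte passes through. */
static unsigned char ascii_lower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

static uint32_t rol32_sketch(uint32_t v, unsigned s)
{
	return (v << s) | (v >> (32 - s));
}

/* Hash computed over the lower-cased bytes of the name. */
uint32_t ci_hashname(const unsigned char *name, size_t len)
{
	uint32_t hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ascii_lower(name[i]) ^ rol32_sketch(hash, 7);
	return hash;
}

/* Returns 1 if the two names match ignoring ASCII case, 0 otherwise. */
int ci_compname(const unsigned char *a, size_t alen,
		const unsigned char *b, size_t blen)
{
	size_t i;

	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++)
		if (ascii_lower(a[i]) != ascii_lower(b[i]))
			return 0;
	return 1;
}

/* Vector of name operations; a case-sensitive variant would plug in
 * different functions with the same signatures. */
struct nameops_sketch {
	uint32_t (*hashname)(const unsigned char *name, size_t len);
	int (*compname)(const unsigned char *a, size_t alen,
			const unsigned char *b, size_t blen);
};

const struct nameops_sketch ascii_ci_nameops = { ci_hashname, ci_compname };
```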

    To create a filesystem with this CI mode, use: # mkfs.xfs -n version=ci

    SGI-PV: 981516
    SGI-Modid: xfs-linux-melb:xfs-kern:31209a

    Signed-off-by: Barry Naujok
    Signed-off-by: Christoph Hellwig

    Barry Naujok
     

29 Apr, 2008

2 commits

  • On uniprocessor machines, the incore superblock is used for all in memory
    accounting of free blocks. In this situation, changes to the reserved
    block count are accounted twice; once directly and once via
    xfs_mod_incore_sb(). Seeing as the modification on SMP is done via
    xfs_mod_incore_sb(), make this the only update mechanism that UP uses as
    well.

    SGI-PV: 980654
    SGI-Modid: xfs-linux-melb:xfs-kern:30997a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    David Chinner
     
  • Add a new xfs_icsb_sync_counters_locked for the case where m_sb_lock
    is already taken and add a flags argument to xfs_icsb_sync_counters so
    that xfs_icsb_sync_counters_flags is not needed.

    SGI-PV: 976035
    SGI-Modid: xfs-linux-melb:xfs-kern:30917a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     

10 Apr, 2008

1 commit


14 Feb, 2008

1 commit


07 Feb, 2008

1 commit

  • Un-obfuscate XFS_SB_LOCK, remove XFS_SB_LOCK->mutex_lock->spin_lock
    macros, call spin_lock directly, remove extraneous cookie holdover from
    old xfs code, and change lock type to spinlock_t.

    SGI-PV: 970382
    SGI-Modid: xfs-linux-melb:xfs-kern:29746a

    Signed-off-by: Eric Sandeen
    Signed-off-by: Donald Douwsma
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     

16 Oct, 2007

2 commits

  • m_growlock only needs plain binary mutex semantics, so use a struct mutex
    instead of a semaphore for it.

    SGI-PV: 968563
    SGI-Modid: xfs-linux-melb:xfs-kern:29512a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     
  • Now that struct bhv_vfs doesn't have any members left we can kill it and
    go directly from the super_block to the xfs_mount everywhere.

    SGI-PV: 969608
    SGI-Modid: xfs-linux-melb:xfs-kern:29509a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     

15 Oct, 2007

1 commit

  • Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
    incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
    incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
    direction and doesn't need the slightly hacky table-driven approach
    because we only ever read the full sb from disk.

    The handling of shared r/o filesystems has been buggy on little-endian
    systems, and fixing this required shuffling some code around in that
    area.
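
The split between an on-disk and an incore superblock type can be sketched as below. Only three fields are shown, the struct and function names are placeholders, and this to_disk always converts every field, which is a simplification chosen for the sketch. Byte arrays are used for the disk layout so the big-endian order is explicit regardless of the host.

```c
#include <stdint.h>

/* On-disk layout: every field is stored big-endian. */
struct dsb_sketch {
	uint8_t magicnum[4];
	uint8_t blocksize[4];
	uint8_t dblocks[8];
};

/* Incore layout: native integers in host byte order. */
struct sb_sketch {
	uint32_t magicnum;
	uint32_t blocksize;
	uint64_t dblocks;
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/* disk -> incore: the whole superblock is always read, so no field table. */
void sb_from_disk_sketch(struct sb_sketch *to, const struct dsb_sketch *from)
{
	to->magicnum = get_be32(from->magicnum);
	to->blocksize = get_be32(from->blocksize);
	to->dblocks = get_be64(from->dblocks);
}

/* incore -> disk */
void sb_to_disk_sketch(struct dsb_sketch *to, const struct sb_sketch *from)
{
	put_be32(to->magicnum, from->magicnum);
	put_be32(to->blocksize, from->blocksize);
	put_be64(to->dblocks, from->dblocks);
}
```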

    SGI-PV: 968563
    SGI-Modid: xfs-linux-melb:xfs-kern:29477a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Christoph Hellwig
     

14 Jul, 2007

5 commits

  • In media spaces, video is often stored in a frame-per-file format. When
    dealing with uncompressed realtime HD video streams in this format, it is
    crucial that files do not get fragmented and that multiple files are placed
    contiguously on disk.

    When multiple streams are being ingested and played out at the same time,
    it is critical that the filesystem does not cross the streams and
    interleave them together as this creates seek and readahead cache miss
    latency and prevents both ingest and playout from meeting frame rate
    targets.

    This patch set introduces a "stream of files" concept into the allocator to
    place all the data from a single stream contiguously on disk so that RAID
    array readahead can be used effectively. Each additional stream gets
    placed in different allocation groups within the filesystem, thereby
    ensuring that we don't cross any streams. When an AG fills up, we select a
    new AG for the stream that is not in use.

    The core of the functionality is the stream tracking - each inode that we
    create in a directory needs to be associated with the directory's stream.
    Hence every time we create a file, we look up the directory's stream
    object and associate the new file with that object.

    Once we have a stream object for a file, we use the AG that the stream
    object points to for allocations. If we can't allocate in that AG (e.g. it
    is full) we move the entire stream to another AG. Other inodes in the same
    stream are moved to the new AG on their next allocation (i.e. lazy
    update).

    Stream objects are kept in a cache and hold a reference on the inode.
    Hence the inode cannot be reclaimed while there is an outstanding stream
    reference. This means that on unlink we need to remove the stream
    association and we also need to flush all the associations on certain
    events that want to reclaim all unreferenced inodes (e.g. filesystem
    freeze).
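
A toy model of the association and the lazy move, under heavy simplification: a fixed number of AGs, one stream object per directory, and no cache or inode references. All names and the NAGS_SKETCH constant are invented for the sketch. It only aims to show how files follow their directory's stream and how a full AG moves the whole stream to an AG that no other stream is using.

```c
#include <stddef.h>

#define NAGS_SKETCH 4

struct ag_sketch {
	unsigned free_blocks;
	int streams;              /* number of streams currently using this AG */
};

struct stream_sketch {
	int ag;                   /* AG the stream points to, -1 if none yet */
};

struct file_sketch {
	struct stream_sketch *stream;   /* the parent directory's stream */
	int ag;                         /* AG this file last allocated from */
};

/* Pick an AG that no stream uses and that has enough space; -1 if none. */
static int pick_unused_ag(const struct ag_sketch *ags, unsigned want)
{
	for (int i = 0; i < NAGS_SKETCH; i++)
		if (ags[i].streams == 0 && ags[i].free_blocks >= want)
			return i;
	return -1;
}

/* Allocate `want` blocks for a file; returns the AG used, or -1 if no
 * suitable AG exists. */
int stream_alloc(struct ag_sketch *ags, struct file_sketch *f, unsigned want)
{
	struct stream_sketch *st = f->stream;

	if (st->ag < 0 || ags[st->ag].free_blocks < want) {
		int next = pick_unused_ag(ags, want);

		if (next < 0)
			return -1;
		if (st->ag >= 0)
			ags[st->ag].streams--;
		ags[next].streams++;
		st->ag = next;          /* the entire stream moves */
	}
	f->ag = st->ag;                 /* lazy update: seen on next allocation */
	ags[st->ag].free_blocks -= want;
	return f->ag;
}
```

In this model a file that has not allocated since the move still records the old AG, and picks up the new one the next time it calls stream_alloc.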

    SGI-PV: 964469
    SGI-Modid: xfs-linux-melb:xfs-kern:29096a

    Signed-off-by: David Chinner
    Signed-off-by: Barry Naujok
    Signed-off-by: Donald Douwsma
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin
    Signed-off-by: Vlad Apostolov

    David Chinner
     
  • During delayed allocation extent conversion or unwritten extent
    conversion, we need to reserve some blocks for transaction reservations.
    We need to reserve these blocks in case a btree split occurs and we need
    to allocate some blocks.

    Unfortunately, we've only ever reserved the number of data blocks we are
    allocating, so in both the unwritten and delalloc cases the transaction
    reservation can fail with ENOSPC. This is bad because in both cases we
    cannot report the failure to the writing application.

    The fix is two-fold:

    1 - leverage the reserved block infrastructure XFS already
    has to reserve a small pool of blocks by default to allow
    specially marked transactions to dip into when we are at
    ENOSPC.
    Default setting is min(5%, 1024 blocks).

    2 - convert critical transaction reservations to be allowed
    to dip into this pool. Spots changed are delalloc
    conversion, unwritten extent conversion and growing a
    filesystem at ENOSPC.
    This also allows growing the filesystem to succeed at ENOSPC.
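
The default pool size and the "dip into the pool" rule can be sketched like this. The function names are hypothetical, and 5% is computed as an integer division by 20. In can_reserve_sketch, free_blocks is taken to exclude the pool, which is an assumption about the accounting made for this example.

```c
#include <stdint.h>

/* Default reserved pool: min(5% of the data blocks, 1024 blocks). */
uint64_t default_reserved_blocks(uint64_t data_blocks)
{
	uint64_t five_percent = data_blocks / 20;

	return five_percent < 1024 ? five_percent : 1024;
}

/*
 * Can a transaction reserve `want` blocks?  Ordinary transactions only see
 * free_blocks; specially marked ones may also dip into the reserved pool.
 */
int can_reserve_sketch(uint64_t free_blocks, uint64_t pool,
		       uint64_t want, int may_use_pool)
{
	uint64_t avail = free_blocks + (may_use_pool ? pool : 0);

	return want <= avail;
}
```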

    SGI-PV: 964468
    SGI-Modid: xfs-linux-melb:xfs-kern:28865a

    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    David Chinner
     
  • SGI-PV: 963528
    SGI-Modid: xfs-linux-melb:xfs-kern:28856a

    Signed-off-by: Tim Shimmin
    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig

    Tim Shimmin
     
  • When we have a couple of hundred transactions on the fly at once, they all
    typically modify the on disk superblock in some way.
    create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify
    free block counts.

    When these counts are modified in a transaction, they must eventually lock
    the superblock buffer and apply the mods. The buffer then remains locked
    until the transaction is committed into the incore log buffer. The result
    of this is that with enough transactions on the fly the incore superblock
    buffer becomes a bottleneck.

    The result of contention on the incore superblock buffer is that
    transaction rates fall - the more pressure that is put on the superblock
    buffer, the slower things go.

    The key to removing the contention is to not require the superblock fields
    in question to be locked. We do that by not marking the superblock dirty
    in the transaction. IOWs, we modify the incore superblock but do not
    modify the cached superblock buffer. In short, we do not log superblock
    modifications to critical fields in the superblock on every transaction.
    In fact we only do it just before we write the superblock to disk every
    sync period or just before unmount.

    This creates an interesting problem - if we don't log or write out the
    fields in every transaction, then how do the values get recovered after a
    crash? The answer is simple - we keep enough duplicate, logged information
    in other structures that we can reconstruct the correct count after log
    recovery has been performed.

    It is the AGF and AGI structures that contain the duplicate information;
    after recovery, we walk every AGI and AGF and sum their individual
    counters to get the correct value, and we do a transaction into the log to
    correct them. An optimisation of this is that if we have a clean unmount
    record, we know the value in the superblock is correct, so we can avoid
    the summation walk under normal conditions and so mount/recovery times do
    not change under normal operation.

    One wrinkle that was discovered during development was that the blocks
    used in the freespace btrees are never accounted for in the AGF counters.
    This was once a valid optimisation to make; when the filesystem is full,
    the free space btrees are empty and consume no space. Hence when it
    matters, the "accounting" is correct. But that means that when we do the
    AGF summations, we would not have a correct count and xfs_check would
    complain. Hence a new counter was added to track the number of blocks used
    by the free space btrees. This is an *on-disk format change*.
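
A sketch of the post-recovery summation described above. Struct and field names are illustrative, and adding the new btree block counter to the free block total is an assumption drawn from the paragraph above. Other per-AG counters that a real implementation may fold in are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-AG counters as found in the AGI and AGF headers (subset). */
struct ag_counters_sketch {
	uint32_t agi_count;       /* allocated inodes (AGI) */
	uint32_t agi_freecount;   /* free inodes (AGI) */
	uint32_t agf_freeblks;    /* free blocks (AGF) */
	uint32_t agf_btreeblks;   /* blocks used by the free space btrees (AGF) */
};

/* The lazily maintained superblock counters. */
struct sb_counters_sketch {
	uint64_t icount;
	uint64_t ifree;
	uint64_t fdblocks;
};

/*
 * Rebuild the superblock counters from the per-AG headers.  After a clean
 * unmount the superblock values are trusted and the walk is skipped.
 */
void rebuild_sb_counters(struct sb_counters_sketch *sb,
			 const struct ag_counters_sketch *ags, size_t nags,
			 int clean_unmount)
{
	uint64_t icount = 0, ifree = 0, fdblocks = 0;
	size_t i;

	if (clean_unmount)
		return;

	for (i = 0; i < nags; i++) {
		icount += ags[i].agi_count;
		ifree += ags[i].agi_freecount;
		fdblocks += (uint64_t)ags[i].agf_freeblks + ags[i].agf_btreeblks;
	}
	sb->icount = icount;
	sb->ifree = ifree;
	sb->fdblocks = fdblocks;
}
```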

    As a result of this, lazy superblock counters are a mkfs option and at the
    moment on linux there is no way to convert an old filesystem. This is
    possible - xfs_db can be used to twiddle the right bits and then
    xfs_repair will do the format conversion for you. Similarly, you can
    convert backwards as well. At some point we'll add functionality to
    xfs_admin to do the bit twiddling easily....

    SGI-PV: 964999
    SGI-Modid: xfs-linux-melb:xfs-kern:28652a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
     
  • When growing a filesystem we don't check to see if the new size overflows
    the page cache index range, so we can do silly things like grow a
    filesystem past 16TB on a 32-bit machine. Check new filesystem sizes against the
    limits the kernel can support.
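
The arithmetic behind the 16TB figure: with a 32-bit page cache index and 4KB pages, at most 2^32 - 1 pages can be indexed. A hedged sketch of such a check follows. The function name and its parameters are invented here, and the exact boundary a real kernel enforces may differ.

```c
#include <stdint.h>

/*
 * Returns 1 if a filesystem of nblocks blocks (each 1 << blocklog bytes)
 * can be covered by a page cache index of index_bits bits, with pages of
 * 1 << page_shift bytes; returns 0 otherwise.
 * Assumes blocklog <= page_shift.
 */
int fs_size_fits_page_index(uint64_t nblocks, unsigned blocklog,
			    unsigned page_shift, unsigned index_bits)
{
	uint64_t npages = nblocks >> (page_shift - blocklog);
	uint64_t max_index = index_bits >= 64 ? UINT64_MAX
					       : ((uint64_t)1 << index_bits) - 1;

	return npages <= max_index;
}
```

For example, with 4KB blocks and 4KB pages, 2^32 blocks (16TB) does not fit a 32-bit index, while the same size is fine with a 64-bit index.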

    SGI-PV: 957886
    SGI-Modid: xfs-linux-melb:xfs-kern:28563a

    Signed-Off-By: Nathan Scott
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Nathan Scott
     

08 May, 2007

1 commit


10 Feb, 2007

2 commits

  • It makes it incrementally clearer to read the code when the top of a macro
    spaghetti-pile only receives the 3 arguments it uses, rather than 2 extra
    ones which are not used. Also when you start pulling this thread out of
    the sweater (i.e. remove unused args from XFS_BTREE_*_ADDR), a couple
    other third arms etc fall off too. If they're not used in the macro, then
    they sometimes don't need to be passed to the function calling the macro
    either, etc....

    Patch provided by Eric Sandeen (sandeen@sandeen.net).

    SGI-PV: 960197
    SGI-Modid: xfs-linux-melb:xfs-kern:28037a

    Signed-off-by: Eric Sandeen
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     
  • The block reservation mechanism has been broken since the per-cpu
    superblock counters were introduced. Make the block reservation code work
    with the per-cpu counters by syncing the counters, snapshotting the amount
    of available space and then doing a modification of the counter state
    according to the result. Continue in a loop until we either have no space
    available or we reserve some space.

    SGI-PV: 956323
    SGI-Modid: xfs-linux-melb:xfs-kern:27895a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
     

07 Sep, 2006

1 commit

  • The fix for recent ENOSPC deadlocks introduced certain limitations on
    allocations. The fix could cause xfssyncd to loop endlessly if we did not
    leave some space free for the allocator to work correctly. Basically, we
    needed to ensure that we had at least 4 blocks free for an AG free list
    and a block for the inode bmap btree at all times.

    However, this did not take into account the fact that each AG has a free
    list that needs 4 blocks. Hence any filesystem with more than one AG could
    cause oversubscription of free space and make xfssyncd spin forever trying
    to allocate space needed for AG freelists that was not available in the
    AG.

    The following patch reserves space for the free lists in all AGs plus the
    inode bmap btree which prevents oversubscription. It also prevents those
    blocks from being reported as free space (as they can never be used) and
    makes the SMP in-core superblock accounting code and the reserved block
    ioctl respect this requirement.
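
The accounting this describes can be sketched as follows, using the numbers quoted in the message (4 blocks per AG free list, one block for the inode bmap btree). The macro and function names are hypothetical and the constants in the actual patch may differ.

```c
#include <stdint.h>

#define AGFL_RESERVE_SKETCH 4u   /* blocks kept back for each AG free list */
#define BMBT_RESERVE_SKETCH 1u   /* blocks kept back for the inode bmap btree */

/* Blocks that must never be handed out, summed over all AGs. */
uint64_t alloc_set_aside(uint32_t agcount)
{
	return (uint64_t)agcount * AGFL_RESERVE_SKETCH + BMBT_RESERVE_SKETCH;
}

/* Free space as reported to users: the set-aside blocks are hidden. */
uint64_t reported_free_blocks(uint64_t fdblocks, uint32_t agcount)
{
	uint64_t aside = alloc_set_aside(agcount);

	return fdblocks > aside ? fdblocks - aside : 0;
}
```

Reserving for a single AG only, as the earlier fix did, corresponds to alloc_set_aside(1) here, which is where the oversubscription on multi-AG filesystems came from.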

    SGI-PV: 955674
    SGI-Modid: xfs-linux-melb:xfs-kern:26894a

    Signed-off-by: David Chinner
    Signed-off-by: David Chatterton

    David Chinner
     

20 Jun, 2006

1 commit


09 Jun, 2006

3 commits


29 Mar, 2006

1 commit


14 Mar, 2006

1 commit


15 Jan, 2006

1 commit


11 Jan, 2006

1 commit


25 Nov, 2005

1 commit


02 Nov, 2005

5 commits