08 Sep, 2010

1 commit

  • This helps code which needs to know the eventual block number of an inode
    but can't allocate it yet due to transaction or lock ordering. For example,
    ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
    of the orphan dir because it can't yet know where the actual inode is placed
    - that code is actually in ocfs2_mknod_locked(). This is a problem when the
    orphan dirs are indexed, as the junk inode number will create an index entry
    which goes unused (and fails the later removal from the orphan dir). Now
    with these interfaces, ocfs2_create_inode_in_orphan() can run the block
    group search (and get back the inode block number) *before* any actual
    allocation occurs.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Tao Ma

    Mark Fasheh
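A minimal sketch of the "search first, claim later" split described in the commit above. The struct and function names are hypothetical stand-ins, not the ocfs2 interfaces, and a single 64-bit word stands in for a block group bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

#define GROUP_BITS 64

/* Hypothetical, simplified block group: one bit per inode block. */
struct block_group {
    uint64_t start_blkno;   /* block number that bit 0 maps to */
    uint64_t bitmap;        /* set bit = allocated */
};

/* Result of the search phase; the bitmap has not been modified yet. */
struct find_result {
    bool found;
    unsigned bit;
    uint64_t blkno;
};

/* Phase 1: locate a free bit and report the block number it maps to. */
static struct find_result find_free_inode_block(const struct block_group *bg)
{
    struct find_result res = { false, 0, 0 };

    for (unsigned bit = 0; bit < GROUP_BITS; bit++) {
        if (!(bg->bitmap & (UINT64_C(1) << bit))) {
            res.found = true;
            res.bit = bit;
            res.blkno = bg->start_blkno + bit;
            break;
        }
    }
    return res;
}

/* Phase 2: claim the bit found earlier; fails if it is no longer free. */
static bool claim_found_block(struct block_group *bg,
                              const struct find_result *res)
{
    if (!res->found)
        return false;

    uint64_t mask = UINT64_C(1) << res->bit;
    if (bg->bitmap & mask)
        return false;
    bg->bitmap |= mask;
    return true;
}
```

A caller in this model would run phase 1, use `blkno` to prepare its directory entry, and only run phase 2 once its transaction is open.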
     

06 May, 2010

2 commits

  • They all take an ocfs2_alloc_context, which has the allocation inode.

    Signed-off-by: Joel Becker
    Signed-off-by: Tao Ma

    Joel Becker
     
  • This patch improves Ocfs2 allocation policy by allowing an inode to
    reserve a portion of the local alloc bitmap for itself. The reserved
    portion (allocation window) is advisory in that other allocation
    windows might steal it if the local alloc bitmap becomes
    full. Otherwise, the reservations are honored and guaranteed to be
    free. When the local alloc window is moved to a different portion of
    the bitmap, existing reservations are discarded.

    Reservation windows are represented internally by a red-black
    tree. Within that tree, each node represents the reservation window of
    one inode. An LRU of active reservations is also maintained. When new
    data is written, we allocate it from the inodes window. When all bits
    in a window are exhausted, we allocate a new one as close to the
    previous one as possible. Should we not find free space, an existing
    reservation is pulled off the LRU and cannibalized.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
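The reservation scheme in the commit above can be illustrated with a much simpler model. This sketch assumes a small fixed array in place of the red-black tree, a counter in place of the LRU list, and a forward-only search from a goal bit; all names are hypothetical, and the bitmap's own allocated bits are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESV 8

/* One advisory window per inode (hypothetical layout). */
struct resv_window {
    unsigned start, len;
    bool in_use;
    unsigned last_used;     /* larger = more recently used */
};

struct resv_map {
    unsigned bitmap_bits;   /* size of the local alloc bitmap */
    unsigned clock;         /* stand-in for LRU ordering */
    struct resv_window w[MAX_RESV];
};

/* Does [start, start + len) collide with any window other than 'skip'? */
static bool resv_overlaps(const struct resv_map *m, unsigned start,
                          unsigned len, const struct resv_window *skip)
{
    for (size_t i = 0; i < MAX_RESV; i++) {
        const struct resv_window *o = &m->w[i];
        if (o == skip || !o->in_use)
            continue;
        if (start < o->start + o->len && o->start < start + len)
            return true;
    }
    return false;
}

/* Place a window of 'len' bits at or after 'goal', else cannibalize the LRU. */
static bool resv_place(struct resv_map *m, struct resv_window *me,
                       unsigned goal, unsigned len)
{
    for (unsigned start = goal; start + len <= m->bitmap_bits; start++) {
        if (!resv_overlaps(m, start, len, me)) {
            me->start = start;
            me->len = len;
            me->in_use = true;
            me->last_used = ++m->clock;
            return true;
        }
    }

    /* No free gap: take over the least recently used window. */
    struct resv_window *victim = NULL;
    for (size_t i = 0; i < MAX_RESV; i++) {
        struct resv_window *o = &m->w[i];
        if (o == me || !o->in_use)
            continue;
        if (!victim || o->last_used < victim->last_used)
            victim = o;
    }
    if (!victim)
        return false;

    victim->in_use = false;
    me->start = victim->start;
    me->len = victim->len < len ? victim->len : len;
    me->in_use = true;
    me->last_used = ++m->clock;
    return true;
}
```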
     

13 Apr, 2010

1 commit


26 Mar, 2010

1 commit


24 Mar, 2010

1 commit

  • When the local alloc file changes windows, unused bits are freed back to the
    global bitmap. By definition, those bits cannot be in use by any file. Also,
    the local alloc will never have been able to allocate those bits if they
    were part of a previous truncate. Therefore it makes sense that we should
    clear unused local alloc bits in the undo buffer so that they can be used
    immediately.

    [ Modified to call it ocfs2_release_clusters() -- Joel ]

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
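To make the undo-buffer point in the commit above concrete, here is a simplified model that assumes a bit is allocatable only when it is clear in both the live bitmap and an undo copy. The names are hypothetical and a 64-bit word stands in for each bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical group bitmap with a journal "undo" copy. */
struct group_bitmap {
    uint64_t live;  /* current state */
    uint64_t undo;  /* state that still blocks reuse until commit */
};

/* A bit may be handed out only if it is clear in both copies. */
static bool bit_allocatable(const struct group_bitmap *g, unsigned bit)
{
    uint64_t mask = UINT64_C(1) << bit;
    return !(g->live & mask) && !(g->undo & mask);
}

/* Ordinary free (e.g. truncate): the undo copy keeps the bit reserved. */
static void free_bit(struct group_bitmap *g, unsigned bit)
{
    g->live &= ~(UINT64_C(1) << bit);
}

/* Release of never-used bits: clear both copies so reuse is immediate. */
static void release_bit(struct group_bitmap *g, unsigned bit)
{
    uint64_t mask = UINT64_C(1) << bit;
    g->live &= ~mask;
    g->undo &= ~mask;
}
```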
     

27 Feb, 2010

1 commit

  • This patch adds an extent block (metadata) stealing mechanism for
    extent allocation. This mechanism is the same as inode stealing:
    if there is no room in the slot-specific extent_alloc, we will try to
    allocate an extent block from the next slot.

    Signed-off-by: Tiger Yang
    Acked-by: Tao Ma
    Signed-off-by: Joel Becker

    Tiger Yang
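The stealing idea in the commit above can be sketched as a walk over slots starting from the local one. This is an illustrative model only: the per-slot counters, `MAX_SLOTS`, and the wrap-around order are assumptions, not the ocfs2 implementation.

```c
#define MAX_SLOTS 4

/* Hypothetical model: free extent blocks left in each slot's extent_alloc. */
struct cluster_state {
    unsigned free_extent_blocks[MAX_SLOTS];
};

/*
 * Try the caller's own slot first, then each following slot in turn.
 * Returns the slot the block came from, or -1 if every slot is full.
 */
static int alloc_extent_block(struct cluster_state *cs, int my_slot)
{
    for (int i = 0; i < MAX_SLOTS; i++) {
        int slot = (my_slot + i) % MAX_SLOTS;

        if (cs->free_extent_blocks[slot] > 0) {
            cs->free_extent_blocks[slot]--;
            return slot;
        }
    }
    return -1;
}
```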
     

04 Apr, 2009

2 commits

  • For NFS exporting, ocfs2_get_dentry() returns the dentry for a file handle.
    ocfs2_get_dentry() may read from disk when the inode is not in memory,
    without any cross-cluster lock. This leads to the file system loading a
    stale inode.

    This patch fixes the above problem.

    The solution is that when the inode is not in memory, we take the cluster
    lock (PR) of the alloc inode from which the inode in question was allocated
    (this causes the node on which the deletion was done to sync the alloc
    inode) before reading out the inode itself. Then we check the bitmap in the
    group (the one the inode in question was allocated from) to see if the bit
    is clear. If it's clear, then the inode is stale. If the bit is set, we then
    check the generation as the existing code does.

    We have to read out the inode in question from disk first to know its alloc
    slot and alloc bit. If it's not stale, we read it out using ocfs2_iget().
    The second read should then be from cache.

    We also have to add a per-superblock nfs_sync_lock to cover the lock for the
    alloc inode and the lock for the inode in question. This is because
    ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
    nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode in
    ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
    concurrently in the normal case.

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    wengang wang
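The order of checks in the commit above (allocation bit first, then generation) can be sketched as follows. Cluster locking is left out entirely, and the types and names are hypothetical simplifications.

```c
#include <stdint.h>

enum fh_status { FH_OK, FH_STALE };

/* Hypothetical stand-ins for the alloc group and the on-disk inode. */
struct alloc_group {
    uint64_t bitmap;        /* set bit = inode block is allocated */
};

struct disk_inode {
    unsigned alloc_bit;     /* bit in the group this inode came from */
    uint32_t generation;
};

/*
 * Decide whether a file handle still refers to a live inode.
 * Assumes the caller already holds whatever locks make 'grp' current.
 */
static enum fh_status check_handle(const struct alloc_group *grp,
                                   const struct disk_inode *di,
                                   uint32_t fh_generation)
{
    /* Out-of-range or clear bit: the inode has been freed. */
    if (di->alloc_bit >= 64 ||
        !(grp->bitmap & (UINT64_C(1) << di->alloc_bit)))
        return FH_STALE;

    /* Bit is set: the block may have been reused by a newer inode. */
    if (di->generation != fh_generation)
        return FH_STALE;

    return FH_OK;
}
```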
     
  • In ocfs2, the inode block search looks for the "emptiest" inode
    group to allocate from. So if an inode alloc file has many equally
    (or almost equally) empty groups, new inodes will tend to get
    spread out amongst them, which in turn can put them all over the
    disk. This is undesirable because directory operations on conceptually
    "nearby" inodes force a large number of seeks.

    So we add ip_last_used_group to the in-core directory inode, which records
    the last used allocation group. Another field named ip_last_used_slot
    is also added in case inode stealing happens. When claiming a new inode,
    we pass in the directory's inode so that the allocation can use this
    information.
    For more details, please see
    http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
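A rough sketch of the policy change in the commit above: prefer the group the parent directory used last, and fall back to the "emptiest" group only when that hint is unusable. The structures and `pick_group()` are hypothetical; only the idea of a per-directory last-used-group hint comes from the commit message.

```c
#define NGROUPS 4

/* Hypothetical inode allocator: free bits remaining per group. */
struct inode_alloc {
    unsigned free_bits[NGROUPS];
};

/* Hypothetical per-directory hint; -1 means no group used yet. */
struct dir_hint {
    int last_used_group;
};

/* Returns the group allocated from, or -1 if all groups are full. */
static int pick_group(struct inode_alloc *ia, struct dir_hint *dir)
{
    int g = dir->last_used_group;

    if (g < 0 || g >= NGROUPS || ia->free_bits[g] == 0) {
        /* Hint unusable: fall back to the emptiest group. */
        g = -1;
        for (int i = 0; i < NGROUPS; i++) {
            if (ia->free_bits[i] == 0)
                continue;
            if (g < 0 || ia->free_bits[i] > ia->free_bits[g])
                g = i;
        }
        if (g < 0)
            return -1;
    }

    ia->free_bits[g]--;
    dir->last_used_group = g;   /* remember for the next sibling inode */
    return g;
}
```

In this model, once a directory has picked a group it keeps allocating there even if another group later becomes emptier, which is what keeps sibling inodes close together.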
     

06 Jan, 2009

3 commits

  • Add an optional validation hook to ocfs2_read_blocks(). Now the
    validation function is only called when a block was actually read off of
    disk. It is not called when the buffer was in cache.

    We add a buffer state bit BH_NeedsValidate to flag these buffers. It
    must always be one higher than the last JBD2 buffer state bit.

    The dinode, dirblock, extent_block, and xattr_block validators are
    lifted to this scheme directly. The group_descriptor validator needs to
    be split into two pieces. The first part only needs the gd buffer and
    is passed to ocfs2_read_block(). The second part requires the dinode as
    well, and is called every time. It's only 3 compares, so it's tiny.
    This also allows us to clean up the non-fatal gd check used by resize.c.
    It now has no magic argument.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
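The "validate only on a real disk read" behaviour from the commit above can be modelled in a few lines. This sketch uses a plain `uptodate` flag rather than a BH_NeedsValidate state bit, and the counters exist only to make the behaviour observable; every name here is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached buffer. */
struct buffer {
    int data;
    bool uptodate;  /* true once the contents came from disk and passed */
};

typedef int (*validate_fn)(const struct buffer *bh);

static int disk_reads;   /* demo counter: simulated disk reads */
static int validations;  /* demo counter: validator invocations */

/* Read a block; the optional validator runs only after a disk read. */
static int read_block(struct buffer *bh, int on_disk_value,
                      validate_fn validate)
{
    if (bh->uptodate)
        return 0;               /* cache hit: skip validation */

    bh->data = on_disk_value;   /* simulated I/O */
    disk_reads++;

    if (validate) {
        int rc = validate(bh);
        if (rc)
            return rc;          /* leave the buffer not uptodate */
    }
    bh->uptodate = true;
    return 0;
}

/* Example validator: reject negative contents. */
static int validate_nonneg(const struct buffer *bh)
{
    validations++;
    return bh->data < 0 ? -1 : 0;
}
```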
     
  • We have a clean call for validating group descriptors, but every place
    that wants one always does a read_block()+validate() call pair. Create
    a toplevel ocfs2_read_group_descriptor() that does the right
    thing. This allows us to leverage the single call point later for
    fancier handling. We also add validation of gd->bg_generation against
    the superblock and gd->bg_blkno against the block we thought we read.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
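A minimal sketch of a combined read-and-validate helper in the spirit of the commit above, checking the descriptor's generation and block number. The in-memory "disk", the error codes, and the struct layout are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down group descriptor. */
struct group_desc {
    uint32_t bg_generation;
    uint64_t bg_blkno;
};

/* Hypothetical disk: an array of descriptors indexed by block number. */
struct fake_disk {
    const struct group_desc *blocks;
    uint64_t nblocks;
};

/*
 * Single call point: read the descriptor, then check it against the
 * filesystem generation and the block number we asked for.
 * Returns 0 on success and a negative code on failure.
 */
static int read_group_descriptor(const struct fake_disk *disk,
                                 uint32_t fs_generation, uint64_t blkno,
                                 struct group_desc *out)
{
    if (blkno >= disk->nblocks)
        return -1;              /* read failed */

    *out = disk->blocks[blkno];

    if (out->bg_generation != fs_generation)
        return -2;              /* belongs to another filesystem epoch */
    if (out->bg_blkno != blkno)
        return -3;              /* not the block we thought we read */
    return 0;
}
```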
     
  • Currently the validation of group descriptors is directly duplicated so
    that one version can error the filesystem and the other (resize) can
    just report the problem. Consolidate to one function that takes a
    boolean. Wrap that function with the old call for the old users.

    This is in preparation for lifting the read+validate step into a
    single function.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     

14 Oct, 2008

7 commits

  • ocfs2 inode numbers are block numbers. For any filesystem with less
    than 2^32 blocks, this is not a problem. However, when ocfs2 starts
    using JBD2, it will be able to support filesystems with more than 2^32
    blocks. This would result in inode numbers higher than 2^32.

    The problem is that stat(2) can't handle those numbers on 32bit
    machines. The simple solution is to have ocfs2 allocate all inodes
    below that boundary.

    The suballoc code is changed to honor an optional block limit. Only the
    inode suballocator sets that limit - all other allocations stay unlimited.

    The biggest trick is to grow the inode suballocator beneath that limit.
    There's no point in allocating block groups that are above the limit,
    then rejecting their elements later on. We want to prevent the inode
    allocator from ever having block groups above the limit. This involves
    a little gyration with the local alloc code. If the local alloc window
    is above the limit, it signals the caller to try the global bitmap but
    does not disable the local alloc file (which can be used for other
    allocations).

    [ Minor cleanup - removed an ML_NOTICE comment. --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
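The optional block limit from the commit above could look roughly like this in a bit-claiming loop. The names and the "0 means unlimited" convention are assumptions for this sketch, and a 64-bit word stands in for the group bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical group: bit i maps to block start_blkno + i. */
struct group {
    uint64_t start_blkno;
    unsigned nbits;         /* at most 64 in this sketch */
    uint64_t bitmap;        /* set bit = allocated */
};

/*
 * Claim the first free bit whose block number does not exceed
 * 'max_block'. A max_block of 0 means "no limit" (assumed convention).
 */
static bool claim_bit_below(struct group *g, uint64_t max_block,
                            uint64_t *blkno)
{
    for (unsigned bit = 0; bit < g->nbits && bit < 64; bit++) {
        uint64_t candidate = g->start_blkno + bit;

        if (max_block && candidate > max_block)
            break;          /* everything further is above the limit */
        if (g->bitmap & (UINT64_C(1) << bit))
            continue;

        g->bitmap |= UINT64_C(1) << bit;
        *blkno = candidate;
        return true;
    }
    return false;
}
```

Only the inode allocator would pass a non-zero limit in this model; other callers pass 0 and stay unlimited.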
     
  • We now have three different kinds of extent trees in ocfs2: inode data
    (dinode), extended attributes (xattr_tree), and extended attribute
    values (xattr_value). There is a nice abstraction for them,
    ocfs2_extent_tree, but it is hidden in alloc.c. All the calling
    functions have to pick amongst a varied API and pass in type bits and
    often extraneous pointers.

    A better way is to make ocfs2_extent_tree a first-class object.
    Everyone converts their object to an ocfs2_extent_tree via the
    ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
    tree calls to alloc.c.

    This simplifies a lot of callers, improving readability. It also
    provides an easy way to add additional extent tree types, as they only
    need to be defined in alloc.c with an ocfs2_get_*_extent_tree()
    function.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This patch implements storing extended attributes either in the inode or
    in a single external block. We only store EAs in-inode when blocksize > 512
    or that inode block has free space for it. When an EA's value is larger than
    80 bytes, we store the value via a b-tree outside the inode or block.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
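The value-size rule in the commit above reduces to a single threshold check. A tiny sketch follows; the constant and enum names are hypothetical, and only the 80-byte cutoff comes from the commit message.

```c
#include <stddef.h>

/* Cutoff taken from the commit message; the macro name is made up. */
#define XATTR_INLINE_VALUE_MAX 80

enum xattr_value_place {
    XATTR_VALUE_INLINE,     /* stored next to the entry */
    XATTR_VALUE_IN_BTREE    /* stored in clusters reached via a b-tree */
};

/* Values larger than 80 bytes go to the b-tree; the rest stay inline. */
static enum xattr_value_place xattr_value_placement(size_t value_len)
{
    return value_len > XATTR_INLINE_VALUE_MAX ? XATTR_VALUE_IN_BTREE
                                              : XATTR_VALUE_INLINE;
}
```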
     
  • Add some thin wrappers around ocfs2_insert_extent() for each of the 3
    different btree types: ocfs2_dinode_insert_extent(),
    ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
    last is for the xattr index btree, which will be used in a followup patch.

    All the old callers in file.c etc. will call ocfs2_dinode_insert_extent(),
    while the other two handle the xattr case. Initialization of the extent
    tree is also handled by these functions.

    When storing an xattr value which is too large, we will allocate some
    clusters for it, and here ocfs2_extent_list and ocfs2_extent_rec will also
    be used. In order to re-use the b-tree operation code, a new parameter named
    "private" is added into ocfs2_extent_tree and it is used to indicate the
    root of the ocfs2_extent_list. The reason is that we can't deduce the root
    from the buffer_head now. It may be in an inode, an ocfs2_xattr_block or,
    even worse, in any place in an ocfs2_xattr_bucket.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • In the old extent tree operations, we assumed that we
    are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
    As xattr will also use ocfs2_extent_list to store a large value
    for an xattr entry, we refactor the tree operations so that xattr
    can use them directly.

    The refactoring includes 4 steps:
    1. Abstract set/get of last_eb_blk and update_clusters since they may
    be stored in different location for dinode and xattr.
    2. Add a new structure named ocfs2_extent_tree to indicate the
    extent tree the operation will work on.
    3. Remove all the use of fe_bh and di, use root_bh and root_el in
    extent tree instead. So now all the fe_bh is replaced with
    et->root_bh, el with root_el accordingly.
    4. Make ocfs2_lock_allocators generic. Now it is limited to be only used
    in file extend allocation. But the whole function is useful when we want
    to store large EAs.

    Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
    for anything other than truncate inode data btrees.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
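Steps 1 and 2 of the refactoring above (abstracted last_eb_blk accessors plus a tree structure that names its root) can be illustrated with a small operations table. Everything below is a hypothetical simplification: the field layout, the function names, and the `object` pointer are stand-ins rather than the ocfs2 definitions.

```c
#include <stdint.h>

/* Trimmed-down stand-ins for the on-disk structures. */
struct extent_list { uint16_t next_free_rec; };
struct dinode { uint64_t last_eb_blk; struct extent_list el; };
struct xattr_value_root { uint64_t last_eb_blk; struct extent_list el; };

struct extent_tree;

/* Per-root-type accessors, since the field lives in different places. */
struct extent_tree_ops {
    void (*set_last_eb_blk)(struct extent_tree *et, uint64_t blkno);
    uint64_t (*get_last_eb_blk)(const struct extent_tree *et);
};

struct extent_tree {
    const struct extent_tree_ops *ops;
    struct extent_list *root_el;    /* root of the tree */
    void *object;                   /* owning dinode or xattr value root */
};

static void di_set(struct extent_tree *et, uint64_t blkno)
{ ((struct dinode *)et->object)->last_eb_blk = blkno; }
static uint64_t di_get(const struct extent_tree *et)
{ return ((const struct dinode *)et->object)->last_eb_blk; }
static void xv_set(struct extent_tree *et, uint64_t blkno)
{ ((struct xattr_value_root *)et->object)->last_eb_blk = blkno; }
static uint64_t xv_get(const struct extent_tree *et)
{ return ((const struct xattr_value_root *)et->object)->last_eb_blk; }

static const struct extent_tree_ops dinode_et_ops = { di_set, di_get };
static const struct extent_tree_ops xattr_value_et_ops = { xv_set, xv_get };

static void init_dinode_extent_tree(struct extent_tree *et, struct dinode *di)
{ et->ops = &dinode_et_ops; et->root_el = &di->el; et->object = di; }

static void init_xattr_value_extent_tree(struct extent_tree *et,
                                         struct xattr_value_root *xv)
{ et->ops = &xattr_value_et_ops; et->root_el = &xv->el; et->object = xv; }

/* Generic tree code only ever goes through these two helpers. */
static void et_set_last_eb_blk(struct extent_tree *et, uint64_t blkno)
{ et->ops->set_last_eb_blk(et, blkno); }
static uint64_t et_get_last_eb_blk(const struct extent_tree *et)
{ return et->ops->get_last_eb_blk(et); }
```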
     
  • ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
    ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
    they are all limited to an inode btree because they use a struct
    ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
    (the part of an ocfs2_dinode they actually use) so that the xattr btree code
    can use these functions.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Ocfs2's local allocator disables itself for the duration of a mount point
    when it has trouble allocating a large enough area from the primary bitmap.
    That can cause performance problems, especially for disks which were only
    temporarily full or fragmented. This patch allows for the allocator to
    shrink its window first, before being disabled. Later, it can also be
    re-enabled so that any performance drop is minimized.

    To do this, we allow the value of osb->local_alloc_bits to be shrunk when
    needed. The default value is recorded in a mostly read-only variable so that
    we can re-initialize when required.

    Locking had to be updated so that we could protect changes to
    local_alloc_bits. Mostly this involves protecting various local alloc values
    with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
    is used when the local allocator has shrunk, but is not disabled. If the
    available space dips below 1 megabyte, the local alloc file is disabled. In
    either case, local alloc is re-enabled 30 seconds after the event, or when
    an appropriate amount of bits is seen in the primary bitmap.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
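The shrink-then-disable behaviour described above can be sketched as a small state machine. Halving the window on each failure is an assumption of this sketch (the commit message does not say how the window shrinks), `min_bits` stands in for the 1 megabyte floor, and locking and the 30 second timer are omitted. All names are hypothetical.

```c
enum la_state { LA_ENABLED, LA_THROTTLED, LA_DISABLED };

/* Hypothetical local alloc bookkeeping. */
struct local_alloc {
    enum la_state state;
    unsigned bits;          /* current window size */
    unsigned default_bits;  /* value to restore on re-enable */
    unsigned min_bits;      /* below this, give up and disable */
};

/* Called when the primary bitmap could not supply a window of 'bits'. */
static void la_alloc_failed(struct local_alloc *la)
{
    if (la->bits / 2 >= la->min_bits) {
        la->bits /= 2;              /* assumed shrink policy: halve */
        la->state = LA_THROTTLED;
    } else {
        la->state = LA_DISABLED;
    }
}

/* Called after the timeout, or once enough free bits are seen again. */
static void la_reenable(struct local_alloc *la)
{
    la->bits = la->default_bits;
    la->state = LA_ENABLED;
}
```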
     

18 Apr, 2008

1 commit

  • In inode stealing, we no longer restrict the allocation to
    happen in the local node. So it is necessary for us to add
    a new member in ocfs2_alloc_context to indicate which slot
    we are using for allocation. We also modify the process of
    local alloc so that this member can be used there also.

    Signed-off-by: Tao Ma
    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Tao Ma
     

26 Jan, 2008

1 commit

  • This patch adds the ability for a userspace program to request an extension
    of the last cluster group on an Ocfs2 file system. The request is made via
    ioctl, OCFS2_IOC_GROUP_EXTEND. This is derived from EXT3_IOC_GROUP_EXTEND,
    but is obviously Ocfs2 specific.

    tunefs.ocfs2 would call this for an online-resize operation if the last
    cluster group isn't full.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

21 Sep, 2007

1 commit

  • The ocfs2 write code loops through a page much like the block code, except
    that ocfs2 allocation units can be any size, including larger than page
    size. Typically it's equal to or larger than page size - most kernels run 4k
    pages, the minimum ocfs2 allocation (cluster) size.

    Some changes introduced during 2.6.23 changed the way writes to pages are
    handled, and inadvertently broke support for > 4k page size. Instead of just
    writing one cluster at a time, we now handle the whole page in one pass.

    This means that multiple (small) separate allocations might happen in the
    same pass. The allocation code however typically optimizes by getting the
    maximum which was reserved. This triggered a BUG_ON in the extend code where
    it'd ask for a single bit (for one part of a > 4k page) and get back more
    than it asked for.

    Fix this by providing a variant of the high level allocation function which
    allows the caller to specify a maximum. The traditional function remains and
    just calls the new one with a maximum determined from the initial
    reservation.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
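The fix described above, a capped variant plus a wrapper that keeps the old behaviour, might be sketched like this. The context struct, the `contig_free` parameter (how many contiguous free bits the search found), and the function names are hypothetical.

```c
/* Hypothetical allocation context: what was reserved and handed out. */
struct alloc_context {
    unsigned bits_wanted;
    unsigned bits_given;
};

/*
 * Claim between min_bits and max_bits from a free run of 'contig_free'
 * bits, never exceeding what is left of the reservation.
 * Returns 0 and stores the count in *num, or -1 if min_bits can't be met.
 */
static int claim_clusters_max(struct alloc_context *ac, unsigned contig_free,
                              unsigned min_bits, unsigned max_bits,
                              unsigned *num)
{
    unsigned left = ac->bits_wanted - ac->bits_given;
    unsigned n = contig_free;

    if (n > max_bits)
        n = max_bits;
    if (n > left)
        n = left;
    if (n < min_bits)
        return -1;

    ac->bits_given += n;
    *num = n;
    return 0;
}

/* Traditional entry point: the maximum is whatever is still reserved. */
static int claim_clusters(struct alloc_context *ac, unsigned contig_free,
                          unsigned min_bits, unsigned *num)
{
    return claim_clusters_max(ac, contig_free, min_bits,
                              ac->bits_wanted - ac->bits_given, num);
}
```

A caller handling one small piece of a large page would pass `max_bits` of 1 and be sure of getting back exactly one bit.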
     

11 Jul, 2007

2 commits


02 Dec, 2006

2 commits


08 Aug, 2006

1 commit

  • Record the most recently used allocation group on the allocation context, so
    that subsequent allocations can attempt to optimize for contiguousness.
    Local alloc especially should benefit from this as the current chain search
    tends to let it spew across the disk.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

04 Jan, 2006

1 commit