02 Dec, 2011

1 commit

  • The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
    but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(),
    ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These
    are wrapper macros for ext2_*_bit(), which need to take an unsigned long
    aligned address (though some architectures are able to handle unaligned
    addresses correctly).

    So some 64-bit architectures may not be able to access the dqc_bitmap
    correctly.

    This avoids such unaligned access by using other wrapper functions for
    ext2_*_bit(). The code is taken from fs/ext4/mballoc.c, which also needs to
    handle unaligned bitmap access.

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Joel Becker

    Akinobu Mita
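
The alignment fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch or from fs/ext4/mballoc.c: the helper names are hypothetical, and byte-wise little-endian bit helpers stand in for ext2_*_bit(). The idea is to round the address down to an unsigned long boundary and add the skipped bits to the bit number, so the pair still names the same bit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: round addr down to an unsigned long boundary and
 * advance *bit by the number of bits skipped, so (addr, bit) still names
 * the same bit of a little-endian bitmap.
 */
static void *correct_addr_and_bit(int *bit, void *addr)
{
    uintptr_t skipped = (uintptr_t)addr & (sizeof(unsigned long) - 1);

    *bit += (int)(skipped * 8);
    return (unsigned char *)addr - skipped;
}

/* Byte-wise little-endian bit helpers, standing in for ext2_*_bit(). */
static void le_set_bit(int bit, void *addr)
{
    ((unsigned char *)addr)[bit >> 3] |= (unsigned char)(1u << (bit & 7));
}

static int le_test_bit(int bit, const void *addr)
{
    return (((const unsigned char *)addr)[bit >> 3] >> (bit & 7)) & 1;
}

/* Wrappers that accept a bitmap address with any alignment. */
static void unaligned_set_bit(int bit, void *addr)
{
    addr = correct_addr_and_bit(&bit, addr);
    le_set_bit(bit, addr);
}

static int unaligned_test_bit(int bit, void *addr)
{
    addr = correct_addr_and_bit(&bit, addr);
    return le_test_bit(bit, addr);
}
```

The adjustment only preserves the bit's identity for little-endian bit numbering, which is what the ext2 bit operations use.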
     

01 Jun, 2011

1 commit


29 Mar, 2011

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits)
    Treat writes as new when holes span across page boundaries
    fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS.
    ocfs2/dlm: Move kmalloc() outside the spinlock
    ocfs2: Make the left masklogs compat.
    ocfs2: Remove masklog ML_AIO.
    ocfs2: Remove masklog ML_UPTODATE.
    ocfs2: Remove masklog ML_BH_IO.
    ocfs2: Remove masklog ML_JOURNAL.
    ocfs2: Remove masklog ML_EXPORT.
    ocfs2: Remove masklog ML_DCACHE.
    ocfs2: Remove masklog ML_NAMEI.
    ocfs2: Remove mlog(0) from fs/ocfs2/dir.c
    ocfs2: remove NAMEI from symlink.c
    ocfs2: Remove masklog ML_QUOTA.
    ocfs2: Remove mlog(0) from quota_local.c.
    ocfs2: Remove masklog ML_RESERVATIONS.
    ocfs2: Remove masklog ML_XATTR.
    ocfs2: Remove masklog ML_SUPER.
    ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c
    ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c
    ...

    Fix up trivial conflict in fs/ocfs2/super.c

    Linus Torvalds
     

24 Mar, 2011

1 commit

  • In preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

20 Feb, 2011

1 commit

  • Patch makes use of the hrtimer to track times in ocfs2 lock stats.

    The patch is a bit involved to ensure no additional impact on the memory
    footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.

    A related change was to modify the unit of the max wait time from nanosec to
    microsec allowing us to track max time larger than 4 secs. This change
    necessitated the bumping of the output version in the debugfs file,
    locking_state, from 2 to 3.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
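
The 4 second limit mentioned above follows from simple arithmetic, assuming the max wait time is kept in a 32-bit unsigned field (an assumption; the commit message does not state the field width). A small sketch with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Largest whole number of seconds a 32-bit counter can represent when
 * one second is divided into ticks_per_sec ticks.
 */
static uint32_t max_trackable_seconds(uint32_t ticks_per_sec)
{
    return UINT32_MAX / ticks_per_sec;
}
```

With nanosecond ticks (1000000000 per second) this gives 4 seconds; with microsecond ticks (1000000 per second) it gives 4294 seconds, a little over 71 minutes.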
     

16 Dec, 2010

1 commit

  • Recently, one of our colleagues ran into a problem: if we repeatedly
    write and delete a 32MB file, we eventually get ENOSPC. The
    corresponding bug is 1288.
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288

    The real problem is that although we have freed the clusters,
    they are still in the truncate log, where they accumulate so that
    they can be freed all at once.

    This patch tries to resolve it. If we see -ENOSPC in
    ocfs2_write_begin_no_lock, we check whether the truncate log
    holds enough clusters for our needs. If it does, we flush the
    truncate log at that point and try again. This method was
    inspired by Mark Fasheh. Thanks.

    Cc: Mark Fasheh
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
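
The retry logic described above can be modeled with a toy allocator. This is only a sketch under stated assumptions: the toy_* names and the two-counter model are hypothetical and do not reflect the real ocfs2 data structures.

```c
#include <errno.h>

/* Toy model: clusters are either free or parked in the truncate log. */
struct toy_fs {
    unsigned int free_clusters; /* immediately allocatable */
    unsigned int tl_clusters;   /* freed, but still in the truncate log */
};

static int toy_alloc(struct toy_fs *fs, unsigned int want)
{
    if (fs->free_clusters < want)
        return -ENOSPC;
    fs->free_clusters -= want;
    return 0;
}

/* Flushing the truncate log returns its clusters to the free pool. */
static void toy_flush_truncate_log(struct toy_fs *fs)
{
    fs->free_clusters += fs->tl_clusters;
    fs->tl_clusters = 0;
}

/* On -ENOSPC, flush and retry once, but only if the log can cover the need. */
static int toy_write_begin(struct toy_fs *fs, unsigned int want)
{
    int ret = toy_alloc(fs, want);

    if (ret == -ENOSPC && fs->tl_clusters >= want) {
        toy_flush_truncate_log(fs);
        ret = toy_alloc(fs, want);
    }
    return ret;
}
```

Checking the truncate log size first avoids a pointless flush when the retry could not succeed anyway.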
     

19 Nov, 2010

1 commit

  • Commit 1c66b360fe262 (Change some lock status member in ocfs2_lock_res
    to char.) states that these fields need to be signed due to comparison
    to -1, but only changed the type from unsigned char to char. However,
    whether char is a signed or unsigned type is a compiler option. Change
    these fields to signed char so the code will work with all compilers.

    Signed-off-by: Milton Miller
    Signed-off-by: Joel Becker

    Milton Miller
     

13 Nov, 2010

1 commit

  • Commit 83fd9c7 changed l_level, l_requested and l_blocking of
    ocfs2_lock_res from int to unsigned char. But these fields are
    initialized to -1 (in ocfs2_lock_res_init_common), which
    corresponds to 255 for unsigned char. So the whole dlm lock
    mechanism no longer works, which is a disaster for ocfs2.

    Cc: Goldwyn Rodrigues
    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
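
The failure mode described above is easy to reproduce in isolation. The sketch below uses hypothetical helper names; it stores a value in an unsigned char or a signed char and reports whether the stored field still compares equal to the original int.

```c
/* Value actually held by an unsigned char after storing v in it. */
static int stored_as_unsigned_char(int v)
{
    unsigned char field = (unsigned char)v;

    return field;
}

/* Does an unsigned char field still compare equal to the int it was given? */
static int unsigned_field_matches(int v)
{
    unsigned char field = (unsigned char)v;

    return field == v; /* field is promoted to int before comparing */
}

/* Same question for a signed char field. */
static int signed_field_matches(int v)
{
    signed char field = (signed char)v;

    return field == v;
}
```

For -1, the unsigned field holds 255 and the comparison fails, while the signed field compares equal. Plain char behaves like one or the other depending on the compiler and target, which is why an explicit signed char is the portable choice.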
     

16 Oct, 2010

1 commit


12 Oct, 2010

1 commit

  • Currently, the default behavior of O_DIRECT writes allows
    concurrent writing among nodes to the same file, with no cluster
    coherency guaranteed (no EX lock held). This can leave stale data in
    the cache for buffered reads on other nodes.

    The new mount option offers a choice between two different
    behaviors for O_DIRECT writes:

    * coherency=full, the default value, disallows concurrent O_DIRECT
      writes by taking EX locks.

    * coherency=buffered allows concurrent O_DIRECT writes without an
      EX lock among nodes, which gains high performance at the risk of
      getting stale data on other nodes.

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
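
As an illustration of the two behaviors, here is a minimal sketch of how such an option string might be interpreted. The enum, the function names, and the parsing approach are all hypothetical; the kernel parses mount options with its own token tables.

```c
#include <string.h>

enum toy_coherency {
    TOY_COHERENCY_FULL,     /* take EX locks for O_DIRECT writes */
    TOY_COHERENCY_BUFFERED, /* skip the EX lock, accept stale reads elsewhere */
    TOY_COHERENCY_INVALID
};

/* opt is a single option such as "coherency=full", or NULL if not given. */
static enum toy_coherency toy_parse_coherency(const char *opt)
{
    static const char prefix[] = "coherency=";

    if (opt == NULL)
        return TOY_COHERENCY_FULL; /* full is the default */
    if (strncmp(opt, prefix, sizeof(prefix) - 1) != 0)
        return TOY_COHERENCY_INVALID;
    opt += sizeof(prefix) - 1;
    if (strcmp(opt, "full") == 0)
        return TOY_COHERENCY_FULL;
    if (strcmp(opt, "buffered") == 0)
        return TOY_COHERENCY_BUFFERED;
    return TOY_COHERENCY_INVALID;
}

/* Whether an O_DIRECT write must take the EX cluster lock. */
static int toy_direct_write_needs_ex(enum toy_coherency mode)
{
    return mode == TOY_COHERENCY_FULL;
}
```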
     

10 Oct, 2010

1 commit

  • OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
    both userspace and o2cb cluster stacks. It also allows us to extend cluster
    info to include stack flags.

    This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
    clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
    global heartbeat mode.

    This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
    clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.

    Signed-off-by: Sunil Mushran

    Sunil Mushran
     

08 Oct, 2010

1 commit


10 Sep, 2010

2 commits

  • During orphan scan, if we are slot 0 and we are replaying
    orphan_dir:0001, the general process for every file in this
    dir is:
    1. We iget orphan_dir:0001. Since there is no inode for it,
       we have to create an inode and read it from the disk.
    2. Do the normal work, such as delete_inode, and remove it from
       the dir if that is allowed.
    3. Call iput on orphan_dir:0001 when we are done. In this case,
       since we have no dcache for this inode, i_count will
       reach 0, and VFS will have to call clear_inode. In
       ocfs2_clear_inode we checkpoint the inode, which makes
       ocfs2_cmt and journald begin to work.
    4. We loop back to 1 for the next file.

    So for every deleted file, we have to read the
    orphan dir from the disk and checkpoint the journal. This is very
    time consuming and causes a lot of journal checkpoint I/O.
    A better solution is to hold another reference to these
    inodes in ocfs2_super. Then, if there is no other race among
    nodes (which would make dlmglue checkpoint the inode), clear_inode
    won't be called in step 3, and in step 1 we may only need to
    read the inode the first time. This is a big win for us.

    So this patch caches the system inodes of other slots so
    that we have one more reference to these inodes and avoid
    the extra inode read and journal checkpoint.

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
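
The effect of the extra reference can be seen in a toy reference-counting model. Everything here is hypothetical (the toy_* names, the counters); it only mirrors the iget/iput cycle described above, counting disk reads and checkpoints.

```c
struct toy_inode {
    int i_count;     /* reference count */
    int in_memory;   /* non-zero while the in-memory inode exists */
    int disk_reads;  /* times the inode had to be read from disk */
    int checkpoints; /* times dropping the last reference forced a checkpoint */
};

static void toy_iget(struct toy_inode *inode)
{
    if (!inode->in_memory) {
        inode->disk_reads++;
        inode->in_memory = 1;
    }
    inode->i_count++;
}

static void toy_iput(struct toy_inode *inode)
{
    if (--inode->i_count == 0) {
        inode->checkpoints++; /* stands in for clear_inode + checkpoint */
        inode->in_memory = 0;
    }
}

/*
 * Replay nr_files orphans from dir. When cache_ref is non-zero, one extra
 * reference is taken up front and kept, as a superblock-level cache would.
 */
static void toy_orphan_scan(struct toy_inode *dir, int nr_files, int cache_ref)
{
    int i;

    if (cache_ref)
        toy_iget(dir);
    for (i = 0; i < nr_files; i++) {
        toy_iget(dir);
        /* per-file work would happen here */
        toy_iput(dir);
    }
}
```

Without the cached reference, 100 files cost 100 reads and 100 checkpoints in this model; with it, one read and no checkpoint.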
     
  • Thanks for the comments. I have incorporated them all.

    CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
    Statistics now look like -
    ocfs2_write_ctxt: 2144 - 2136 = 8
    ocfs2_inode_info: 1960 - 1848 = 112
    ocfs2_journal: 168 - 160 = 8
    ocfs2_lock_res: 336 - 304 = 32
    ocfs2_refcount_tree: 512 - 472 = 40

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

06 May, 2010

4 commits

  • The default behavior for directory reservations stays the same, but we add a
    mount option so people can tweak the size of directory reservations
    according to their workloads.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     
  • I have observed that the current size of 8M gives us pretty poor
    fragmentation on multi-threaded workloads which do lots of writes.

    Generally, I can increase the size of local alloc windows and observe a
    marked decrease in fragmentation, even up and beyond window sizes of 512
    megabytes. This makes sense for a couple reasons - larger local alloc means
    more room for reservation windows. On multi-node workloads the larger local
    alloc helps as well because we don't have to do window slides as often.

    Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
    longer used and the comment above it was out of date.

    To test fragmentation, I used a workload which launched 4 threads that did
    4k writes into a series of about 140 alternating files.

    With resv_level=2, and a 4k/4k file system I observed the following average
    fragmentation for various localalloc= parameters:

    localalloc=    avg. fragmentation
    8              48
    32             16
    64             10
    120            7

    On larger cluster sizes, the difference is more dramatic.

    The new default size tops out at 256M, which we'll only get for cluster
    sizes of 32K and above.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     
  • This patch pulls the local alloc sizing code into localalloc.c and provides
    a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
    except that I correctly calculate the maximum local alloc size. The old code
    in ocfs2_parse_options() calculated the max size as:

    ocfs2_local_alloc_size(sb) * 8

    which is correct, in bits. Unfortunately though the option passed in is in
    megabytes. Ultimately, this bug made no real difference - the shrink code
    would catch a too-large size and bring it down to something reasonable.
    Still, it's less than efficient as-is.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
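
The unit mismatch described above can be made concrete. The sketch below assumes one bitmap bit per cluster and uses hypothetical helper names; it converts the bit count that a bitmap of a given byte size can hold into megabytes, the unit of the mount option.

```c
#include <stdint.h>

/* Convert a count of bitmap bits (one bit per cluster) to whole megabytes. */
static uint64_t toy_la_bits_to_megabytes(uint64_t bits,
                                         unsigned int clustersize_bits)
{
    return (bits << clustersize_bits) >> 20;
}

/* Maximum local alloc size in megabytes for a bitmap of la_bytes bytes. */
static uint64_t toy_la_max_megabytes(uint64_t la_bytes,
                                     unsigned int clustersize_bits)
{
    return toy_la_bits_to_megabytes(la_bytes * 8, clustersize_bits);
}
```

For example, a 256-byte bitmap holds 2048 bits; with 4K clusters (clustersize_bits of 12) that is 8 megabytes, not 2048.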
     
  • This patch improves Ocfs2 allocation policy by allowing an inode to
    reserve a portion of the local alloc bitmap for itself. The reserved
    portion (allocation window) is advisory in that other allocation
    windows might steal it if the local alloc bitmap becomes
    full. Otherwise, the reservations are honored and guaranteed to be
    free. When the local alloc window is moved to a different portion of
    the bitmap, existing reservations are discarded.

    Reservation windows are represented internally by a red-black
    tree. Within that tree, each node represents the reservation window of
    one inode. An LRU of active reservations is also maintained. When new
    data is written, we allocate it from the inode's window. When all bits
    in a window are exhausted, we allocate a new one as close to the
    previous one as possible. Should we not find free space, an existing
    reservation is pulled off the LRU and cannibalized.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

22 Apr, 2010

1 commit


13 Apr, 2010

1 commit


24 Mar, 2010

1 commit

  • When the local alloc file changes windows, unused bits are freed back to the
    global bitmap. By definition, those bits cannot be in use by any file. Also,
    the local alloc will never have been able to allocate those bits if they
    were part of a previous truncate. Therefore it makes sense that we should
    clear unused local alloc bits in the undo buffer so that they can be used
    immediately.

    [ Modified to call it ocfs2_release_clusters() -- Joel ]

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

03 Mar, 2010

1 commit

  • Currently we add ioctl cmds/structures for ocfs2 to ocfs2_fs.h,
    which is used to define the ocfs2 on-disk layout. That is a little
    confusing, and the header may quickly become polluted, especially as
    the ocfs2_info_request ioctls grow (and I bet they will).

    As a result, such OCFS2 IOCs need to be placed somewhere other than
    ocfs2_fs.h. A separate ocfs2_ioctl.h is added to hold the ioctl
    structures and definitions, which can also be used from userspace to
    invoke ioctl calls.

    Signed-off-by: Tristan Ye
    Signed-off-by: Joel Becker

    Tristan Ye
     

27 Feb, 2010

2 commits


03 Feb, 2010

1 commit

  • There is a possibility of a livelock in __ocfs2_cluster_lock(). If a node were
    to get an ast for an upconvert request, followed immediately by a bast,
    there is a small window where the fs may downconvert the lock before the
    process requesting the upconvert is able to take the lock.

    This patch adds a new flag to indicate that the upconvert is still in
    progress and that the dc thread should not downconvert it right now.

    Wengang Wang and Joel Becker
    contributed heavily to this patch.

    Reported-by: David Teigland
    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
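
The role of the new flag can be sketched with a small state model. The flag and function names below are hypothetical stand-ins, not the identifiers used in the patch; the point is only that the downconvert thread backs off while an upconvert has been granted but not yet picked up by the requester.

```c
#define TOY_LOCK_BLOCKED             0x1u /* a bast asked for a downconvert */
#define TOY_LOCK_UPCONVERT_FINISHING 0x2u /* ast seen, requester not done yet */

struct toy_lockres {
    unsigned int flags;
};

/* The ast for an upconvert request arrived. */
static void toy_ast_upconvert(struct toy_lockres *res)
{
    res->flags |= TOY_LOCK_UPCONVERT_FINISHING;
}

/* A bast arrived: another node wants the lock at a conflicting level. */
static void toy_bast(struct toy_lockres *res)
{
    res->flags |= TOY_LOCK_BLOCKED;
}

/* The process that requested the upconvert has now taken the lock. */
static void toy_requester_took_lock(struct toy_lockres *res)
{
    res->flags &= ~TOY_LOCK_UPCONVERT_FINISHING;
}

/* Returns 1 if the downconvert thread may downconvert right now. */
static int toy_dc_may_downconvert(const struct toy_lockres *res)
{
    if (!(res->flags & TOY_LOCK_BLOCKED))
        return 0;
    if (res->flags & TOY_LOCK_UPCONVERT_FINISHING)
        return 0; /* try again later, after the requester has the lock */
    return 1;
}
```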
     

25 Dec, 2009

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
    ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c
    ocfs2/trivial: Use proper mask for 2 places in hearbeat.c
    Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink.
    Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode?
    ocfs2: Use FIEMAP_EXTENT_SHARED
    fiemap: Add new extent flag FIEMAP_EXTENT_SHARED
    ocfs2: replace u8 by __u8 in ocfs2_fs.h
    ocfs2: explicit declare uninitialized var in user_cluster_connect()
    ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock()
    ocfs2: return -EAGAIN instead of EAGAIN in dlm
    ocfs2/cluster: Make fence method configurable - v2
    ocfs2: Set MS_POSIXACL on remount
    ocfs2: Make acl use the default
    ocfs2: Always include ACL support

    Linus Torvalds
     

14 Nov, 2009

1 commit


29 Oct, 2009

1 commit

  • Change acl mount option handling to match that of XFS and BTRFS;
    hopefully it is also easier to use now. When the admin does not specify any
    acl mount option, acls are enabled if and only if the filesystem has the
    xattr feature enabled. If the admin specifies the 'acl' mount option, we fail
    the mount if the filesystem does not have the xattr feature and thus acls
    cannot be enabled.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
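
The decision rule described above fits in one small function. This is a sketch with hypothetical names; the third case (an explicit option that disables acls) is an assumption added for completeness and is not described in the commit message.

```c
#include <errno.h>

enum toy_acl_opt {
    TOY_ACL_UNSET, /* admin gave no acl-related option */
    TOY_ACL_ON,    /* admin asked for 'acl' */
    TOY_ACL_OFF    /* assumed: admin explicitly disabled acls */
};

/*
 * Returns 0 and stores the result in *acl_enabled, or -EINVAL when the
 * mount should fail.
 */
static int toy_resolve_acl(enum toy_acl_opt opt, int has_xattr,
                           int *acl_enabled)
{
    switch (opt) {
    case TOY_ACL_UNSET:
        *acl_enabled = (has_xattr != 0);
        return 0;
    case TOY_ACL_ON:
        if (!has_xattr)
            return -EINVAL; /* acls requested but cannot be enabled */
        *acl_enabled = 1;
        return 0;
    case TOY_ACL_OFF:
        *acl_enabled = 0;
        return 0;
    }
    return -EINVAL;
}
```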
     

23 Sep, 2009

3 commits


05 Sep, 2009

5 commits

  • The next step in divorcing metadata I/O management from struct inode is
    to pass struct ocfs2_caching_info to the journal functions. Thus the
    journal locks a metadata cache with the cache io_lock function. It also
    can compare ci_last_trans and ci_created_trans directly.

    This is a large patch because of all the places we change
    ocfs2_journal_access..(handle, inode, ...) to
    ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).

    Signed-off-by: Joel Becker

    Joel Becker
     
  • Similar to ip_last_trans, ip_created_trans tracks the creation of a journal
    managed inode. This specifically tracks what transaction created the
    inode. This is so the code can know if the inode has ever been written
    to disk.

    This behavior is desirable for any journal managed object. We move it
    to struct ocfs2_caching_info as ci_created_trans so that any object
    using ocfs2_caching_info can rely on this behavior.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • We have the read side of metadata caching isolated to struct
    ocfs2_caching_info, now we need the write side. This means the journal
    functions. The journal only does a couple of things with struct inode.

    This change moves the ip_last_trans field onto struct
    ocfs2_caching_info as ci_last_trans. This field tells the journal
    whether a pending journal flush is required.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • We don't really want to cart around too many new fields on the
    ocfs2_caching_info structure. So let's wrap all our access of the
    parent object in a set of operations. One pointer on caching_info, and
    more flexibility to boot.

    Signed-off-by: Joel Becker

    Joel Becker
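
The "one pointer plus a set of operations" idea can be sketched as follows. All names are hypothetical simplifications, not the structures from the patch: the cache carries a single pointer to an operations table, and each kind of parent object supplies callbacks that recover the parent from the embedded cache.

```c
#include <stddef.h>

struct toy_caching_info;

/* Operations through which generic code reaches the parent object. */
struct toy_caching_ops {
    unsigned long (*co_owner)(struct toy_caching_info *ci);
    void (*co_io_lock)(struct toy_caching_info *ci);
    void (*co_io_unlock)(struct toy_caching_info *ci);
};

struct toy_caching_info {
    const struct toy_caching_ops *ci_ops; /* the only parent-related field */
};

/* One possible parent: an inode that embeds the cache. */
struct toy_inode_info {
    unsigned long ip_blkno;
    int ip_io_locked;
    struct toy_caching_info ip_metadata_cache;
};

static struct toy_inode_info *toy_cache_to_inode(struct toy_caching_info *ci)
{
    return (struct toy_inode_info *)((char *)ci -
            offsetof(struct toy_inode_info, ip_metadata_cache));
}

static unsigned long toy_inode_cache_owner(struct toy_caching_info *ci)
{
    return toy_cache_to_inode(ci)->ip_blkno;
}

static void toy_inode_cache_io_lock(struct toy_caching_info *ci)
{
    toy_cache_to_inode(ci)->ip_io_locked = 1;
}

static void toy_inode_cache_io_unlock(struct toy_caching_info *ci)
{
    toy_cache_to_inode(ci)->ip_io_locked = 0;
}

static const struct toy_caching_ops toy_inode_caching_ops = {
    toy_inode_cache_owner,
    toy_inode_cache_io_lock,
    toy_inode_cache_io_unlock,
};

/* Generic code sees only the cache and never touches the inode directly. */
static unsigned long toy_metadata_cache_owner(struct toy_caching_info *ci)
{
    return ci->ci_ops->co_owner(ci);
}
```

A non-inode parent would embed the same toy_caching_info and provide its own operations table, leaving the generic code unchanged.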
     
  • We want to use the ocfs2_caching_info structure in places that are not
    inodes. To do that, it can no longer rely on referencing the inode
    directly.

    This patch moves the flags to ocfs2_caching_info->ci_flags, stores
    pointers to the parent's locks on the ocfs2_caching_info, and renames
    the constants and flags to reflect its independent state.

    Signed-off-by: Joel Becker

    Joel Becker
     

22 Jul, 2009

1 commit

  • In commit ea455f8ab68338ba69f5d3362b342c115bea8e13, we moved the dentry lock
    put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
    can drop references to inodes while they are being invalidated by
    invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
    ending in an infinite loop, "Busy inodes after umount" messages etc.).

    We fix the problem by stopping ocfs2_wq from doing any further releasing of
    inode references on the superblock being unmounted, waiting until it finishes
    the current round of releasing, and finally cleaning up all the references in
    dentry_lock_list from ocfs2_put_super().

    The issue was tracked down by Tao Ma.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     

23 Jun, 2009

2 commits

  • Add lockdep support to OCFS2. The support also covers all of the cluster
    locks except for open locks, journal locks, and local quotafile locks. These
    are special because they are acquired for a node, not for a particular process,
    and lockdep cannot deal with this type of locking.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     
  • Currently if the orphan scan fires a tick before the user issues the umount,
    the umount will wait for the queued orphan scan tasks to complete.

    This patch makes the umount stop the orphan scan as early as possible so as
    to reduce the probability of the queued tasks slowing down the umount.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     

04 Jun, 2009

1 commit

  • It would be nice to know how often we get checksum failures. Even
    better, how many of them we can fix with the single bit ecc. So, we add
    a statistics structure. The structure can be installed into debugfs
    wherever the user wants.

    For ocfs2, we'll put it in the superblock-specific debugfs directory and
    pass it down from our higher-level functions. The stats are only
    registered with debugfs when the filesystem supports metadata ecc.

    Signed-off-by: Joel Becker

    Joel Becker
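
A statistics structure of the kind described above might look like the sketch below. The field and function names are hypothetical, and the debugfs registration is left out; the sketch only shows the three counts one would want: blocks checked, checksum failures, and failures repaired by single-bit ecc.

```c
#include <stdint.h>

struct toy_blockcheck_stats {
    uint64_t checks;     /* blocks validated */
    uint64_t failures;   /* checksum mismatches seen */
    uint64_t recoveries; /* mismatches repaired by single-bit ecc */
};

/*
 * Record the outcome of validating one block. stats may be a null pointer
 * when the caller does not collect statistics.
 */
static void toy_stats_update(struct toy_blockcheck_stats *stats,
                             int crc_ok, int ecc_fixed)
{
    if (!stats)
        return;
    stats->checks++;
    if (crc_ok)
        return;
    stats->failures++;
    if (ecc_fixed)
        stats->recoveries++;
}
```

Passing the structure down from higher-level functions, rather than using a global, keeps the counters per filesystem.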