05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files, so I've called spatch on them manually.

    The only adjustment needed after coccinelle is a revert of the changes
    to the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
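
    To see why the shift conversions above are safe: PAGE_CACHE_SHIFT was
    simply defined as PAGE_SHIFT, so the shift amount is always zero. A
    runnable userspace sketch with the constants stubbed:

    #include <stdio.h>

    #define PAGE_SHIFT       12            /* typical x86-64 value */
    #define PAGE_SIZE        (1UL << PAGE_SHIFT)
    #define PAGE_CACHE_SHIFT PAGE_SHIFT    /* how the kernel defined it */

    int main(void)
    {
            unsigned long long pos = 123456789;

            /* Old style: the extra shift is always by zero... */
            unsigned long long old_index =
                    (pos >> PAGE_CACHE_SHIFT) << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

            /* ...so the simplified form is equivalent. */
            unsigned long long new_index = pos >> PAGE_SHIFT;

            printf("%llu == %llu\n", old_index, new_index);
            return 0;
    }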

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds


19 Dec, 2015

1 commit

  • Before this patch, when function try_rgrp_unlink queued a glock for
    delete_work to reclaim the space, it used the inode glock to do so.
    That's different from the iopen callback, which uses the iopen glock
    for the same purpose. We should be consistent and always use the
    iopen glock. This may also save us reference counting problems with
    the inode glock, since clear_glock does an extra glock_put() for the
    inode glock.

    Signed-off-by: Bob Peterson


15 Dec, 2015

1 commit

  • Before this patch, multi-block reservation structures were allocated
    from a special slab. This patch folds the structure into the gfs2_inode
    structure. The disadvantage is that the gfs2_inode needs more memory,
    even when a file is opened read-only. The advantages are: (a) we don't
    need the special slab and the extra time it takes to allocate and
    deallocate from it. (b) we no longer need to worry that the structure
    exists for things like quota management. (c) This also allows us to
    remove the calls to get_write_access and put_write_access since we
    know the structure will exist.
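
    A sketch of the shape of this change (type and field names are
    stand-ins, not the exact gfs2 layout):

    /* Stand-in for the multi-block reservation structure. */
    struct gfs2_blkreserv { unsigned long long rs_free; };

    /* Before: allocated on demand from a dedicated slab cache. */
    struct gfs2_inode_before {
            struct gfs2_blkreserv *i_res;   /* NULL until first allocation */
    };

    /* After: embedded directly, so it always exists and no slab
     * allocation (or get/put_write_access dance) is needed. */
    struct gfs2_inode_after {
            struct gfs2_blkreserv i_res;
    };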

    Signed-off-by: Bob Peterson


24 Nov, 2015

1 commit

  • This patch basically reverts the majority of patch 5407e24.
    That patch eliminated the gfs2_qadata structure in favor of just
    using the reservations structure. The problem with doing that is that
    it increases the size of the reservations structure. That is not an
    issue until it comes time to fold the reservations structure into the
    inode in memory so we know it's always there. By separating out the
    quota structure again, we aren't punishing the non-quota users by
    making all the inodes bigger, requiring more slab space. This patch
    creates a new slab area to allocate the quota stuff so it's managed
    a little more sanely.

    Signed-off-by: Bob Peterson


17 Nov, 2015

1 commit

  • When gfs2 allocates an inode and its extended attribute block next to
    each other at inode create time, the inode's directory entry indicates
    that in de_rahead. In that case, we can readahead the extended
    attribute block when we read in the inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson


09 Nov, 2015

1 commit


30 Oct, 2015

1 commit

  • Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
    gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
    gl_lockref). Remove that define to make the references to gl_lockref.lock more
    obvious.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson


04 Sep, 2015

2 commits

  • None of these statistics can meaningfully be negative, and the
    numerator for do_div() must have the type u64. The generic
    implementation of do_div() used on some 32-bit architectures asserts
    that, resulting in a compiler error in gfs2_rgrp_congested().
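
    A userspace sketch of the constraint; the do_div() stub below mirrors
    the kernel contract (divide a 64-bit unsigned lvalue in place, return
    the remainder):

    #include <stdint.h>
    #include <stdio.h>

    #define do_div(n, base) ({                           \
            uint32_t __rem = (uint32_t)((n) % (base));   \
            (n) /= (base);                               \
            __rem;                                       \
    })

    int main(void)
    {
            /* Must be u64: a signed or 32-bit numerator is what broke
             * the build on some 32-bit architectures. */
            uint64_t srttb_total = 1000;   /* sum of round-trip samples */
            uint32_t rem = do_div(srttb_total, 3);

            printf("avg=%llu rem=%u\n",
                   (unsigned long long)srttb_total, rem);
            return 0;
    }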

    Fixes: 0166b197c2ed ("GFS2: Average in only non-zero round-trip times ...")

    Signed-off-by: Ben Hutchings
    Signed-off-by: Bob Peterson
    Acked-by: Andreas Gruenbacher

  • What uniquely identifies a glock in the glock hash table is not
    gl_name, but gl_name and its superblock pointer. This patch makes
    the gl_name field correspond to a unique glock identifier. That will
    allow us to simplify hashing with a future patch, since the hash
    algorithm can then take the gl_name and hash its components in one
    operation.
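
    A self-contained sketch of the idea (the struct follows the text; the
    hash itself is illustrative, and real code must ensure struct padding
    is zeroed before hashing it as raw bytes):

    #include <stdint.h>
    #include <stddef.h>

    struct lm_lockname {
            uint64_t ln_number;      /* lock number */
            const void *ln_sbd;      /* superblock, now part of the key */
            unsigned int ln_type;    /* glock type */
    };

    /* FNV-1a over the whole key -- one pass, instead of mixing the name
     * and the superblock pointer in separate steps. */
    static uint64_t hash_lockname(const struct lm_lockname *name)
    {
            const unsigned char *p = (const unsigned char *)name;
            uint64_t h = 14695981039346656037ULL;

            for (size_t i = 0; i < sizeof(*name); i++)
                    h = (h ^ p[i]) * 1099511628211ULL;
            return h;
    }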

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Steven Whitehouse


19 Jun, 2015

1 commit

  • This patch allows the block allocation code to retain the buffers
    for the resource groups so they don't need to be re-read from buffer
    cache with every request. This is a performance improvement that's
    especially noticeable when resource groups are very large. For
    example, with 2GB resource groups and 4K blocks, there can be 33
    blocks for every resource group. This patch allows those 33 buffers
    to be kept around and not read in and thrown away with every
    operation. The buffers are released when the resource group is
    either synced or invalidated.
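
    A back-of-envelope check of the "33 blocks" figure (GFS2_NBBY, the
    number of blocks tracked per bitmap byte, is 4):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long rgrp_bytes = 2ULL << 30;  /* 2GB rgrp */
            unsigned long bsize = 4096;                  /* 4K blocks */

            unsigned long long blocks = rgrp_bytes / bsize;     /* 524288 */
            unsigned long long bitmap_bytes = blocks / 4;       /* GFS2_NBBY */
            unsigned long long bufs = bitmap_bytes / bsize + 1; /* + header */

            printf("%llu buffers per rgrp\n", bufs);  /* prints 33 */
            return 0;
    }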

    Signed-off-by: Bob Peterson
    Reviewed-by: Steven Whitehouse
    Reviewed-by: Benjamin Marzinski


19 May, 2015

1 commit


06 May, 2015

1 commit

  • The function set_rgrp_preferences() does not handle the (rarely
    returned) NULL value from gfs2_rgrpd_get_next() and this patch
    fixes that.

    The fs image in question is only 150MB in size, which allows for
    only 1 rgrp to be created. The in-memory rb tree has only 1 node
    and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
    returns NULL. (Default behavior is to wrap around the rb tree and
    return the first node to give the illusion of a circular linked
    list. In the case of only 1 rgrp, we can't have
    gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
    point to the same rgrp)... that would cause unintended consequences
    and infinite loops.)

    Signed-off-by: Abhi Das
    Signed-off-by: Bob Peterson


24 Apr, 2015

2 commits

  • This patch changes function gfs2_rgrp_congested so that it only factors
    in non-zero values into its average round trip time. If the round-trip
    time is zero for a particular cpu, that cpu has obviously never dealt
    with bouncing the resource group in question, so factoring in a zero
    value will only skew the numbers. It also fixes a compile error on
    some arches related to division.
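
    A toy userspace version of the adjusted averaging (values invented):

    #include <stdio.h>

    int main(void)
    {
            /* Per-CPU smoothed round-trip times; zero means "this CPU has
             * never handled the rgrp glock", not "it was fast". */
            unsigned long srttb[4] = { 0, 120, 0, 80 };
            unsigned long sum = 0;
            unsigned int n = 0;

            for (int i = 0; i < 4; i++) {
                    if (srttb[i]) {   /* skip CPUs with no data */
                            sum += srttb[i];
                            n++;
                    }
            }
            printf("avg srttb = %lu\n", n ? sum / n : 0);
            return 0;
    }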

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

  • This patch changes function gfs2_rgrp_congested so that it uses an
    average srttb (smoothed round trip time for blocking rgrp glocks)
    rather than the CPU-specific value. If we use the CPU-specific value
    it can incorrectly report no contention when there really is contention
    due to the glock processing occurring on a different CPU.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse


19 Mar, 2015

1 commit

  • struct gfs2_alloc_parms is passed to gfs2_quota_check() and
    gfs2_inplace_reserve() with ap->target containing the number of
    blocks being requested for allocation in the current operation.

    We add a new field to struct gfs2_alloc_parms called 'allowed'.
    gfs2_quota_check() and gfs2_inplace_reserve() return the max
    blocks allowed by quota and the max blocks allowed by the chosen
    rgrp respectively in 'allowed'.

    A new field 'min_target', when non-zero, tells gfs2_quota_check()
    and gfs2_inplace_reserve() not to return -EDQUOT/-ENOSPC when
    there are at least 'min_target' blocks allowable/available. The
    assumption is that the caller is OK with just 'min_target' blocks
    and will likely proceed with allocating them.
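
    The resulting parameter block, roughly (a sketch based on the fields
    named above, not the verbatim kernel definition):

    typedef unsigned long long u64;
    typedef unsigned int u32;

    struct gfs2_alloc_parms {
            u64 target;       /* blocks requested by this operation */
            u32 min_target;   /* smallest acceptable count; 0 = exact */
            u64 allowed;      /* out: max blocks quota/rgrp permit */
    };

    /* A caller that can live with a partial allocation sets min_target;
     * if quota or the rgrp cannot cover 'target' but can cover at least
     * 'min_target', the checks succeed and 'allowed' says how much. */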

    Signed-off-by: Abhi Das
    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse


04 Nov, 2014

2 commits


03 Oct, 2014

1 commit


19 Sep, 2014

1 commit

  • This patch checks whether i_goal is either zero or doesn't exist
    within any rgrp (i.e. gfs2_blk2rgrpd() returns NULL). If so, it
    assigns the ip->i_no_addr block as the i_goal.

    There are two scenarios where a bad i_goal can result in a
    -EBADSLT error.

    1. Attempting to allocate to an existing inode:
    Control reaches gfs2_inplace_reserve() and ip->i_goal is bad.
    We need to fix i_goal here.

    2. A new inode is created in a directory whose i_goal is hosed:
    In this case, the parent dir's i_goal is copied onto the new
    inode. Since the new inode is not yet created, the ip->i_no_addr
    field is invalid and so, the fix in gfs2_inplace_reserve() as per
    1) won't work in this scenario. We need to catch and fix it sooner
    in the parent dir itself (gfs2_create_inode()), before it is
    copied to the new inode.
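
    The shape of the check, as a pseudo-kernel sketch (the helper name is
    hypothetical; gfs2_blk2rgrpd() and the fields are the ones named
    above):

    /* Hypothetical helper: reset a bad goal block to the inode's own
     * address, which is always valid for an existing inode. */
    static void check_and_update_goal(struct gfs2_inode *ip)
    {
            struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);

            if (!ip->i_goal || gfs2_blk2rgrpd(sdp, ip->i_goal, 1) == NULL)
                    ip->i_goal = ip->i_no_addr;
    }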

    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse


18 Jul, 2014

1 commit


14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work of flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse


07 Mar, 2014

2 commits

  • Add pr_fmt, remove embedded "GFS2: " prefixes.
    This now consistently emits lower case "gfs2: " for each message.

    Other miscellanea around these changes:

    o Add missing newlines
    o Coalesce formats
    o Realign arguments
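
    A userspace sketch of the mechanism (pr_fmt is the real kernel
    convention; printf stands in for printk):

    #include <stdio.h>

    /* Defined once at the top of each file, before any pr_*() use;
     * every call site then picks up the prefix automatically. */
    #define pr_fmt(fmt) "gfs2: " fmt
    #define pr_warn(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)

    int main(void)
    {
            pr_warn("fsid=%s: quota exceeded\n", "test");
            /* prints: gfs2: fsid=test: quota exceeded */
            return 0;
    }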

    Signed-off-by: Joe Perches
    Signed-off-by: Steven Whitehouse

    - All printk(KERN_foo ...) calls converted to pr_foo().
    - Messages updated to fit in 80 columns.
    - fs_macros converted as well.
    - fs_printk removed.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Steven Whitehouse


10 Feb, 2014

1 commit

  • Mark functions as static in gfs2/rgrp.c because they are not used
    outside this file.

    This eliminates the following warning in gfs2/rgrp.c:
    fs/gfs2/rgrp.c:1092:5: warning: no previous prototype for ‘gfs2_rgrp_bh_get’ [-Wmissing-prototypes]
    fs/gfs2/rgrp.c:1157:5: warning: no previous prototype for ‘update_rgrp_lvb’ [-Wmissing-prototypes]
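
    The warning and its fix in miniature (hypothetical function; the same
    treatment applies to gfs2_rgrp_bh_get and update_rgrp_lvb):

    /* Before (external linkage, no prototype in any header):
     *
     *     int some_rgrp_helper(void) { return 0; }
     *
     * After -- static gives internal linkage and silences
     * -Wmissing-prototypes: */
    static int some_rgrp_helper(void) { return 0; }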

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Signed-off-by: Steven Whitehouse


04 Feb, 2014

1 commit

  • This is another step towards improving the allocation of xattr
    blocks at inode allocation time. Here we take advantage of
    Christoph's recent work on ACLs to allocate a block for the
    xattrs early if we know that we will be adding ACLs to the
    inode later on. The advantage of that is that it is much
    more likely that we'll get a contiguous run of two blocks
    where the first is the inode and the second is the xattr block.

    We still have to fall back to the original system in case we
    don't get the requested two contiguous blocks, or in case the
    ACLs are too large to fit into the block.

    Future patches will move more of the ACL setting code further
    up the gfs2_create_inode() function. Also, I'd like to be
    able to do the same thing with the xattrs from LSMs in
    due course, too. That way we should be able to slowly reduce
    the number of independent transactions, at least in the
    most common cases.

    Signed-off-by: Steven Whitehouse


16 Jan, 2014

2 commits

  • This is a small cleanup to function gfs2_rgrp_go_lock so that it
    uses rgd instead of its more complicated twin.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro


03 Jan, 2014

5 commits

  • Each rgrp header is represented as a single extent on disk, so we
    can calculate the position within the address space, since we are
    using address spaces mapped 1:1 to the disk. This means that it
    is possible to use the range-based versions of filemap_fdatawrite/wait,
    and to invalidate the page cache by range as well.

    Our eventual intent is to then be able to merge the address spaces
    used for rgrps into a single address space, rather than to have
    one for each glock, saving memory and reducing complexity.

    Since during umount, the rgrp structures are disposed of before
    the glocks, we need to store the extent information in the glock
    so that it is available for a final invalidation. This patch uses
    a field which is otherwise unused in rgrp glocks to do that, so
    that we do not have to expand the size of a glock.
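
    A runnable sketch of the range computation (rd_addr and rd_length are
    the real extent fields; the values are invented):

    #include <stdio.h>

    int main(void)
    {
            unsigned long long rd_addr = 17; /* first block of the extent */
            unsigned int rd_length = 33;     /* blocks in the extent */
            unsigned int bsize_shift = 12;   /* 4K blocks */

            /* 1:1 mapping: disk block number << block shift gives the
             * byte offset within the address space. */
            unsigned long long start = rd_addr << bsize_shift;
            unsigned long long end =
                    start + ((unsigned long long)rd_length << bsize_shift) - 1;

            /* These bounds feed filemap_fdatawrite_range() and friends. */
            printf("bytes %llu..%llu\n", start, end);
            return 0;
    }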

    Signed-off-by: Steven Whitehouse

  • Since gfs2_inplace_reserve() is always called with a valid
    alloc parms structure, there is no need to test for this
    within the function itself - and in any case, the test happened
    after we'd already dereferenced it anyway.
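
    The anti-pattern being removed, in miniature (all names hypothetical):

    struct parms { unsigned long long target; };

    static long reserve(struct parms *ap)
    {
            unsigned long long want = ap->target; /* dereference here... */

            if (!ap)              /* ...makes this test dead code */
                    return -1;
            return (long)want;
    }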

    Reported-by: Dan Carpenter
    Signed-off-by: Steven Whitehouse

  • With the preceding patch, we started accepting block reservations
    smaller than the ideal size, which requires a lot more parsing of the
    bitmaps. To reduce the amount of bitmap searching, this patch
    implements a scheme whereby each rgrp keeps track of the point
    at which multi-block reservations will fail.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

  • This is just basically a resend of a patch I posted earlier.
    It didn't change from its original, except in diff offsets, etc:

    This patch fixes a bug in the GFS2 block allocation code. The problem
    starts if a process already has a multi-block reservation, but for
    some reason, another process disqualifies it from further allocations.
    For example, the other process might set the GFS2_RDF_ERROR bit.
    The process holding the reservation jumps to label skip_rgrp, but
    that label comes after the code that removes the reservation from the
    tree. Therefore, the no longer usable reservation is not removed from
    the rgrp's reservations tree; it's lost. Eventually, the lost reservation
    causes the count of reserved blocks to get off, and eventually that
    causes a BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free) to trigger.
    This patch moves the call to after label skip_rgrp so that the
    disqualified reservation is properly removed from the tree, thus keeping
    the rgrp rd_reserved count sane.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

  • Here is a second try at a patch I posted earlier, which also implements
    suggestions Steve made:

    Before this patch, GFS2 would keep searching through all the rgrps
    until it found one that had a chunk of free blocks big enough to
    satisfy the size hint, which is based on the file write size,
    regardless of whether the chunk was big enough to perform the write.
    However, when doing big writes there may not be a large enough
    chunk of free blocks in any rgrp, due to file system fragmentation.
    The largest chunk may be big enough to satisfy the write request,
    but it may not meet the ideal reservation size from the "size hint".
    The writes would slow to a crawl because every write would search
    every rgrp, then finally give up and default to a single-block write.
    In my case, performance would drop from 425MB/s to 18KB/s, or 24000
    times slower.

    This patch basically makes it so that if we can't find a contiguous
    chunk of blocks big enough to satisfy the size hint, we'll use the
    largest chunk of blocks we found that will still contain the write.
    It does so by keeping track of the largest run of blocks within the
    rgrp.
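
    A toy userspace version of the fallback logic (bitmap and numbers
    invented):

    #include <stdio.h>

    int main(void)
    {
            const char *map = "1100011110001111111000"; /* 1 = free block */
            unsigned int hint = 10;  /* ideal reservation (size hint) */
            unsigned int run = 0, best = 0;

            /* Track the largest run of free blocks seen in the rgrp. */
            for (const char *p = map; *p; p++) {
                    run = (*p == '1') ? run + 1 : 0;
                    if (run > best)
                            best = run;
            }

            /* No run satisfies the hint, so settle for the largest run
             * that still holds the write instead of searching forever. */
            printf("hint=%u, using %u\n", hint, best < hint ? best : hint);
            return 0;
    }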

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse


16 Nov, 2013

1 commit


02 Oct, 2013

2 commits

  • When setting the starting point for block allocation, there were calls
    to both gfs2_rbm_to_block() and gfs2_rbm_from_block() in the common case
    of there being an active reservation. The gfs2_rbm_from_block() function
    can be quite slow, and since the two conversions were effectively a
    no-op, it makes sense to avoid them entirely in this case.

    There is no functional change here, but the code should be a bit more
    efficient after this patch.

    Signed-off-by: Steven Whitehouse

  • This patch adds a structure to contain allocation parameters with
    the intention of future expansion of this structure. The idea is
    that we should be able to add more information about the allocation
    in the future in order to allow the allocator to make a better job
    of placing the requests on-disk.

    There is no functional difference from applying this patch.

    Signed-off-by: Steven Whitehouse


27 Sep, 2013

1 commit

  • The reservation for an inode should be cleared when it is truncated so
    that we can start again at a different offset for future allocations.
    We could try and do better than that, by resetting the search based on
    where the truncation started from, but this is only a first step.

    In addition, there are three callers of gfs2_rs_delete() but only one
    of those should really be testing the value of i_writecount. While
    we get away with that in the other cases currently, I think it would
    be better if we made that test specific to the one case which
    requires it.

    Signed-off-by: Steven Whitehouse


18 Sep, 2013

2 commits

  • Since the previous patch eliminated bi in favor of bii, this follow-on
    patch needed to be adjusted accordingly. Here is the revised version.

    This patch adds a new function, gfs2_rbm_incr, which increments
    an rbm structure. This is more efficient than calling gfs2_rbm_to_block,
    incrementing, then calling gfs2_rbm_from_block.
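
    A toy userspace equivalent (sizes invented; the real function operates
    on struct gfs2_rbm and reports when it walks off the end of the rgrp):

    #include <stdio.h>

    struct rbm { unsigned int bii, offset; }; /* bitmap index + offset */

    #define NR_BITMAPS     33
    #define BLOCKS_PER_BI  32632

    /* Step one block forward without converting to a disk address and
     * back; returns 1 when we run past the last bitmap. */
    static int rbm_incr(struct rbm *rbm)
    {
            if (rbm->offset + 1 < BLOCKS_PER_BI) {
                    rbm->offset++;
                    return 0;
            }
            rbm->offset = 0;        /* roll over into the next bitmap */
            rbm->bii++;
            return rbm->bii >= NR_BITMAPS;
    }

    int main(void)
    {
            struct rbm rbm = { .bii = 0, .offset = BLOCKS_PER_BI - 1 };

            printf("end=%d bii=%u offset=%u\n",
                   rbm_incr(&rbm), rbm.bii, rbm.offset);
            return 0;
    }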

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

  • This is a respin of the original patch. As Steve pointed out, the
    introduction of field bii makes it easy to eliminate bi itself.
    This revised patch does just that, replacing bi with bii.

    This patch adds a new field to the rbm structure, called bii,
    which is an index into the array of bitmaps for an rgrp.
    This replaces *bi which was a pointer to the bitmap.
    This is being done for further optimizations.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse


17 Sep, 2013

2 commits

  • When we used try locks for rgrps on block allocations, it was important
    to clear the flags field so that we used a blocking hold on the glock.
    Now that we're not doing try locks, clearing flags is unnecessary, and
    a waste of time. In fact, it's probably doing the wrong thing because
    it clears the GL_SKIP bit that was set for the lvb tracking purposes.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

  • This patch introduces a new field in the bitmap structure called
    bi_blocks. Its purpose is to save us from constantly multiplying
    bi_len by the constant GFS2_NBBY. It also paves the way for more
    optimization in a future patch.
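
    The caching in miniature (GFS2_NBBY really is 4, the number of blocks
    described per bitmap byte; the struct is a stand-in):

    #define GFS2_NBBY 4

    struct bi_mock {
            unsigned int bi_len;     /* bytes of bitmap in this block */
            unsigned int bi_blocks;  /* cached: bi_len * GFS2_NBBY */
    };

    /* Computed once at rgrp setup instead of on every bitmap search. */
    static void bi_init(struct bi_mock *bi, unsigned int len)
    {
            bi->bi_len = len;
            bi->bi_blocks = len * GFS2_NBBY;
    }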

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse
