03 Jul, 2020

1 commit

  • Before this patch, function freeze_go_sync, called when promoting
    the freeze glock, tested for the SDF_JOURNAL_LIVE superblock flag.
    That flag is only set for read-write mounts. Read-only mounts don't
    use a journal, so the bit was never set and the freeze never happened.

    This patch removes the check for SDF_JOURNAL_LIVE for freeze requests
    but still checks it when deciding whether to flush a journal.
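
    The resulting logic, sketched (a hedged simplification; the
    surrounding freeze machinery is omitted):

        static int freeze_go_sync(struct gfs2_glock *gl)
        {
                struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

                /* Freeze unconditionally; read-only mounts never set
                 * SDF_JOURNAL_LIVE, but must still freeze. */
                ...
                /* Only flush the journal if it is actually live. */
                if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
                        gfs2_log_flush(sdp, NULL,
                                       GFS2_LOG_HEAD_FLUSH_FREEZE);
                return 0;
        }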

    Signed-off-by: Bob Peterson

    Bob Peterson
     

03 Jun, 2020

1 commit

  • Before this patch, function inode_go_sync ignored io errors from
    the inode data write, overwriting them with metadata write errors:

            error = filemap_fdatawait(mapping);
            mapping_set_error(mapping, error);
        }
        error = filemap_fdatawait(metamapping);
        ...
        return error;

    So any errors returned by the inode write would be forgotten if the
    metadata write succeeded. This patch still does both writes, but
    only sets error if it's still zero. That way, any errors will be
    reported to the caller, do_xmote, which will take appropriate
    action and report the error.
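
    Sketched, the fixed flow looks like this (hedged; "ret" is a local
    introduced in this sketch):

            error = filemap_fdatawait(mapping);
            mapping_set_error(mapping, error);
        }
        ret = filemap_fdatawait(metamapping);
        if (!error)
                error = ret;    /* preserve the first (data) error */
        ...
        return error;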

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

27 Feb, 2020

4 commits

  • Before this patch, function do_xmote would try to sync out the glock's
    dirty data by calling the appropriate glops function XXX_go_sync(),
    but it did not check the return code. If the sync was not possible
    due to an io error or the like, do_xmote would continue on, call
    go_inval, and release the glock to other cluster nodes. When those
    nodes went to replay the journal, they might already be holding
    glocks for the journal records that should have been synced, but were
    not due to the ignored error.

    This patch introduces proper error code checking to the go_sync
    family of glops functions.
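
    In do_xmote, the checking amounts to something like this (a hedged
    sketch; the label name is illustrative):

        if (glops->go_sync) {
                ret = glops->go_sync(gl);
                if (ret) {
                        /* Do not invalidate or hand the glock to other
                         * nodes if the sync failed. */
                        goto skip_inval;
                }
        }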

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     
  • Before this patch, if gfs2_ail_empty_gl saw there was nothing on
    the ail list, it would return and not flush the log. The problem
    is that there could still be a revoke for the rgrp sitting on the
    sd_log_le_revoke list that's been recently taken off the ail list.
    But that revoke still needs to be written, and the rgrp_go_inval
    still needs to call log_flush_wait to ensure the revokes are all
    properly written to the journal before we relinquish control of
    the glock to another node. If we give the glock to another node
    before we have this guarantee, the node might crash and have its
    journal replayed, in which case the missing revoke would allow the
    journal replay to replay the rgrp on top of the rgrp we already gave
    to another node, thus overwriting its changes and corrupting the
    file system.

    This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
    than returning.
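
    Sketched (hedged; flag names as in the log-flush interface):

        static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
        {
                ...
                if (!tr.tr_revokes) {
                        /* Nothing on the ail, but a revoke may still sit
                         * on sd_log_le_revoke; flush it to the journal. */
                        goto flush;
                }
                ...
        flush:
                gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
                               GFS2_LFC_AIL_EMPTY_GL);
        }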

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     
  • Before this patch, the rgrp_go_inval and inode_go_inval functions each
    checked whether any items were left on the ail list (by way of a
    count), and if so, did a withdraw. But the withdraw code now uses
    glocks when changing the file system to read-only status. So we
    cannot have glock functions withdrawing, or a hang will likely result:
    the glocks can't be serviced by the work_func if the work_func is
    busy doing its own withdraw.

    This patch removes the checks from the go_inval functions and adds
    a centralized check in do_xmote to warn about the problem and not
    withdraw, but flag the error so it's caught when the logd daemon
    eventually runs.
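
    A hedged sketch of the centralized check (flag name approximate):

        /* In do_xmote: warn and record the problem, but never withdraw
         * from glock context; logd will act on the flag later. */
        if (atomic_read(&gl->gl_ail_count)) {
                fs_err(sdp, "glock %u:%llu still has ail items\n",
                       gl->gl_name.ln_type,
                       (unsigned long long)gl->gl_name.ln_number);
                set_bit(SDF_WITHDRAWING, &sdp->sd_flags);
        }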

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     
  • When a node withdraws from a file system, it often leaves its journal
    in an incomplete state. This is especially true when the withdraw is
    caused by io errors writing to the journal. Before this patch, a
    withdraw would try to write a "shutdown" record to the journal, tell
    dlm it was done with the file system, and none of the other nodes
    would know about the problem. Later, when the problem is fixed and the
    withdrawn node is rebooted, it would then discover that its own
    journal was incomplete, and replay it. However, replaying it at this
    point is almost guaranteed to introduce corruption because the other
    nodes are likely to have used affected resource groups that appeared
    in the journal since the time of the withdraw. Replaying the journal
    later would overwrite those changes, through no fault of dlm, which
    was instructed during the withdraw to release those resources.

    This patch makes file system withdraws visible to the entire cluster.
    Withdrawing nodes dequeue their journal glock to allow recovery.

    The remaining nodes check all the journals to see if they are
    clean or in need of replay. They try to replay dirty journals, but
    only the journals of withdrawn nodes will be "not busy" and
    therefore available for replay.

    Until the journal replay is complete, no i/o related glocks may be
    given out, to ensure that the replay does not cause the
    aforementioned corruption: We cannot allow any journal replay to
    overwrite blocks associated with a glock once it is held.

    The "live" glock which is now used to signal when a withdraw
    occurs. When a withdraw occurs, the node signals its withdraw by
    dequeueing the "live" glock and trying to enqueue it in EX mode,
    thus forcing the other nodes to all see a demote request, by way
    of a "1CB" (one callback) try lock. The "live" glock is not
    granted in EX; the callback is only just used to indicate a
    withdraw has occurred.

    Note that all nodes in the cluster must wait for the recovering
    node to finish replaying the withdrawing node's journal before
    continuing. To this end, each node checks that the journals are
    clean multiple times in a retry loop.

    Also note that the withdraw function may be called from a wide
    variety of situations, and therefore, we need to take extra
    precautions to make sure pointers are valid before using them in
    many circumstances.

    We also need to take care when glocks decide to withdraw, since
    the withdraw code now uses glocks.

    Also, before this patch, if a process encountered an error and
    decided to withdraw while another process was already withdrawing,
    the second withdraw would be silently ignored, leaving it free
    to unlock its glocks. That's correct behavior if the original
    withdrawer encounters further errors down the road. But if
    secondary waiters don't wait for the journal replay, unlocking
    glocks will allow other nodes to use them, despite the fact that
    the journal containing those blocks is being replayed. The
    replay needs to finish before our glocks are released to other
    nodes. IOW, secondary withdraws need to wait for the first
    withdraw to finish.

    For example, if an rgrp glock is unlocked by a process that didn't
    wait for the first withdraw, a journal replay could introduce file
    system corruption by replaying a rgrp block that has already been
    granted to a different cluster node.
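
    The "live" glock signaling described above reduces to a few calls
    (a hedged sketch using the live holder in the superblock):

        /* Signal withdraw: drop the "live" glock, then try to take it
         * in EX with a one-callback try lock so every node sees a
         * demote request. The EX grant itself never happens. */
        gfs2_glock_dq_wait(&sdp->sd_live_gh);
        gfs2_holder_reinit(LM_ST_EXCLUSIVE,
                           LM_FLAG_TRY_1CB | LM_FLAG_NOEXP,
                           &sdp->sd_live_gh);
        gfs2_glock_nq(&sdp->sd_live_gh);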

    Signed-off-by: Bob Peterson

    Bob Peterson
     

21 Feb, 2020

1 commit

  • We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
    when we're withdrawn. Conversely, to maintain metadata integrity, we must
    disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
    the iopen or transaction glocks, may be safely used because none of their
    metadata goes through the journal. So in general, we should disallow all
    glocks with an address space, and allow all the others. One exception:
    we need to allow our active journal to be demoted so others may recover it.

    Allowing glocks after withdraw gives us the ability to take appropriate
    action (in a following patch) to have our journal properly replayed by
    another node rather than just abandoning the current transactions and
    pretending nothing bad happened, leaving the other nodes free to modify
    the blocks we had in our journal, which may result in file system
    corruption.
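
    The rule boils down to a predicate along these lines (a hedged
    sketch; helper names approximate):

        static bool glock_blocked_by_withdraw(struct gfs2_glock *gl)
        {
                struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

                if (!gfs2_withdrawn(sdp))
                        return false;
                if (!gfs2_glock2aspace(gl))
                        return false;   /* no journaled metadata: allow */
                if (sdp->sd_jdesc &&
                    gl == GFS2_I(sdp->sd_jdesc->jd_inode)->i_gl)
                        return false;   /* let our journal be recovered */
                return true;
        }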

    Signed-off-by: Bob Peterson

    Bob Peterson
     

10 Feb, 2020

2 commits

  • Before this patch, the rgrp code had a serious problem related to
    how it managed buffer_heads for resource groups. The problem caused
    file system corruption, especially in cases of journal replay.

    When an rgrp glock was demoted to transfer ownership to a
    different cluster node, do_xmote() first called rgrp_go_sync and then
    rgrp_go_inval, as expected. rgrp_go_sync called gfs2_rgrp_brelse(),
    which dropped the buffer_head reference count.
    In most cases, the reference count went to zero, which is right.
    However, there were other places where the buffers were handled
    differently.

    After rgrp_go_sync, do_xmote called rgrp_go_inval, which called
    gfs2_rgrp_brelse a second time; then rgrp_go_inval's call to
    truncate_inode_pages_range would get rid of the pages in memory,
    but only if the reference count dropped to 0.

    Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
    So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
    to the buffer_heads in cases where the reference count was still 1.
    Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
    it failed the check for "if (bi->bi_bh)" and thus failed to call
    brelse a second time. Because of that, the reference count on those
    buffers sometimes failed to drop from 1 to 0. And that caused
    function truncate_inode_pages_range to keep the pages in page cache
    rather than freeing them.

    The next time the rgrp glock was acquired, the metadata read of
    the rgrp buffers re-used the pages in memory, which were now
    wrong because they were likely modified by the other node who
    acquired the glock in EX (which is why we demoted the glock).
    This re-use of the page cache caused corruption because changes
    made by the other nodes were never seen, so the bitmaps were
    inaccurate.

    For some reason, the problem became most apparent when journal
    replay forced the replay of rgrps in memory, which caused newer
    rgrp data to be overwritten by the older in-core pages.

    A big part of the problem was that the rgrp buffers were released
    in multiple places: the go_unlock function would release them when
    the glock was released rather than when it was demoted, which is
    clearly wrong because our intent was to cache them until the glock
    is demoted from SH or EX.

    This patch attempts to clean up the mess and make one consistent
    and centralized mechanism for managing the rgrp buffer_heads by
    implementing several changes:

    1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
    We don't want to release the buffers or zero the pointers when
    syncing for the reasons stated above. It only makes sense to
    release them when the glock is actually invalidated (go_inval).
    And when we do, then we set the bh pointers to NULL.
    2. The go_unlock function (which was only used for rgrps) is
    eliminated, as we've talked about doing many times before.
    The go_unlock function was called too early in the glock dq
    process, and should not happen until the glock is invalidated.
    3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
    That will now happen automatically when the rgrp glocks are
    demoted, and shouldn't happen any sooner or later than that.
    Instead, function gfs2_clear_rgrpd has been modified to demote
    the rgrp glocks, and therefore, free those pages, before the
    remaining glocks are culled by gfs2_gl_hash_clear. This
    prevents the gl_object from hanging around when the glocks are
    culled.
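
    With changes 1 and 2, invalidation becomes the single release point,
    roughly (a hedged sketch):

        static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
        {
                struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
                struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);

                if (rgd)
                        gfs2_rgrp_brelse(rgd); /* drops and NULLs bi_bh */
                truncate_inode_pages_range(&sdp->sd_aspace,
                                           gl->gl_vm.start, gl->gl_vm.end);
        }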

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     
  • Split gfs2_lm_withdraw into a function that prints an error message and a
    function that withdraws the filesystem.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

08 Jan, 2020

1 commit

  • Every caller of function gfs2_struct2blk specified sizeof(u64).

    This patch eliminates the unnecessary parameter and replaces the
    size calculation with a new superblock variable that is computed
    to be the maximum number of block pointers we can fit inside a
    log descriptor, as is done for pointers per dinode and indirect
    block.
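
    The mount-time computation presumably looks like this (hedged;
    field named by analogy with sd_diptrs/sd_inptrs):

        /* Block pointers that fit in one log descriptor block. */
        sdp->sd_ldptrs = (sdp->sd_sb.sb_bsize -
                          sizeof(struct gfs2_log_descriptor)) /
                         sizeof(__be64);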

    Signed-off-by: Bob Peterson
    Reviewed-by: Andrew Price
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

15 Nov, 2019

1 commit

  • Add function gfs2_withdrawn and replace all checks for the SDF_WITHDRAWN
    bit with calls to it. This does not change the logic or function of gfs2,
    and it facilitates later improvements to the withdraw sequence.
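
    The helper is essentially a one-line wrapper around the existing
    bit test:

        static inline bool gfs2_withdrawn(struct gfs2_sbd *sdp)
        {
                return test_bit(SDF_WITHDRAWN, &sdp->sd_flags);
        }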

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

28 Jun, 2019

3 commits

  • This patch replaces a few leftover printk error messages with calls to
    fs_info and similar, so that the file system having the error is
    properly identified in the log.
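
    An illustrative conversion (the actual messages vary):

        - printk(KERN_WARNING "GFS2: fatal: I/O error\n");
        + fs_err(sdp, "fatal: I/O error\n");  /* prefixes the fsid */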

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Before this patch, if a glock error was encountered, the glock with
    the problem was dumped. But you may have lots of file systems mounted,
    and the dump didn't tell you which file system the glock belonged to.

    This patch adds a new boolean parameter fsid to the dump_glock family
    of functions. For non-error cases, such as dumping the glocks debugfs
    file, the fsid is not dumped in order to keep lock dumps and glocktop
    as clean as possible. For all error cases, such as GLOCK_BUG_ON, the
    file system id is now printed. This will make it easier to debug.
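
    The dump entry point now carries the extra flag, along these lines:

        void gfs2_dump_glock(struct seq_file *seq, struct gfs2_glock *gl,
                             bool fsid);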

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Before this patch, the superblock flag indicating when a file system
    is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to
    the more obvious SDF_WITHDRAWN.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this copyrighted material is made available to anyone wishing to use
    modify copy or redistribute it subject to the terms and conditions
    of the gnu general public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 44 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531081038.653000175@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

1 commit

  • Use bios to read the journal into the address space of the journal inode
    (jd_inode), sequentially and in large chunks. This is faster for locating
    the journal head than the previous binary search approach. When performing
    recovery, we keep the journal in the address space until recovery is done,
    which further speeds things up.

    Signed-off-by: Abhi Das
    Signed-off-by: Andreas Gruenbacher

    Abhi Das
     

12 Dec, 2018

2 commits

  • This patch is based on an idea from Steve Whitehouse. The idea is
    to dump the number of pages for inodes in the glock dumps.
    The additional locking required me to drop const from quite a few
    places.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Use bio(s) to read in the journal sequentially in large chunks and
    locate the head of the journal.

    This version addresses the issues Christoph pointed out w.r.t. error
    handling and the use of a deprecated API.

    Signed-off-by: Abhi Das
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Cc: Christoph Hellwig

    Abhi Das
     

06 Jun, 2018

1 commit

  • struct timespec is not y2038 safe. Transition vfs to use
    y2038 safe struct timespec64 instead.

    The change was made with the help of the following coccinelle
    script. This catches about 80% of the changes.
    All the header file and logic changes are included in the
    first 5 rules. The rest are trivial substitutions.
    I avoid changing any of the function signatures or any other
    filesystem specific data structures to keep the patch simple
    for review.

    The script can be a little shorter by combining different cases.
    But, this version was sufficient for my usecase.

    virtual patch

    @ depends on patch @
    identifier now;
    @@
    - struct timespec
    + struct timespec64
    current_time ( ... )
    {
    - struct timespec now = current_kernel_time();
    + struct timespec64 now = current_kernel_time64();
    ...
    - return timespec_trunc(
    + return timespec64_trunc(
    ... );
    }

    @ depends on patch @
    identifier xtime;
    @@
    struct \( iattr \| inode \| kstat \) {
    ...
    - struct timespec xtime;
    + struct timespec64 xtime;
    ...
    }

    @ depends on patch @
    identifier t;
    @@
    struct inode_operations {
    ...
    int (*update_time) (...,
    - struct timespec t,
    + struct timespec64 t,
    ...);
    ...
    }

    @ depends on patch @
    identifier t;
    identifier fn_update_time =~ "update_time$";
    @@
    fn_update_time (...,
    - struct timespec *t,
    + struct timespec64 *t,
    ...) { ... }

    @ depends on patch @
    identifier t;
    @@
    lease_get_mtime( ... ,
    - struct timespec *t
    + struct timespec64 *t
    ) { ... }

    @te depends on patch forall@
    identifier ts;
    local idexpression struct inode *inode_node;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn_update_time =~ "update_time$";
    identifier fn;
    expression e, E3;
    local idexpression struct inode *node1;
    local idexpression struct inode *node2;
    local idexpression struct iattr *attr1;
    local idexpression struct iattr *attr2;
    local idexpression struct iattr attr;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    @@
    (
    (
    - struct timespec ts;
    + struct timespec64 ts;
    |
    - struct timespec ts = current_time(inode_node);
    + struct timespec64 ts = current_time(inode_node);
    )

    <+...
    (
    - timespec_equal(&inode_node->i_xtime, &ts)
    + timespec64_equal(&inode_node->i_xtime, &ts)
    |
    - timespec_equal(&ts, &inode_node->i_xtime)
    + timespec64_equal(&ts, &inode_node->i_xtime)
    |
    - timespec_compare(&inode_node->i_xtime, &ts)
    + timespec64_compare(&inode_node->i_xtime, &ts)
    |
    - timespec_compare(&ts, &inode_node->i_xtime)
    + timespec64_compare(&ts, &inode_node->i_xtime)
    |
    ts = current_time(e)
    |
    fn_update_time(..., &ts,...)
    |
    inode_node->i_xtime = ts
    |
    node1->i_xtime = ts
    |
    ts = inode_node->i_xtime
    |
    <+... attr1->ia_xtime ...+> = ts
    |
    ts = attr1->ia_xtime
    |
    ts.tv_sec
    |
    ts.tv_nsec
    |
    btrfs_set_stack_timespec_sec(..., ts.tv_sec)
    |
    btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
    |
    - ts = timespec64_to_timespec(
    + ts =
    ...
    -)
    |
    - ts = ktime_to_timespec(
    + ts = ktime_to_timespec64(
    ...)
    |
    - ts = E3
    + ts = timespec_to_timespec64(E3)
    |
    - ktime_get_real_ts(&ts)
    + ktime_get_real_ts64(&ts)
    |
    fn(...,
    - ts
    + timespec64_to_timespec(ts)
    ,...)
    )
    ...+>
    (

    )
    |
    - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
    |
    - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
    |
    - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
    |
    node1->i_xtime1 =
    - timespec_trunc(attr1->ia_xtime1,
    + timespec64_trunc(attr1->ia_xtime1,
    ...)
    |
    - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
    + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
    ...)
    |
    - ktime_get_real_ts(&attr1->ia_xtime1)
    + ktime_get_real_ts64(&attr1->ia_xtime1)
    |
    - ktime_get_real_ts(&attr.ia_xtime1)
    + ktime_get_real_ts64(&attr.ia_xtime1)
    )

    @ depends on patch @
    struct inode *node;
    struct iattr *attr;
    identifier fn;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    expression e;
    @@
    (
    - fn(node->i_xtime);
    + fn(timespec64_to_timespec(node->i_xtime));
    |
    fn(...,
    - node->i_xtime);
    + timespec64_to_timespec(node->i_xtime));
    |
    - e = fn(attr->ia_xtime);
    + e = fn(timespec64_to_timespec(attr->ia_xtime));
    )

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    )
    ...+>
    }

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    struct kstat *stat;
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier i_xtime =~ "^i_[acm]time$";
    identifier xtime =~ "^[acm]time$";
    identifier fn, ret;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(stat->xtime);
    ret = fn (...,
    - &stat->xtime);
    + &ts);
    )
    ...+>
    }

    @ depends on patch @
    struct inode *node;
    struct inode *node2;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier i_xtime3 =~ "^i_[acm]time$";
    struct iattr *attrp;
    struct iattr *attrp2;
    struct iattr attr ;
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    struct kstat *stat;
    struct kstat stat1;
    struct timespec64 ts;
    identifier xtime =~ "^[acmb]time$";
    expression e;
    @@
    (
    \( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
    |
    node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    stat->xtime = node2->i_xtime1;
    |
    stat1.xtime = node2->i_xtime1;
    |
    \( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
    |
    \( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
    |
    - e = node->i_xtime1;
    + e = timespec64_to_timespec( node->i_xtime1 );
    |
    - e = attrp->ia_xtime1;
    + e = timespec64_to_timespec( attrp->ia_xtime1 );
    |
    node->i_xtime1 = current_time(...);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    - node->i_xtime1 = e;
    + node->i_xtime1 = timespec_to_timespec64(e);
    )

    Signed-off-by: Deepa Dinamani

    Deepa Dinamani
     

23 Jan, 2018

2 commits

  • This patch just adds the capability for GFS2 to track which function
    called gfs2_log_flush. This should make it easier to diagnose
    problems based on the sequence of events found in the journals.
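
    Call sites now pass a reason code along with the flush flags, e.g.
    (an illustrative call site):

        gfs2_log_flush(sdp, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
                       GFS2_LFC_INODE_GO_SYNC);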

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     
  • This patch adds a new structure called gfs2_log_header_v2 which is used
    to store expanded fields into previously unused areas of the log headers
    (i.e., this change is backwards compatible). Some of these are used for
    debug purposes so we can backtrack when problems occur. Others are
    reserved for future expansion.

    This patch is based on a prototype from Steve Whitehouse.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

10 Aug, 2017

1 commit

  • Remove gfs2_set_nlink which prevents the link count of an inode from
    becoming non-zero once it has reached zero. The next commit reduces the
    amount of waiting on glocks when an inode is evicted from memory. With
    that, an inode can become reallocated before all the remote-unlink
    callbacks from a previous delete are processed, which causes the link
    count to change from zero to non-zero.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

21 Jul, 2017

1 commit

  • In the inode_go_lock() function, the parameter order of list_add() is
    wrong. According to the definition of list_add(), the first parameter
    is the new entry and the second is the list head, so ip->i_trunc_list
    should be the first parameter and sdp->sd_trunc_list the second.
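
    That is, the fix is simply:

        - list_add(&sdp->sd_trunc_list, &ip->i_trunc_list);
        + list_add(&ip->i_trunc_list, &sdp->sd_trunc_list);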

    Signed-off-by: Wang Xibo
    Signed-off-by: Xiao Likun
    Signed-off-by: Bob Peterson

    Wang Xibo
     

17 Jul, 2017

1 commit

  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
     

05 Jul, 2017

2 commits

  • Put all remaining accesses to gl->gl_object under the
    gl->gl_lockref.lock spinlock to prevent races.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
    work queue to make sure that inode glops which dereference gl->gl_object
    have finished running before the inode is destroyed. However, flushing
    the work queue may do more work than needed, and in particular, it may
    call into DLM, which we want to avoid here. Use a bit lock
    (GIF_GLOP_PENDING) to synchronize between the inode glops and
    gfs2_evict_inode instead to get rid of the flushing.

    In addition, flush the work queues of existing glocks before reusing
    them for new inodes to get those glocks into a known state: the glock
    state engine currently doesn't handle glock re-appropriation correctly.
    (We may be able to fix the glock state engine instead later.)

    Based on a patch by Steven Whitehouse.
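
    A hedged sketch of the handshake (standard bit-lock primitives;
    helper name per the description):

        /* glops side: done touching gl_object, release the bit lock. */
        static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
        {
                if (!ip)
                        return;
                clear_bit_unlock(GIF_GLOP_PENDING, &ip->i_flags);
                smp_mb__after_atomic();
                wake_up_bit(&ip->i_flags, GIF_GLOP_PENDING);
        }

        /* evict side: wait for the glops instead of flushing the
         * glock work queue. */
        wait_on_bit_io(&ip->i_flags, GIF_GLOP_PENDING,
                       TASK_UNINTERRUPTIBLE);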

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

05 Apr, 2016

1 commit

  • Function inode_go_demote_ok had some code that was only executed
    if gl_holders was not empty. However, if gl_holders was not empty,
    the only caller, demote_ok(), returns before inode_go_demote_ok
    would ever be called. Therefore, it's dead code, so I removed it.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

25 Dec, 2015

1 commit

  • When gfs2 releases the glock of an inode, it must invalidate all
    information cached for that inode, including the page cache and acls.
    Use the new security_inode_invalidate_secctx hook to also invalidate
    security labels in that case. These items will be reread from disk
    when needed after reacquiring the glock.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: Bob Peterson
    Acked-by: Steven Whitehouse
    Cc: cluster-devel@redhat.com
    [PM: fixed spelling errors and description line lengths]
    Signed-off-by: Paul Moore

    Andreas Gruenbacher
     

30 Oct, 2015

1 commit

  • Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
    gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
    gl_lockref). Remove that define to make the references to gl_lockref.lock more
    obvious.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

04 Sep, 2015

1 commit

  • What uniquely identifies a glock in the glock hash table is not
    gl_name, but gl_name and its superblock pointer. This patch makes
    the gl_name field correspond to a unique glock identifier. That will
    allow us to simplify hashing with a future patch, since the hash
    algorithm can then take the gl_name and hash its components in one
    operation.
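
    After this change, the lock name embeds the superblock pointer
    (a sketch of the incore structure):

        struct lm_lockname {
                struct gfs2_sbd *ln_sbd;
                u64 ln_number;
                unsigned int ln_type;
        };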

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Steven Whitehouse

    Bob Peterson
     

19 Jun, 2015

2 commits

  • This patch allows the block allocation code to retain the buffers
    for the resource groups so they don't need to be re-read from buffer
    cache with every request. This is a performance improvement that's
    especially noticeable when resource groups are very large. For
    example, with 2GB resource groups and 4K blocks, there can be 33
    blocks for every resource group (32 bitmap blocks plus the rgrp
    header). This patch allows those 33 buffers to be kept around and
    not read in and thrown away with every
    operation. The buffers are released when the resource group is
    either synced or invalidated.

    Signed-off-by: Bob Peterson
    Reviewed-by: Steven Whitehouse
    Reviewed-by: Benjamin Marzinski

    Bob Peterson
     
  • The glocks used for resource groups often come and go hundreds of
    thousands of times per second. Adding them to the lru list just
    adds unnecessary contention for the lru_lock spin_lock, especially
    considering we're almost certainly going to re-use the glock and
    take it back off the lru microseconds later. We never want the
    glock shrinker to cull them anyway. This patch adds a new bit in
    the glops that determines which glock types get put onto the lru
    list and which ones don't.
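
    A hedged sketch of how the bit is used in the glops definitions:

        /* Glock types that belong on the lru set GLOF_LRU: */
        const struct gfs2_glock_operations gfs2_inode_glops = {
                ...
                .go_flags = GLOF_ASPACE | GLOF_LRU,
        };
        /* rgrp glops simply omit GLOF_LRU, so rgrp glocks skip the
         * lru list entirely. */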

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.
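
    In VFS terms, gfs2 wires up the new super_operations hooks roughly
    like so (a hedged sketch):

        static const struct super_operations gfs2_super_ops = {
                ...
                .freeze_super = gfs2_freeze,
                .thaw_super   = gfs2_unfreeze,
                ...
        };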

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski