05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
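
    The pattern removed by the treewide change above can be sketched in
    plain C. The commented-out macro shows the self-assignment form the
    GCC variant of uninitialized_var() is generally described as having;
    first_negative() is a hypothetical helper, not code from the tree.
    It illustrates the recommended alternative: give the variable a real
    initial value instead of silencing the warning.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Removed macro, shown for reference only (assumed GCC form):
 *
 *     #define uninitialized_var(x) x = x
 *
 * It initializes nothing; it only hides the compiler warning.
 */

/* Hypothetical helper: index of the first negative value, or -1. */
int first_negative(const int *vals, size_t n)
{
	int found = -1;	/* explicit initializer, no uninitialized_var(found) */
	size_t i;

	assert(vals != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (vals[i] < 0) {
			found = (int)i;
			break;
		}
	}
	return found;	/* well defined even when the loop never matches */
}
```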
     

08 Jul, 2020

1 commit

  • So far, gfs2 has taken the inode glocks inside the ->readpage and
    ->readahead address space operations. Since commit d4388340ae0b ("fs:
    convert mpage_readpages to mpage_readahead"), gfs2_readahead is passed
    the pages to read ahead locked. With that, the current holder of the
    inode glock may be trying to lock one of those pages while
    gfs2_readahead is trying to take the inode glock, resulting in a
    deadlock.

    Fix that by moving the lock taking to the higher-level ->read_iter file
    and ->fault vm operations. This also gets rid of an ugly lock inversion
    workaround in gfs2_readpage.

    The cache consistency model of filesystems like gfs2 is such that if
    data is found in the page cache, the data is up to date and can be used
    without taking any filesystem locks. If a page is not cached,
    filesystem locks must be taken before populating the page cache.

    To avoid taking the inode glock when the data is already cached,
    gfs2_file_read_iter first tries to read the data with the IOCB_NOIO flag
    set. If that fails, the inode glock is taken and the operation is
    retried with the IOCB_NOIO flag cleared.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
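
    The two-step read described in the commit above can be sketched as a
    small user-space model. All toy_* names are hypothetical stand-ins,
    not the gfs2 or VFS API; the boolean noio parameter plays the role of
    the IOCB_NOIO flag and lock_count stands in for taking the inode
    glock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy file: tracks whether data is cached and how often we locked. */
struct toy_file {
	bool cached;	/* data already in the "page cache"? */
	int lock_count;	/* times the "inode glock" was taken */
};

/* With noio set, never start I/O: report -EAGAIN on a cache miss. */
static int toy_generic_read(struct toy_file *f, bool noio)
{
	if (f->cached)
		return 0;
	if (noio)
		return -EAGAIN;
	f->cached = true;	/* "I/O" populates the cache */
	return 0;
}

int toy_read_iter(struct toy_file *f)
{
	int ret;

	assert(f != NULL);

	/* Fast path: cached data may be used without filesystem locks. */
	ret = toy_generic_read(f, true);
	if (ret != -EAGAIN)
		return ret;

	/* Slow path: take the lock, then retry with I/O allowed. */
	f->lock_count++;
	return toy_generic_read(f, false);
}
```

    In this model a second read of the same file stays on the fast path
    and leaves lock_count unchanged.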
     

03 Jul, 2020

5 commits

  • Before this patch, some gfs2 code locked the freeze glock with the
    LM_FLAG_NOEXP (Do not freeze) flag, and some did not. We never want to
    freeze the freeze glock, so this patch makes it use LM_FLAG_NOEXP
    consistently.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, the freeze code in gfs2 specified GL_NOCACHE in
    several places. That's wrong because we always want to know the state
    of whether the file system is frozen.

    There was also a problem with freeze/thaw transitioning the glock from
    frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX
    to processes that request it in SH mode, unless GL_EXACT is specified.
    Therefore, the freeze/thaw code, which tried to reacquire the glock in
    SH mode would get the glock in EX mode, and miss the transition from EX
    to SH. That made it think the thaw had completed normally, but since the
    glock was still cached in EX, other nodes could not freeze again.

    This patch removes the GL_NOCACHE flag to allow the freeze glock to be
    cached. It also adds the GL_EXACT flag so the glock is fully transitioned
    from EX to SH, thereby allowing future freeze operations.

    Signed-off-by: Bob Peterson

    Bob Peterson
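
    The granting rule behind the GL_EXACT change above can be modeled in
    a few lines. This is only a sketch of the rule as the commit message
    states it; the enum, the TOY_EXACT flag, and toy_may_grant() are
    hypothetical and much simpler than the real glock state machine.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_UN, TOY_SH, TOY_EX };

#define TOY_EXACT 0x1u	/* stand-in for GL_EXACT */

/* Can the held state satisfy the request without a state transition? */
bool toy_may_grant(enum toy_state held, enum toy_state wanted,
		   unsigned int flags)
{
	assert(wanted != TOY_UN);	/* only lock requests are modeled */

	if (held == wanted)
		return true;
	/* A cached EX lock normally satisfies SH, unless exact is asked. */
	if (held == TOY_EX && wanted == TOY_SH && !(flags & TOY_EXACT))
		return true;
	return false;
}
```

    Without the exact flag, the SH request is satisfied by the cached EX
    lock and the EX to SH transition never happens, which is the thaw
    problem the commit describes.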
     
  • Before this patch, only read-write mounts would grab the freeze
    glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
    glock was never initialized. That meant requests to freeze, which
    request the glock in EX, were granted without any state transition.
    That meant you could mount a gfs2 file system, which is currently
    frozen on a different cluster node, in read-only mode.

    This patch makes read-only mounts lock the freeze glock in SH mode,
    which will block for file systems that are frozen on another node.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, function freeze_go_sync, called when promoting
    the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag.
    That's only set for read-write mounts. Read-only mounts don't use a
    journal, so the bit is never set, so the freeze never happened.

    This patch removes the check for SDF_JOURNAL_LIVE for freeze requests
    but still checks it when deciding whether to flush a journal.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • In several places, we used the GIF_ORDERED inode flag to determine
    if an inode was on the ordered writes list. However, since we always
    held the sd_ordered_lock spin_lock during the manipulation, we can
    just as easily check list_empty(&ip->i_ordered) instead.
    This allows us to keep more than one ordered writes list to make
    journal writing improvements.

    This patch eliminates GIF_ORDERED in favor of checking list_empty.

    Signed-off-by: Bob Peterson

    Bob Peterson
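
    The idea of replacing a flag with a list_empty() check can be
    sketched with a minimal circular list. The toy_* helpers imitate the
    shape of the kernel list API but are hypothetical user-space code,
    and the sketch leaves out the sd_ordered_lock locking that the real
    code relies on.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_list { struct toy_list *next, *prev; };

void toy_list_init(struct toy_list *l)
{
	l->next = l;
	l->prev = l;
}

bool toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	assert(entry != head);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink and re-initialize, so the entry reads as "empty" again. */
void toy_list_del_init(struct toy_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	toy_list_init(entry);
}

struct toy_inode { struct toy_list i_ordered; };

/* Membership test that needs no separate flag bit. */
bool toy_inode_is_ordered(const struct toy_inode *ip)
{
	return !toy_list_empty(&ip->i_ordered);
}
```

    Because membership is derived from the entry itself, the same test
    works no matter which list head the inode is queued on.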
     

30 Jun, 2020

3 commits

  • In flush_delete_work, instead of flushing each individual pending
    delayed work item, cancel and re-queue them for immediate execution.
    The waiting isn't needed here because we're already waiting for all
    queued work items to complete in gfs2_flush_delete_work. This makes the
    code more efficient, but more importantly, it avoids sleeping during a
    rhashtable walk, inside rcu_read_lock().

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • Log flush operations (gfs2_log_flush()) can target a specific transaction.
    But if the function encounters errors (e.g. io errors) and withdraws,
    the transaction was only freed if it was queued to one of the ail lists.
    If the withdraw occurred before the transaction was queued to the ail1
    list, function ail_drain never freed it. The result was:

    BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown()

    This patch makes log_flush() add the targeted transaction to the ail1
    list so that function ail_drain() will find and free it properly.

    Cc: stable@vger.kernel.org # v5.7+
    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error).
    Commit b66648ad6dcf caused it to return NULL instead of ERR_PTR(-ESTALE) in
    some cases. Fix that.

    Reported-by: Dan Carpenter
    Fixes: b66648ad6dcf ("gfs2: Move inode generation number check into gfs2_inode_lookup")
    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
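
    Why a NULL return is a problem for callers that expect
    ERR_PTR(error) can be shown with a user-space imitation of the
    error-pointer convention. The toy_* names and TOY_MAX_ERRNO are
    hypothetical re-creations for illustration, not the kernel's
    definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *toy_err_ptr(long error)
{
	assert(error < 0 && error >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_ino { int ino; };
static struct toy_ino toy_the_inode = { 7 };

/* Hypothetical lookup: a stale handle yields an error pointer. */
void *toy_lookup(bool stale)
{
	if (stale)
		return toy_err_ptr(-ESTALE);
	return &toy_the_inode;
}
```

    In this scheme NULL is not an error pointer, so a caller that only
    checks toy_is_err() would go on to dereference it.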
     

09 Jun, 2020

1 commit

  • Pull gfs2 updates from Andreas Gruenbacher:

    - An iopen glock locking scheme rework that speeds up deletes of inodes
    accessed from multiple nodes

    - Various bug fixes and debugging improvements

    - Convert gfs2-glocks.txt to ReST

    * tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    gfs2: fix use-after-free on transaction ail lists
    gfs2: new slab for transactions
    gfs2: initialize transaction tr_ailX_lists earlier
    gfs2: Smarter iopen glock waiting
    gfs2: Wake up when setting GLF_DEMOTE
    gfs2: Check inode generation number in delete_work_func
    gfs2: Move inode generation number check into gfs2_inode_lookup
    gfs2: Minor gfs2_lookup_by_inum cleanup
    gfs2: Try harder to delete inodes locally
    gfs2: Give up the iopen glock on contention
    gfs2: Turn gl_delete into a delayed work
    gfs2: Keep track of deleted inode generations in LVBs
    gfs2: Allow ASPACE glocks to also have an lvb
    gfs2: instrumentation wrt log_flush stuck
    gfs2: introduce new gfs2_glock_assert_withdraw
    gfs2: print mapping->nrpages in glock dump for address space glocks
    gfs2: Only do glock put in gfs2_create_inode for free inodes
    gfs2: Allow lock_nolock mount to specify jid=X
    gfs2: Don't ignore inode write errors during inode_go_sync
    docs: filesystems: convert gfs2-glocks.txt to ReST

    Linus Torvalds
     

06 Jun, 2020

16 commits

  • Pull ext4 updates from Ted Ts'o:
    "A lot of bug fixes and cleanups for ext4, including:

    - Fix performance problems found in dioread_nolock now that it is the
    default, caused by transaction leaks.

    - Clean up fiemap handling in ext4

    - Clean up and refactor multiple block allocator (mballoc) code

    - Fix a problem with mballoc with smaller file systems running out
    of blocks because they couldn't properly use blocks that had been
    reserved by inode preallocation.

    - Fixed a race in ext4_sync_parent() versus rename()

    - Simplify the error handling in the extent manipulation code

    - Make sure all metadata I/O errors are reflected to
    ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

    - Avoid passing an error pointer to brelse in ext4_xattr_set()

    - Fix race which could result in freeing an inode on the dirty list
    in data=journal mode.

    - Fix refcount handling if ext4_iget() fails

    - Fix a crash in generic/019 caused by a corrupted extent node"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
    ext4: avoid unnecessary transaction starts during writeback
    ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
    ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
    fs: remove the access_ok() check in ioctl_fiemap
    fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
    fs: move fiemap range validation into the file systems instances
    iomap: fix the iomap_fiemap prototype
    fs: move the fiemap definitions out of fs.h
    fs: mark __generic_block_fiemap static
    ext4: remove the call to fiemap_check_flags in ext4_fiemap
    ext4: split _ext4_fiemap
    ext4: fix fiemap size checks for bitmap files
    ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
    add comment for ext4_dir_entry_2 file_type member
    jbd2: avoid leaking transaction credits when unreserving handle
    ext4: drop ext4_journal_free_reserved()
    ext4: mballoc: use lock for checking free blocks while retrying
    ext4: mballoc: refactor ext4_mb_good_group()
    ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
    ext4: mballoc: refactor ext4_mb_discard_preallocations()
    ...

    Linus Torvalds
     
  • Andreas Gruenbacher
     
  • Before this patch, transactions could be merged into the system
    transaction by function gfs2_merge_trans(), but the transaction ail
    lists were never merged. Because the ail flushing mechanism can run
    separately, bd elements can be attached to the transaction's buffer
    list during the transaction (trans_add_meta, etc) but quickly moved
    to its ail lists. Later, gfs2_trans_end can free the transaction
    while it still has bd elements
    queued to its ail lists, which can cause it to either lose track of
    the bd elements altogether (memory leak) or worse, reference the bd
    elements after the parent transaction has been freed.

    Although I've not seen any serious consequences, the problem becomes
    apparent with the previous patch's addition of:

    gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list));

    to function gfs2_trans_free().

    This patch adds logic into gfs2_merge_trans() to move the merged
    transaction's ail lists to the sdp transaction. This prevents the
    use-after-free. To do this properly, we need to hold the ail lock,
    so we pass sdp into the function instead of the transaction itself.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
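
    Moving one transaction's ail lists onto another, as the commit above
    describes, amounts to a list splice. The following is a user-space
    sketch with hypothetical toy_* helpers; it does not take any lock,
    whereas the commit notes that the real code must hold the ail lock
    while doing this.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_list { struct toy_list *next, *prev; };

void toy_list_init(struct toy_list *l)
{
	l->next = l;
	l->prev = l;
}

bool toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

void toy_list_add_tail(struct toy_list *entry, struct toy_list *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Append every entry of @from to @to, leaving @from empty. */
void toy_list_splice_tail_init(struct toy_list *from, struct toy_list *to)
{
	struct toy_list *first, *last;

	assert(from != to);
	if (toy_list_empty(from))
		return;

	first = from->next;
	last = from->prev;
	first->prev = to->prev;
	to->prev->next = first;
	last->next = to;
	to->prev = last;
	toy_list_init(from);	/* the merged transaction owns nothing now */
}
```

    After the splice the source list is empty, so freeing its owner can
    no longer strand or dangle the moved entries.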
     
  • This patch adds a new slab for gfs2 transactions. That allows us to
    reduce kernel memory fragmentation and better organize data
    for analysis of vmcore dumps. A new centralized function is added to
    free the slab objects, and it exposes use-after-free by giving
    warnings if a transaction is freed while it still has bd elements
    attached to its buffers or ail lists. We make sure to initialize
    those transaction ail lists so we can check their integrity when freeing.

    At a later time, we should add a slab initialization function to
    make it more efficient, but for this initial patch I wanted to
    minimize the impact.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Since transactions may be freed shortly after they're created, before
    a log_flush occurs, we need to initialize their ail1 and ail2 lists
    earlier. Before this patch, the ail1 list was initialized in gfs2_log_flush().
    This moves the initialization to the point when the transaction is first
    created.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • When trying to upgrade the iopen glock from a shared to an exclusive lock in
    gfs2_evict_inode, abort the wait if there is contention on the corresponding
    inode glock: in that case, the inode must still be in active use on another
    node, and we're not guaranteed to get the iopen glock anytime soon.

    To make this work even better, when we notice contention on the iopen glock and
    we can't evict the corresponding inode and release the iopen glock immediately,
    poke the inode glock. The other node(s) trying to acquire the lock can then
    abort instead of timing out.

    Thanks to Heinz Mauelshagen for pointing out a locking bug in a previous
    version of this patch.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • Wake up the sdp->sd_async_glock_wait wait queue when setting the GLF_DEMOTE
    flag.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • In delete_work_func, if the iopen glock still has an inode attached,
    limit the inode lookup to that specific generation number: in the likely
    case that the inode was deleted on the node on which the inode's link
    count dropped to zero, we can skip verifying the on-disk block type and
    reading in the inode. The same applies if another node that had the
    inode open managed to delete the inode before us.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • Move the inode generation number check from gfs2_lookup_by_inum into
    gfs2_inode_lookup: gfs2_inode_lookup may be able to decide that an inode with
    the given inode generation number cannot exist without having to verify the
    block type or reading the inode from disk.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • Use a zero no_formal_ino instead of a NULL pointer to indicate that any inode
    generation number will qualify: a valid inode never has a zero no_formal_ino.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
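
    The difference between the two calling conventions in the cleanup
    above can be sketched as follows. Both functions are hypothetical
    illustrations, not the gfs2 code; the second relies on the commit's
    statement that a valid inode never has a zero no_formal_ino.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Old style: a NULL pointer means "any generation qualifies". */
bool toy_match_ptr(const uint64_t *wanted, uint64_t actual)
{
	return wanted == NULL || *wanted == actual;
}

/* New style: zero is the wildcard, so no pointer is needed. */
bool toy_match_val(uint64_t wanted, uint64_t actual)
{
	assert(actual != 0);	/* valid inodes never carry zero here */
	return wanted == 0 || wanted == actual;
}
```

    Passing the value directly removes one level of indirection and the
    need for callers to keep a variable around just to take its address.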
     
  • When an inode's link count drops to zero and the inode is cached on
    other nodes, the current behavior of gfs2 is to immediately give up and
    to rely on the other node(s) to delete the inode if there is iopen glock
    contention. This leads to resource group glock bouncing and the loss of
    caching. With the previous patches in place, we can fix that by not
    giving up immediately.

    When the inode is still open on other nodes, those nodes won't be able
    to evict the inode and give up the iopen glock. In that case, our lock
    conversion request will time out. The unlink system call will block for
    the duration of the iopen lock conversion request. We're also holding
    the inode glock in EX mode for an extended duration, so other nodes
    won't be able to make progress on the inode, either.

    This is worse than what we had before, but we can prevent other nodes
    from getting stuck by aborting our iopen locking request if there is
    contention on the inode glock. This will be the subject of a future
    patch.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • When there's contention on the iopen glock, it means that the link count
    of the corresponding inode has dropped to zero on a remote node which is
    now trying to delete the inode. In that case, try to evict the inode so
    that the iopen glock will be released, which will allow the remote node
    to do its job.

    When the inode is still open locally, the inode's reference count won't
    drop to zero and so we'll keep holding the inode and its iopen glock.
    The remote node will time out its request to grab the iopen glock, and
    when the inode is finally closed locally, we'll try to delete it
    ourselves.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • This requires flushing delayed work items in gfs2_make_fs_ro (which is called
    before unmounting a filesystem).

    When inodes are deleted and then recreated, pending gl_delete work items would
    have no effect because the inode generations will have changed, so we can
    cancel any pending gl_delete works before reusing iopen glocks.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
     
  • When deleting an inode, keep track of the generation of the deleted inode in
    the inode glock Lock Value Block (LVB). When trying to delete an inode
    remotely, check the last-known inode generation against the deleted inode
    generation to skip duplicate remote deletes. This avoids taking the resource
    group glock in order to verify the block type.

    Signed-off-by: Andreas Gruenbacher

    Andreas Gruenbacher
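
    The duplicate-delete check described above can be modeled with a
    tiny shared record. This is a sketch under assumptions: the struct,
    its field name, and both functions are hypothetical, and the "less
    than or equal" comparison assumes that generation numbers of a given
    inode only ever grow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy lock value block shared between nodes. */
struct toy_lvb {
	uint64_t generation_deleted;	/* highest generation known deleted */
};

/* Called by the node that actually deleted the inode. */
void toy_remember_delete(struct toy_lvb *lvb, uint64_t generation)
{
	assert(lvb != NULL);
	if (generation > lvb->generation_deleted)
		lvb->generation_deleted = generation;
}

/* Called before a remote delete: true means the work is already done. */
bool toy_already_deleted(const struct toy_lvb *lvb, uint64_t last_known)
{
	assert(lvb != NULL);
	return last_known <= lvb->generation_deleted;
}
```

    When the check returns true, the caller can stop early without any
    further locking, which corresponds to skipping the resource group
    glock in the commit message.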
     
  • Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • This adds checks for gfs2_log_flush being stuck, similarly to the check
    in gfs2_ail1_flush. To facilitate this and make the strings easy to grep
    we move the ail1 emptying to its own function, empty_ail1_list.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

05 Jun, 2020

2 commits


04 Jun, 2020

1 commit


03 Jun, 2020

5 commits

  • Before this patch, the error path of function gfs2_create_inode would
    always call gfs2_glock_put for the inode glock. That's good for inodes
    that are free. But after they've been added to the vfs inodes, errors
    will cause the inode to be evicted, and the evict will do the glock
    put for us. If we do a glock put again, we can try to free the glock
    while there are still references to it, e.g. revokes pending for
    the transaction that created it.

    This patch adds a check: if (free_vfs_inode) before the put, thus
    solving the problem.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Michael Kelley [hyperv]
    Acked-by: Gao Xiang [erofs]
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Wei Liu
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Implement the new readahead aop and convert all callers (block_dev,
    exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
    reiserfs & udf).

    The callers are all trivial except for GFS2 & OCFS2.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Junxiao Bi # ocfs2
    Reviewed-by: Joseph Qi # ocfs2
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Before this patch, a simple typo accidentally added \n to the jid=
    string for lock_nolock mounts. This made it impossible to mount a
    gfs2 file system with a journal other than journal0. Thus:

    mount -tgfs2 -o hostdata="jid=1"

    Resulted in:
    mount: wrong fs type, bad option, bad superblock on

    In most cases this is not a problem. However, for debugging and
    testing purposes we sometimes want to test the integrity of other
    journals. This patch removes the unnecessary \n and thus allows
    lock_nolock users to specify an alternate journal.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • Before this patch, function inode_go_sync ignored io errors
    during inode_go_sync, overwriting them with metadata write errors:

    error = filemap_fdatawait(mapping);
    mapping_set_error(mapping, error);
    }
    error = filemap_fdatawait(metamapping);
    ...
    return error;

    So any errors returned by the inode write would be forgotten if the
    metadata write succeeded. This patch still does both writes, but
    only sets error if it's still zero. That way, any errors will be
    reported to the caller, do_xmote, which will take appropriate
    action and report the error.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
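
    The "only set error if it's still zero" rule from the commit above
    can be sketched in isolation. toy_sync_both() is a hypothetical
    helper that takes the two results as parameters instead of
    performing any writes; errors are assumed to be zero or negative
    errno values.

```c
#include <assert.h>
#include <errno.h>

/* Combine two step results, keeping the first non-zero error. */
int toy_sync_both(int data_err, int meta_err)
{
	int error, ret;

	assert(data_err <= 0 && meta_err <= 0);

	error = data_err;	/* result of the inode data write */
	ret = meta_err;		/* result of the metadata write */
	if (!error)
		error = ret;	/* only fill in if nothing failed earlier */
	return error;
}
```

    With this rule a data write error survives a successful metadata
    write, which is the case the old code lost.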
     

29 May, 2020

1 commit

  • Fix several issues in the previous gfs2_find_jhead fix:
    * When updating @blocks_submitted, @block refers to the first block not
    submitted yet, not the last block submitted, so fix an off-by-one error.
    * We want to ensure that @blocks_submitted is far enough ahead of @blocks_read
    to guarantee that there is in-flight I/O. Otherwise, we'll eventually end up
    waiting for pages that haven't been submitted, yet.
    * It's much easier to compare the number of blocks added with the number of
    blocks submitted to limit the maximum bio size.
    * Even with bio chaining, we can keep adding blocks until we reach the maximum
    bio size, as long as we stop at a page boundary. This simplifies the logic.

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Bob Peterson

    Andreas Gruenbacher
     

09 May, 2020

3 commits

  • This reverts commit df5db5f9ee112e76b5202fbc331f990a0fc316d6.

    This patch fixes a regression: patch df5db5f9ee112 allowed function
    run_queue() to bypass its call to do_xmote() if revokes were queued for
    the glock. That's wrong because its call to do_xmote() is what is
    responsible for calling the go_sync() glops functions to sync both
    the ail list and any revokes queued for it. By bypassing the call,
    gfs2 could get into a stand-off where the glock could not be demoted
    until its revokes are written back, but the revokes would not be
    written back because do_xmote() was never called.

    It "sort of" works, however, because there are other mechanisms like
    the log flush daemon (logd) that can sync the ail items and revokes,
    if it deems it necessary. The problem is: without file system pressure,
    it might never deem it necessary.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, if the go_sync operation returned an error during
    the do_xmote process (such as unable to sync metadata to the journal)
    the code did goto out. That kept the glock locked, so it could not be
    given away, which correctly avoids file system corruption. However,
    it never set the withdraw bit or requeued the glock work. So it would
    hang forever, unable to ever demote the glock.

    This patch changes the goto to a new label, skip_inval, so that errors
    from go_sync are treated the same way as errors from go_inval:
    The delayed withdraw bit is set and the work is requeued. That way,
    the logd should eventually figure out there's a problem and withdraw
    properly there.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     
  • This patch rearranges gfs2_add_revoke so that the extra glock
    reference is added earlier on in the function to avoid races in which
    the glock is freed before the new reference is taken.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher