15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

    sed -i -e 's/\/SB_RDONLY/g' \
    -e 's/\/SB_NOSUID/g' \
    -e 's/\/SB_NODEV/g' \
    -e 's/\/SB_NOEXEC/g' \
    -e 's/\/SB_SYNCHRONOUS/g' \
    -e 's/\/SB_MANDLOCK/g' \
    -e 's/\/SB_DIRSYNC/g' \
    -e 's/\/SB_NOATIME/g' \
    -e 's/\/SB_NODIRATIME/g' \
    -e 's/\/SB_SILENT/g' \
    -e 's/\/SB_POSIXACL/g' \
    -e 's/\/SB_KERNMOUNT/g' \
    -e 's/\/SB_I_VERSION/g' \
    -e 's/\/SB_LAZYTIME/g' \
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save a
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

25 Aug, 2017

1 commit

  • Before this patch, if GFS2 encountered IO errors while writing to
    the journal, it would not report the problem, so they would go
    unnoticed, sometimes for many hours. Sometimes this would only be
    noticed later, when recovery tried to do journal replay and failed
    due to invalid metadata at the blocks that resulted in IO errors.

    This patch makes GFS2's log daemon check for IO errors. If it
    encounters one, it withdraws from the file system and reports
    why in dmesg. A similar action is taken when IO errors occur when
    writing to the system statfs file.

    These errors are also reported back to any callers of fsync, since
    that requires the journal to be flushed. Therefore, any IO errors
    that would previously go unnoticed are now noticed and the file
    system is withdrawn as early as possible, thus preventing further
    file system damage.

    Also note that this reintroduces superblock variable sd_log_error,
    which Christoph removed with commit f729b66fca.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

10 Aug, 2017

2 commits

  • When under memory pressure and an inode's link count has dropped to
    zero, defer deleting the inode to the delete workqueue. This avoids
    calling into DLM under memory pressure, which can deadlock.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • gfs2_evict_inode is called to free inodes under memory pressure. The
    function calls into DLM when an inode's last cluster-wide reference goes
    away (remote unlink) and to release the glock and associated DLM lock
    before finally destroying the inode. However, if DLM is blocked on
    memory to become available, calling into DLM again will deadlock.

    Avoid that by decoupling releasing glocks from destroying inodes in that
    case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously
    in work queue context, when the associated inodes have likely already
    been destroyed.

    With this change, inodes can end up being unlinked, remote-unlink can be
    triggered, and then the inode can be reallocated before all
    remote-unlink callbacks are processed. To detect that, revalidate the
    link count in gfs2_evict_inode to make sure we're not deleting an
    allocated, referenced inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

09 Aug, 2017

3 commits


21 Jul, 2017

1 commit

  • This patch introduces a new helper function in glock.h that
    clears gl_object, with an added integrity check. An additional
    integrity check has been added to glock_set_object, plus comments.
    This is step 1 in a series to ensure gl_object integrity.

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     

17 Jul, 2017

1 commit

  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
     

05 Jul, 2017

3 commits

  • On failure, keep the inode glock across the final iput of the new inode
    so that gfs2_evict_inode doesn't have to re-acquire the glock. That
    way, gfs2_evict_inode won't need to revalidate the block type.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • Put all remaining accesses to gl->gl_object under the
    gl->gl_lockref.lock spinlock to prevent races.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     
  • So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
    work queue to make sure that inode glops which dereference gl->gl_object
    have finished running before the inode is destroyed. However, flushing
    the work queue may do more work than needed, and in particular, it may
    call into DLM, which we want to avoid here. Use a bit lock
    (GIF_GLOP_PENDING) to synchronize between the inode glops and
    gfs2_evict_inode instead to get rid of the flushing.

    In addition, flush the work queues of existing glocks before reusing
    them for new inodes to get those glocks into a known state: the glock
    state engine currently doesn't handle glock re-appropriation correctly.
    (We may be able to fix the glock state engine instead later.)

    Based on a patch by Steven Whitehouse .

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

03 Apr, 2017

1 commit

  • Revert commit 86d067a797d4e8546a7c92b985f31e8cd3ec39ad: it turns out
    that waiting for iopen glock dequeues here isn't needed anymore because
    the bugs that commit was meant to fix have been fixed otherwise.

    In addition, we want to avoid waiting on glocks in gfs2_evict_inode in
    shrinker context because the shrinker may be invoked on behalf of DLM,
    in which case calling into DLM again would deadlock. This commit makes
    the described scenario less likely without completely avoiding it; it's
    still a step in the right direction, though.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

16 Mar, 2017

1 commit

  • When the GFS2 file system withdraws due to metadata corruption, it
    often has outstanding transactions in the journal and delayed work
    queued for its glocks. This patch adds some new checks for a
    withdrawn file system before proceeding with operations that would
    obviously cause a BUG() to be triggered. That allows GFS2 to be
    safely unmounted rather than cause the system to go down.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

02 Mar, 2017

1 commit


03 Aug, 2016

1 commit

  • Replace 1 << value shift by more explicit BIT() macro

    Also fixes two bare unsigned definitions:

    WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
    + unsigned hsize = BIT(ip->i_depth);

    Signed-off-by: Fabian Frederick
    Signed-off-by: Bob Peterson

    Fabian Frederick
     

27 Jun, 2016

1 commit

  • Make the code more readable by cleaning up the different ways of
    initializing lock holders and checking for initialized lock holders:
    mark lock holders as uninitialized by setting the holder's glock to NULL
    (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
    object or using a separate flag. Recognize initialized holders by their
    non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects
    which are immeditiately initialized via gfs2_holder_init or
    gfs2_glock_nq_init.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

11 Apr, 2016

1 commit


14 Jan, 2016

1 commit

  • This patch fixes an error condition in which an inode is partially
    created in gfs2_create_inode() but then some error is discovered,
    which causes it to fail and call iput() before the iopen glock is
    created or held. In that case, gfs2_delete_inode would try to
    unlock an iopen glock that doesn't yet exist. Therefore, we test
    its holder (which must exist) for the HIF_HOLDER bit before trying
    to dq it.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

19 Dec, 2015

2 commits

  • In function gfs2_delete_inode() we write and flush the mapping for
    a glock, among other things. We truncate the mapping for the inode,
    but we never truncate the mapping for the glock. This patch makes it
    also truncate the metamapping. This avoid cases where the glock is
    reused by another process who is trying to recreate an inode in its
    place using the same block.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     
  • This patch changes every glock_dq for iopen glocks into a dq_wait.
    This makes sure that iopen glocks do not outlive the inode itself.
    In turn, that ensures that anyone trying to unlink the glock will
    be able to find the inode when it receives a remote iopen callback.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

15 Dec, 2015

4 commits

  • When gfs2 was unmounting filesystems or changing them to read-only it
    was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This
    caused a race. If an inode glock got demoted in the gap between
    clearing the bit and the shutdown flush, it would be unable to reserve
    log space to clear out the active items list in inode_go_sync, causing an
    error in inode_go_inval because the glock was still dirty.

    To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
    shutdown log flush. This means that, because of the locking on the log
    blocks, either inode_go_sync will be able to reserve space to clean the
    glock before the shutdown flush, or the shutdown flush will clean the
    glock itself, before inode_go_sync fails to reserve the space. Either
    way, the glock will be clean before inode_go_inval.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     
  • gfs2 currently returns 31 bits of filename hash as a cookie that readdir
    uses for an offset into the directory. When there are a large number of
    directory entries, the likelihood of a collision goes up way too
    quickly. GFS2 will now return cookies that are guaranteed unique for a
    while, and then fail back to using 30 bits of filename hash.
    Specifically, the directory leaf blocks are divided up into chunks based
    on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
    cookie is based off the chunk where it starts, in the linked list of
    leaf blocks that it hashes to (there are 131072 hash buckets). Directory
    entries will have unique names until they take reach chunk 8192.
    Assuming the largest filenames possible, and the least efficient spacing
    possible, this new method will still be able to return unique names when
    the previous method has statistically more than a 99% chance of a
    collision. The non-unique names it fails back to are guaranteed to not
    collide with the unique names.

    unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "0"
    - 13 bits for the offset

    non-unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "1"
    - 13 more bits of the name hash

    Another benefit of location based cookies, is that once a directory's
    exhash table is fully extended (so that multiple hash table indexs do
    not use the same leaf blocks), gfs2 can skip sorting the directory
    entries until it reaches the non-unique ones, and then it only needs to
    sort these. This provides a significant speed up for directory reads of
    very large directories.

    The only issue is that for these cookies to continue to point to the
    correct entry as files are added and removed from the directory, gfs2
    must keep the entries at the same offset in the leaf block when they are
    split (see my previous patch). This means that until all the nodes in a
    cluster are running with code that will split the directory leaf blocks
    this way, none of the nodes can use the new cookie code. To deal with
    this, gfs2 now has the mount option loccookie, which, if set, will make
    it return these new location based cookies. This option must not be set
    until all nodes in the cluster are at least running this version of the
    kernel code, and you have guaranteed that there are no outstanding
    cookies required by other software, such as NFS.

    gfs2 uses some of the extra space at the end of the gfs2_dirent
    structure to store the calculated readdir cookies. This keeps us from
    needing to allocate a seperate array to hold these values. gfs2
    recomputes the cookie stored in de_cookie for every readdir call. The
    time it takes to do so is small, and if gfs2 expected this value to be
    saved on disk, the new code wouldn't work correctly on filesystems
    created with an earlier version of gfs2.

    One issue with adding de_cookie to the union in the gfs2_dirent
    structure is that it caused the union to align itself to a 4 byte
    boundary, instead of its previous 2 byte boundary. This changed the
    offset of de_rahead. To solve that, I pulled de_rahead out of the union,
    since it does not need to be there.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     
  • Before this patch, function update_statfs called gfs2_statfs_change_out
    to update the master statfs buffer without the sd_statfs_spin held.
    In theory, another process could call gfs2_statfs_sync, which takes
    the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
    a theoretical timing window in which one process could write the
    master statfs buffer, then another comes along and re-reads it, wiping
    out the changes.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, multi-block reservation structures were allocated
    from a special slab. This patch folds the structure into the gfs2_inode
    structure. The disadvantage is that the gfs2_inode needs more memory,
    even when a file is opened read-only. The advantages are: (a) we don't
    need the special slab and the extra time it takes to allocate and
    deallocate from it. (b) we no longer need to worry that the structure
    exists for things like quota management. (c) This also allows us to
    remove the calls to get_write_access and put_write_access since we
    know the structure will exist.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

24 Nov, 2015

1 commit

  • This patch basically reverts the majority of patch 5407e24.
    That patch eliminated the gfs2_qadata structure in favor of just
    using the reservations structure. The problem with doing that is that
    it increases the size of the reservations structure. That is not an
    issue until it comes time to fold the reservations structure into the
    inode in memory so we know it's always there. By separating out the
    quota structure again, we aren't punishing the non-quota users by
    making all the inodes bigger, requiring more slab space. This patch
    creates a new slab area to allocate the quota stuff so it's managed
    a little more sanely.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

17 Nov, 2015

1 commit

  • When gfs2 allocates an inode and its extended attribute block next to
    each other at inode create time, the inode's directory entry indicates
    that in de_rahead. In that case, we can readahead the extended
    attribute block when we read in the inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

05 Sep, 2015

1 commit

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

02 Jun, 2015

1 commit

  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Apr, 2015

1 commit


21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simply the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

21 Aug, 2014

1 commit


16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps'. (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since first version of this patch (against 3.15) two new action
    functions appeared, on in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In it's place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    the and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesytem simply involes dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

05 Apr, 2014

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     
  • Pull GFS2 updates from Steven Whitehouse:
    "One of the main highlights this time, is not the patches themselves
    but instead the widening contributor base. It is good to see that
    interest is increasing in GFS2, and I'd like to thank all the
    contributors to this patch set.

    In addition to the usual set of bug fixes and clean ups, there are
    patches to improve inode creation performance when xattrs are required
    and some improvements to the transaction code which is intended to
    help improve scalability after further changes in due course.

    Journal extent mapping is also updated to make it more efficient and
    again, this is a foundation for future work in this area.

    The maximum number of ACLs has been increased to 300 (for a 4k block
    size) which means that even with a few additional xattrs from selinux,
    everything should fit within a single fs block.

    There is also a patch to bring GFS2's own copy of the writepages code
    up to the same level as the core VFS. Eventually we may be able to
    merge some of this code, since it is fairly similar.

    The other major change this time, is bringing consistency to the
    printing of messages via fs_, pr_ macros"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
    GFS2: Fix address space from page function
    GFS2: Fix uninitialized VFS inode in gfs2_create_inode
    GFS2: Fix return value in slot_get()
    GFS2: inline function gfs2_set_mode
    GFS2: Remove extraneous function gfs2_security_init
    GFS2: Increase the max number of ACLs
    GFS2: Re-add a call to log_flush_wait when flushing the journal
    GFS2: Ensure workqueue is scheduled after noexp request
    GFS2: check NULL return value in gfs2_ok_to_move
    GFS2: Convert gfs2_lm_withdraw to use fs_err
    GFS2: Use fs_ more often
    GFS2: Use pr_ more consistently
    GFS2: Move recovery variables to journal structure in memory
    GFS2: global conversion to pr_foo()
    GFS2: return -E2BIG if hit the maximum limits of ACLs
    GFS2: Clean up journal extent mapping
    GFS2: replace kmalloc - __vmalloc / memset 0
    GFS2: Remove extra "if" in gfs2_log_flush()
    fs: NULL dereference in posix_acl_to_xattr()
    GFS2: Move log buffer accounting to transaction
    ...

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Mar, 2014

1 commit

  • When gfs2_create_inode() fails due to quota violation, the VFS
    inode is not completely uninitialized. This can cause a list
    corruption error.

    This patch correctly uninitializes the VFS inode when a quota
    violation occurs in the gfs2_create_inode codepath.

    Resolves: rhbz#1059808
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhi Das
     

13 Mar, 2014

1 commit

  • Previously, the no-op "mount -o mount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    documented or guaraunteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o