14 Jan, 2016

1 commit

  • This patch fixes an error condition in which an inode is partially
    created in gfs2_create_inode() but then some error is discovered,
    which causes it to fail and call iput() before the iopen glock is
    created or held. In that case, gfs2_delete_inode would try to
    unlock an iopen glock that doesn't yet exist. Therefore, we test
    its holder (which must exist) for the HIF_HOLDER bit before trying
    to dq it.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

19 Dec, 2015

2 commits

  • In function gfs2_delete_inode() we write and flush the mapping for
    a glock, among other things. We truncate the mapping for the inode,
    but we never truncate the mapping for the glock. This patch makes it
    also truncate the metamapping. This avoid cases where the glock is
    reused by another process who is trying to recreate an inode in its
    place using the same block.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     
  • This patch changes every glock_dq for iopen glocks into a dq_wait.
    This makes sure that iopen glocks do not outlive the inode itself.
    In turn, that ensures that anyone trying to unlink the glock will
    be able to find the inode when it receives a remote iopen callback.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

15 Dec, 2015

4 commits

  • When gfs2 was unmounting filesystems or changing them to read-only it
    was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This
    caused a race. If an inode glock got demoted in the gap between
    clearing the bit and the shutdown flush, it would be unable to reserve
    log space to clear out the active items list in inode_go_sync, causing an
    error in inode_go_inval because the glock was still dirty.

    To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the
    shutdown log flush. This means that, because of the locking on the log
    blocks, either inode_go_sync will be able to reserve space to clean the
    glock before the shutdown flush, or the shutdown flush will clean the
    glock itself, before inode_go_sync fails to reserve the space. Either
    way, the glock will be clean before inode_go_inval.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     
  • gfs2 currently returns 31 bits of filename hash as a cookie that readdir
    uses for an offset into the directory. When there are a large number of
    directory entries, the likelihood of a collision goes up way too
    quickly. GFS2 will now return cookies that are guaranteed unique for a
    while, and then fail back to using 30 bits of filename hash.
    Specifically, the directory leaf blocks are divided up into chunks based
    on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
    cookie is based off the chunk where it starts, in the linked list of
    leaf blocks that it hashes to (there are 131072 hash buckets). Directory
    entries will have unique names until they take reach chunk 8192.
    Assuming the largest filenames possible, and the least efficient spacing
    possible, this new method will still be able to return unique names when
    the previous method has statistically more than a 99% chance of a
    collision. The non-unique names it fails back to are guaranteed to not
    collide with the unique names.

    unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "0"
    - 13 bits for the offset

    non-unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "1"
    - 13 more bits of the name hash

    Another benefit of location based cookies, is that once a directory's
    exhash table is fully extended (so that multiple hash table indexs do
    not use the same leaf blocks), gfs2 can skip sorting the directory
    entries until it reaches the non-unique ones, and then it only needs to
    sort these. This provides a significant speed up for directory reads of
    very large directories.

    The only issue is that for these cookies to continue to point to the
    correct entry as files are added and removed from the directory, gfs2
    must keep the entries at the same offset in the leaf block when they are
    split (see my previous patch). This means that until all the nodes in a
    cluster are running with code that will split the directory leaf blocks
    this way, none of the nodes can use the new cookie code. To deal with
    this, gfs2 now has the mount option loccookie, which, if set, will make
    it return these new location based cookies. This option must not be set
    until all nodes in the cluster are at least running this version of the
    kernel code, and you have guaranteed that there are no outstanding
    cookies required by other software, such as NFS.

    gfs2 uses some of the extra space at the end of the gfs2_dirent
    structure to store the calculated readdir cookies. This keeps us from
    needing to allocate a seperate array to hold these values. gfs2
    recomputes the cookie stored in de_cookie for every readdir call. The
    time it takes to do so is small, and if gfs2 expected this value to be
    saved on disk, the new code wouldn't work correctly on filesystems
    created with an earlier version of gfs2.

    One issue with adding de_cookie to the union in the gfs2_dirent
    structure is that it caused the union to align itself to a 4 byte
    boundary, instead of its previous 2 byte boundary. This changed the
    offset of de_rahead. To solve that, I pulled de_rahead out of the union,
    since it does not need to be there.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     
  • Before this patch, function update_statfs called gfs2_statfs_change_out
    to update the master statfs buffer without the sd_statfs_spin held.
    In theory, another process could call gfs2_statfs_sync, which takes
    the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's
    a theoretical timing window in which one process could write the
    master statfs buffer, then another comes along and re-reads it, wiping
    out the changes.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, multi-block reservation structures were allocated
    from a special slab. This patch folds the structure into the gfs2_inode
    structure. The disadvantage is that the gfs2_inode needs more memory,
    even when a file is opened read-only. The advantages are: (a) we don't
    need the special slab and the extra time it takes to allocate and
    deallocate from it. (b) we no longer need to worry that the structure
    exists for things like quota management. (c) This also allows us to
    remove the calls to get_write_access and put_write_access since we
    know the structure will exist.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

24 Nov, 2015

1 commit

  • This patch basically reverts the majority of patch 5407e24.
    That patch eliminated the gfs2_qadata structure in favor of just
    using the reservations structure. The problem with doing that is that
    it increases the size of the reservations structure. That is not an
    issue until it comes time to fold the reservations structure into the
    inode in memory so we know it's always there. By separating out the
    quota structure again, we aren't punishing the non-quota users by
    making all the inodes bigger, requiring more slab space. This patch
    creates a new slab area to allocate the quota stuff so it's managed
    a little more sanely.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

17 Nov, 2015

1 commit

  • When gfs2 allocates an inode and its extended attribute block next to
    each other at inode create time, the inode's directory entry indicates
    that in de_rahead. In that case, we can readahead the extended
    attribute block when we read in the inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

05 Sep, 2015

1 commit

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

02 Jun, 2015

1 commit

  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bandwidth related fields from backing_dev_info into
    bdi_writeback.

    * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
    write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
    balanced_dirty_ratelimit, completions and dirty_exceeded.

    * writeback_chunk_size() and over_bground_thresh() now take @wb
    instead of @bdi.

    * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
    bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
    bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
    bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
    [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
    bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
    bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

    * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
    respectively. Note that explicit zeroing is dropped in the process
    as wb's are cleared in entirety anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
    introducing no behavior changes.

    v2: Typo in description fixed as suggested by Jan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Apr, 2015

1 commit


21 Jan, 2015

1 commit

  • Now that we got rid of the bdi abuse on character devices we can always use
    sb->s_bdi to get at the backing_dev_info for a file, except for the block
    device special case. Export inode_to_bdi and replace uses of
    mapping->backing_dev_info with it to prepare for the removal of
    mapping->backing_dev_info.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simply the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

21 Aug, 2014

1 commit


16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps'. (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since first version of this patch (against 3.15) two new action
    functions appeared, on in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In it's place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    the and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesytem simply involes dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

05 Apr, 2014

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     
  • Pull GFS2 updates from Steven Whitehouse:
    "One of the main highlights this time, is not the patches themselves
    but instead the widening contributor base. It is good to see that
    interest is increasing in GFS2, and I'd like to thank all the
    contributors to this patch set.

    In addition to the usual set of bug fixes and clean ups, there are
    patches to improve inode creation performance when xattrs are required
    and some improvements to the transaction code which is intended to
    help improve scalability after further changes in due course.

    Journal extent mapping is also updated to make it more efficient and
    again, this is a foundation for future work in this area.

    The maximum number of ACLs has been increased to 300 (for a 4k block
    size) which means that even with a few additional xattrs from selinux,
    everything should fit within a single fs block.

    There is also a patch to bring GFS2's own copy of the writepages code
    up to the same level as the core VFS. Eventually we may be able to
    merge some of this code, since it is fairly similar.

    The other major change this time, is bringing consistency to the
    printing of messages via fs_, pr_ macros"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
    GFS2: Fix address space from page function
    GFS2: Fix uninitialized VFS inode in gfs2_create_inode
    GFS2: Fix return value in slot_get()
    GFS2: inline function gfs2_set_mode
    GFS2: Remove extraneous function gfs2_security_init
    GFS2: Increase the max number of ACLs
    GFS2: Re-add a call to log_flush_wait when flushing the journal
    GFS2: Ensure workqueue is scheduled after noexp request
    GFS2: check NULL return value in gfs2_ok_to_move
    GFS2: Convert gfs2_lm_withdraw to use fs_err
    GFS2: Use fs_ more often
    GFS2: Use pr_ more consistently
    GFS2: Move recovery variables to journal structure in memory
    GFS2: global conversion to pr_foo()
    GFS2: return -E2BIG if hit the maximum limits of ACLs
    GFS2: Clean up journal extent mapping
    GFS2: replace kmalloc - __vmalloc / memset 0
    GFS2: Remove extra "if" in gfs2_log_flush()
    fs: NULL dereference in posix_acl_to_xattr()
    GFS2: Move log buffer accounting to transaction
    ...

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Mar, 2014

1 commit

  • When gfs2_create_inode() fails due to quota violation, the VFS
    inode is not completely uninitialized. This can cause a list
    corruption error.

    This patch correctly uninitializes the VFS inode when a quota
    violation occurs in the gfs2_create_inode codepath.

    Resolves: rhbz#1059808
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhi Das
     

13 Mar, 2014

1 commit

  • Previously, the no-op "mount -o mount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    documented or guaraunteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
     

07 Mar, 2014

2 commits

  • Add pr_fmt, remove embedded "GFS2: " prefixes.
    This now consistently emits lower case "gfs2: " for each message.

    Other miscellanea around these changes:

    o Add missing newlines
    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: Steven Whitehouse

    Joe Perches
     
  • -All printk(KERN_foo converted to pr_foo().
    -Messages updated to fit in 80 columns.
    -fs_macros converted as well.
    -fs_printk removed.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Steven Whitehouse

    Fabian Frederick
     

03 Mar, 2014

1 commit

  • This patch fixes a long standing issue in mapping the journal
    extents. Most journals will consist of only a single extent,
    and although the cache took account of that by merging extents,
    it did not actually map large extents, but instead was doing a
    block by block mapping. Since the journal was only being mapped
    on mount, this was not normally noticeable.

    With the updated code, it is now possible to use the same extent
    mapping system during journal recovery (which will be added in a
    later patch). This will allow checking of the integrity of the
    journal before any reply of the journal content is attempted. For
    this reason the code is moving to bmap.c, since it will be used
    more widely in due course.

    An exercise left for the reader is to compare the new function
    gfs2_map_journal_extents() with gfs2_write_alloc_required()

    Additionally, should there be a failure, the error reporting is
    also updated to show more detail about what went wrong.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Jan, 2014

1 commit

  • While investigating a rather strange bit of code in the quota
    clean up function, I spotted that the reason for its existence
    was that when remounting read only, we were not stopping the
    quotad thread, and thus it was possible for it to still have
    a reference to some of the quotas in that case.

    This patch moves the logd and quota thread start and stop into
    the make_fs_rw/ro functions, so that we now stop those threads
    when mounted read only.

    This means that quotad will always be stopped before we call
    the quota clean up function, and we can thus dispose of the
    (rather hackish) code that waits for it to give up its
    reference on the quotas.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     

27 Sep, 2013

1 commit

  • The reservation for an inode should be cleared when it is truncated so
    that we can start again at a different offset for future allocations.
    We could try and do better than that, by resetting the search based on
    where the truncation started from, but this is only a first step.

    In addition, there are three callers of gfs2_rs_delete() but only one
    of those should really be testing the value of i_writecount. While
    we get away with that in the other cases currently, I think it would
    be better if we made that test specific to the one case which
    requires it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

03 Jun, 2013

1 commit

  • This patch makes GFS2 immediately reclaim/delete all iopen glocks
    as soon as they're dequeued. This allows deleters to get an
    EXclusive lock on iopen so files are deleted properly instead of
    being set as unlinked.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

08 Apr, 2013

1 commit


26 Feb, 2013

1 commit

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhnacements to the user
    namespace. reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before it they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

13 Feb, 2013

2 commits


29 Jan, 2013

3 commits

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code used to conflict in its assumptions
    about how to write out the disk blocks with mpage_writepages()
    so that with this updated version we can also use mpage_writepages()
    for GFS2's ordered write, writepages implementation. So we will
    also send larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The freeze code has not been looked at a lot recently. Upstream has
    moved on, and this is an attempt to catch us back up again. There
    is a vfs level interface for the freeze code which can be called
    from our (obsolete, but kept for backward compatibility purposes)
    sysfs freeze interface. This means freezing this way vs. doing it
    from the ioctl should now work in identical fashion.

    As a result of this, the freeze function is only called once
    and we can drop our own special purpose code for counting the
    number of freezes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There is little common content in gfs2_trans_add_bh() between the data
    and meta classes by the time that the functions which it calls are
    taken into account. The intent here is to split this into two
    separate functions. Stage one is to introduce gfs2_trans_add_data()
    and gfs2_trans_add_meta() and update the callers accordingly.

    Later patches will then pull in the content of gfs2_trans_add_bh()
    and its dependent functions in order to clean up the code in this
    area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

07 Nov, 2012

1 commit

  • file_accessed() was being called by gfs2_mmap() with a shared glock. If it
    needed to update the atime, it was crashing because it dirtied the inode in
    gfs2_dirty_inode() without holding an exclusive lock. gfs2_dirty_inode()
    checked if the caller was already holding a glock, but it didn't make sure that
    the glock was in the exclusive state. Now, instead of calling file_accessed()
    while holding the shared lock in gfs2_mmap(), file_accessed() is called after
    grabbing and releasing the glock to update the inode. If file_accessed() needs
    to update the atime, it will grab an exclusive lock in gfs2_dirty_inode().

    gfs2_dirty_inode() now also checks to make sure that if the calling process has
    already locked the glock, it has an exclusive lock.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky leading to confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use new IRQ safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide interface
    and behave like timer which is executed with process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]_work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
     

24 Sep, 2012

3 commits

  • If a dirty GFS2 inode was being deleted but was in use by another node, its
    metadata was not getting written out before GFS2 checked for dirty buffers in
    gfs2_ail_flush(). GFS2 was relying on inode_go_sync() to write out the
    metadata when the other node tried to free the file, but it failed the error
    check before it got that far. This patch writes out the metadata before calling
    gfs2_ail_flush()

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • The ->show_options() function for GFS2 was not correctly displaying
    the value when statfs slow in in use.

    Signed-off-by: Steven Whitehouse
    Reported-by: Milos Jakubicek

    Steven Whitehouse
     
  • This patch introduces a new structure, gfs2_rbm, which is a
    tuple of a resource group, a bitmap within the resource group
    and an offset within that bitmap. This is designed to make
    manipulating these sets of variables easier. There is also a
    new helper function which converts this representation back
    to a disk block address.

    In addition, the rbtree nodes which are used for the reservations
    were not being correctly initialised, which is now fixed. Also,
    the tracing was not passing through the inode where it should
    have been. That is mostly fixed aside from one corner case. This
    needs to be revisited since there can also be a NULL rgrp in
    some cases which results in the device being incorrect in the
    trace.

    This is intended to be the first step towards cleaning up some
    of the allocation code, and some further bug fixes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse