12 Jul, 2011

1 commit

  • There is a potential race during filesystem mounting which has recently
    been reported. It occurs when the userland gfs_controld is able to
    process requests fast enough that it tries to use the sysfs interface
    before the lock module is properly initialised. This is a pretty
    unusual case as normally the lock module initialisation is very quick
    compared with gfs_controld.

    This patch adds an interruptible completion which is used to ensure that
    userland will wait for the initialisation of the lock module to
    complete.

    There are other potential solutions to this problem, but this is the
    quickest at this stage and has been tested both with and without
    mount.gfs2 present in the system.

    Signed-off-by: Steven Whitehouse
    Reported-by: David Booher

    Steven Whitehouse
     

10 May, 2011

1 commit


06 Oct, 2010

1 commit


29 Sep, 2010

1 commit

  • Recently a feature was added to GFS2 to allow journal id allocation
    via sysfs. This patch builds upon that so that a negative journal id
    will be treated as an error code to be passed back as the return code
    from mount. This allows termination of the mount process if there is
    a failure.

    Also, the process has been updated so that the kernel will wait
    for a journal id, even in the "spectator" case. This is required
    in order to avoid mounting a filesystem in case there is an error
    while joining the cluster. In the spectator case, 0 is written into
    the file to indicate that all is well, and that mount should continue.
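
    The rule described above can be sketched in userspace C. This is only an illustrative model, not the kernel code: the function name `resolve_journal_id`, its parameters, and the choice of `-EINVAL` for a spectator writing a non-zero value are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of how the value written to the journal id file
 * could be interpreted. Returns 0 and stores the journal id on success,
 * or a negative error code that mount would pass back to the caller.
 */
int resolve_journal_id(int written, int spectator, int *jid_out)
{
    assert(jid_out != NULL);

    if (written < 0)
        return written;     /* negative value: treat it as the mount error */
    if (spectator && written != 0)
        return -EINVAL;     /* assumption: a spectator is only ever sent 0 */

    *jid_out = written;     /* mount may continue with this journal id */
    return 0;
}
```

    A caller would abort the mount whenever the return value is non-zero and use the stored journal id otherwise.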

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
    workqueue: mark init_workqueues() as early_initcall()
    workqueue: explain for_each_*cwq_cpu() iterators
    fscache: fix build on !CONFIG_SYSCTL
    slow-work: kill it
    gfs2: use workqueue instead of slow-work
    drm: use workqueue instead of slow-work
    cifs: use workqueue instead of slow-work
    fscache: drop references to slow-work
    fscache: convert operation to use workqueue instead of slow-work
    fscache: convert object to use workqueue instead of slow-work
    workqueue: fix how cpu number is stored in work->data
    workqueue: fix mayday_mask handling on UP
    workqueue: fix build problem on !CONFIG_SMP
    workqueue: fix locking in retry path of maybe_create_worker()
    async: use workqueue for worker pool
    workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
    workqueue: implement unbound workqueue
    workqueue: prepare for WQ_UNBOUND implementation
    libata: take advantage of cmwq and remove concurrency limitations
    workqueue: fix worker management invocation without pending works
    ...

    Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
    include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c

    Linus Torvalds
     

29 Jul, 2010

1 commit

  • This patch implements a wait for the journal id in the case that it has
    not been specified on the command line. This is to allow the future
    removal of the mount.gfs2 helper. The journal id would instead be
    directly communicated by gfs_controld to the file system. Here is a
    comparison of the two systems:

    Current:
    1. mount calls mount.gfs2
    2. mount.gfs2 connects to gfs_controld to retrieve the journal id
    3. mount.gfs2 adds the journal id to the mount command line and calls
    the mount system call
    4. gfs_controld receives the status of the mount request via a uevent

    Proposed:
    1. mount calls the mount system call (no mount.gfs2 helper)
    2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
    about already
    3. gfs_controld assigns a journal id to it via sysfs
    4. the mount system call then completes as normal (sending a uevent
    according to status)

    The advantage of the proposed system is that it is completely backward
    compatible with the current system both at the kernel and at the
    userland levels. The "first" parameter can also be set the same way,
    with the restriction that it must be set before the journal id is
    assigned.
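
    The ordering restriction on "first" can be modelled with a small state structure. This is a sketch under stated assumptions: `struct mount_state`, `set_first`, `assign_jid` and the `-EBUSY` return value are hypothetical names and choices made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mount state while the kernel waits for a journal id. */
struct mount_state {
    bool jid_assigned;
    int jid;
    bool first;
};

/* "first" may only be changed while the journal id is still unassigned. */
int set_first(struct mount_state *m, bool first)
{
    assert(m != NULL);
    if (m->jid_assigned)
        return -EBUSY;      /* assumption: error code chosen for illustration */
    m->first = first;
    return 0;
}

/* Assigning the journal id is the step that lets the waiting mount go on. */
int assign_jid(struct mount_state *m, int jid)
{
    assert(m != NULL);
    m->jid = jid;
    m->jid_assigned = true;
    return 0;
}
```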

    In addition, if mount becomes stuck waiting for a reply from
    gfs_controld which never arrives, then it is killable and will abort the
    mount gracefully.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

23 Jul, 2010

1 commit

  • Workqueue can now handle high concurrency. Convert gfs to use
    workqueue instead of slow-work.

    * Steven pointed out that the recovery path might be run from the
    allocation path and thus requires a forward progress guarantee
    without memory allocation. Create and use gfs_recovery_wq with a
    rescuer. Please note that forward progress wasn't guaranteed with
    slow-work.

    * Updated to use non-reentrant workqueue.

    Signed-off-by: Tejun Heo
    Acked-by: Steven Whitehouse

    Tejun Heo
     

21 May, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
    GFS2: Fix typo
    GFS2: stuck in inode wait, no glocks stuck
    GFS2: Eliminate useless err variable
    GFS2: Fix writing to non-page aligned gfs2_quota structures
    GFS2: Add some useful messages
    GFS2: fix quota state reporting
    GFS2: Various gfs2_logd improvements
    GFS2: glock livelock
    GFS2: Clean up stuffed file copying
    GFS2: docs update
    GFS2: Remove space from slab cache name

    Linus Torvalds
     

14 May, 2010

1 commit


06 May, 2010

1 commit

  • The following patch adds a message to indicate when barriers have been
    disabled due to a block device which doesn't support them. You could
    already tell this via the mount options in /proc/mounts, but all the
    other filesystems also log a message at the same time.

    Also, the same mechanisms are used to indicate when the lock
    demote interface has been used (only ever used for debugging)
    which is a request from our support team.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 May, 2010

1 commit

  • This patch contains various tweaks to how log flushes and active item writeback
    work. gfs2_logd is now managed by a waitqueue, and gfs2_log_reserve() now waits
    for gfs2_logd to do the log flushing. Multiple functions were rewritten to
    remove the need to call gfs2_log_lock(). Instead of using one test to see if
    gfs2_logd had work to do, there are now separate tests to check if there
    are too many buffers in the incore log or if there are too many items on the
    active items list.
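
    The idea of two separate "does logd have work" tests can be sketched as follows. The structure, field names, thresholds and the use of `>=` are assumptions for illustration; the real code keeps this state in the GFS2 superblock data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the counters logd would look at. */
struct log_state {
    unsigned int incore_buffers;  /* buffers held in the incore log */
    unsigned int incore_limit;    /* assumed threshold for the incore log */
    unsigned int ail_items;       /* items on the active items list */
    unsigned int ail_limit;       /* assumed threshold for the active items */
};

/* Test 1: too many buffers in the incore log. */
static bool incore_log_full(const struct log_state *s)
{
    return s->incore_buffers >= s->incore_limit;
}

/* Test 2: too many items on the active items list. */
static bool ail_list_full(const struct log_state *s)
{
    return s->ail_items >= s->ail_limit;
}

/* logd has work to do if either condition holds. */
bool logd_has_work(const struct log_state *s)
{
    assert(s != NULL);
    return incore_log_full(s) || ail_list_full(s);
}
```

    Keeping the tests separate lets the caller decide which kind of flush is needed rather than treating every wakeup the same way.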

    This patch is a port of a patch Steve Whitehouse wrote about a year ago, with
    some minor changes. Since gfs2_ail1_start always submits all the active items,
    it no longer needs to keep track of the first ai submitted, so this has been
    removed. In gfs2_log_reserve(), the order of the calls to
    prepare_to_wait_exclusive() and wake_up() when firing off the logd thread has
    been switched. If it called wake_up first there was a small window for a race,
    where logd could run and return before gfs2_log_reserve was ready to get woken
    up. If gfs2_logd ran, but did not free up enough blocks, gfs2_log_reserve()
    would be left waiting for gfs2_logd to eventually run because it timed out.
    Finally, gt_logd_secs, which controls how long to wait before gfs2_logd times
    out, and flushes the log, can now be set on mount with ar_commit.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
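
    The first rule in the list above (slab.h if slab is used, gfp.h if only gfp is used) can be sketched in C. This is not the conversion script, which is written in Python; the function name and the substring patterns used to detect "usage" are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the text of a source file, return the one
 * header it should include directly, or NULL if it needs neither.
 * Detection by substring match is a simplification of whatever the
 * real script does.
 */
const char *header_needed(const char *src)
{
    assert(src != NULL);

    /* Any slab allocator call means slab.h is wanted. */
    if (strstr(src, "kmalloc(") || strstr(src, "kzalloc(") ||
        strstr(src, "kfree("))
        return "linux/slab.h";

    /* Only gfp flags are used: gfp.h alone is enough. */
    if (strstr(src, "GFP_"))
        return "linux/gfp.h";

    return NULL;
}
```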

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

08 Mar, 2010

2 commits

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing
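
    A minimal userspace sketch of what constifying an ops structure looks like. `struct demo_ops`, `demo_show` and `call_show` are hypothetical stand-ins, not the kernel's `struct sysfs_ops`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ops table with a single callback. */
struct demo_ops {
    int (*show)(char *buf, size_t len);
};

static int demo_show(char *buf, size_t len)
{
    assert(buf != NULL);
    return snprintf(buf, len, "hello\n");
}

/*
 * Declared const: the table can be placed in read-only data, and an
 * assignment such as "demo_sysfs_ops.show = NULL;" is rejected at
 * compile time.
 */
static const struct demo_ops demo_sysfs_ops = {
    .show = demo_show,
};

/* Users of the table only ever need a pointer to const. */
int call_show(const struct demo_ops *ops, char *buf, size_t len)
{
    assert(ops != NULL && ops->show != NULL);
    return ops->show(buf, len);
}
```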

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     
  • Constify struct kset_uevent_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing

    Signed-off-by: Emese Revfy
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

06 Mar, 2010

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     

05 Mar, 2010

1 commit

  • Currently sync_quota_sb does a lot of sync and truncate action that only
    applies to "VFS" style quotas and is actively harmful for the sync
    performance in XFS. Move it into vfs_quota_sync and add a wait parameter
    to ->quota_sync to tell if we need it or not.

    My audit of the GFS2 code says it's also not needed given the way GFS2
    implements quotas, but I'd be happy if this can get a detailed review.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

01 Mar, 2010

1 commit

  • As a consequence of the previous patch, we can now remove the
    loop which used to be required due to the circular dependency
    between the inodes and glocks. Instead we can just invalidate
    the inodes, and then clear up any glocks which are left.

    Also we no longer need the rwsem since there is no longer any
    danger of the inode invalidation calling back into the glock
    code (and from there back into the inode code).

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

16 Dec, 2009

1 commit


03 Dec, 2009

2 commits


09 Sep, 2009

1 commit


17 Aug, 2009

2 commits

  • This adds a link from the per-gfs2 sb sysfs directory to
    the block device upon which the filesystem is mounted. The
    link is called "device", strangely enough :-)

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • With each uevent, we now always include the journal ID. We
    can't call it JID since that is already in use by some of
    the individual events relating to recovery, so we use
    JOURNALID instead. We don't send the JOURNALID for spectator
    mounts, since there isn't one.

    Also the ADD event now has both RDONLY and SPECTATOR information
    to match that of the ONLINE event.
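
    The JOURNALID rule can be sketched as a small formatting helper. The function name and signature are hypothetical; only the variable name JOURNALID and the "not sent for spectators" rule come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the JOURNALID environment string for a
 * uevent into buf. Spectator mounts have no journal, so nothing is
 * emitted for them and 0 is returned. Otherwise the return value is
 * the length snprintf() reports for the formatted string.
 */
int format_journalid(char *buf, size_t len, int jid, int spectator)
{
    assert(buf != NULL && len > 0);

    if (spectator) {
        buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, "JOURNALID=%d", jid);
}
```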

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Aug, 2009

1 commit


26 May, 2009

2 commits

  • Since we can cat /proc/mounts there is no need to have this
    subdirectory in the gfs2 sysfs files. In fact this does not
    reflect the full range of possible mount arguments, whereas
    /proc/mounts does.

    There was only one userland user of this set of sysfs files
    and it will function perfectly well without these files
    being present (in fact that subcommand of gfs2_tool is
    obsolete anyway).

    The tune/* subdirectory is also considered mostly obsolete,
    but there are a few uses of this until mount arguments can
    be added for the last few functions for which there are no
    equivalents currently. However the tune/* directory is still
    in my sights and new code should avoid using it. Only the gfs2_quota
    and gfs2_tool programs are known to use tune/* at the moment.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The lockstruct sub directory contained two entries, both of
    which are duplicated elsewhere in the gfs2 sysfs files as
    well as being available via /proc/mounts. There is no userland program
    using either of them, so this patch removes them.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

19 May, 2009

1 commit

  • This patch fixes a race condition where we can receive recovery
    requests part way through processing a umount. This was causing
    problems since the recovery thread had already gone away.

    Looking in more detail at the recovery code, it was really trying
    to implement a slight variation on a work queue, and that happens to
    align nicely with the recently introduced slow-work subsystem. As a
    result I've updated the code to use slow-work, rather than its own home
    grown variety of work queue.

    When using the wait_on_bit() function, I noticed that the wait function
    that was supplied as an argument was appearing in the WCHAN field, so
    I've updated the function names in order to produce more meaningful
    output.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

13 May, 2009

2 commits

  • These two tunables are pointless and would never need to be
    changed anyway. There is also a race between them and umount
    as the daemons which they refer to might have gone away. The
    easiest way to fix the race is to remove the interface.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • It has always been possible to adjust the gfs2 log commit
    interval, but only from the sysfs interface. This adds a
    mount option, commit=, which will be familiar to ext3
    users.

    The sysfs interface continues to be available as well, although
    this might be removed in the future.

    Also this patch cleans up some duplicated structures in the GFS2
    sysfs code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

24 Mar, 2009

4 commits

  • This adds a sysfs file called demote_rq to GFS2's
    per filesystem directory. It's possible to use this
    file to demote arbitrary glocks in exactly the same
    way as if a request had come in from a remote node.

    This is intended for testing issues relating to caching
    of data under glocks. Despite that, the interface is
    generic enough to send requests to any type of glock,
    but be careful as it's not always safe to send an
    arbitrary message to an arbitrary glock. For that reason
    and to prevent DoS, this interface is restricted to root
    only.

    The messages look like this:

    TYPE:NUMBER MODE

    Example:

    echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq

    Which means "please demote inode glock (type 2) number 13324 so that
    I can get an EX (exclusive) lock". The lock modes are those which
    would normally be sent by a remote node in its callback so if you
    want to unlock a glock, you use EX, to demote to shared, use SH or PR
    (depending on whether you like GFS2 or DLM lock modes better!).

    If the glock doesn't exist, you'll get -ENOENT returned. If the
    arguments don't make sense, you'll get -EINVAL returned.
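
    Going by the example above ("2:13324 EX"), a request is a glock type, a colon, a glock number, a space and a lock mode. A userspace sketch of parsing such a string follows. The names `parse_demote_rq`, `struct demote_req` and the `DEMO_*` enum are hypothetical, and only the modes named in the text (EX, SH, PR) are handled; any others that the real interface accepts are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical target states; not the kernel's lock state constants. */
enum demo_target { DEMO_UNLOCKED, DEMO_SHARED };

struct demote_req {
    unsigned int type;          /* glock type, e.g. 2 for an inode glock */
    unsigned long long number;  /* glock number */
    enum demo_target target;    /* state the glock is demoted to */
};

/* Returns 0 on success or -EINVAL if the message does not make sense. */
int parse_demote_rq(const char *buf, struct demote_req *rq)
{
    char mode[16];

    assert(buf != NULL && rq != NULL);

    if (sscanf(buf, "%u:%llu %15s", &rq->type, &rq->number, mode) != 3)
        return -EINVAL;

    if (strcmp(mode, "EX") == 0)
        rq->target = DEMO_UNLOCKED;   /* remote wants EX: unlock locally */
    else if (strcmp(mode, "SH") == 0 || strcmp(mode, "PR") == 0)
        rq->target = DEMO_SHARED;     /* remote wants shared: demote to shared */
    else
        return -EINVAL;

    return 0;
}
```

    The -ENOENT case (no such glock) needs a lookup in the glock table and is not modelled here.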

    The plan is that this interface will be used in combination with
    the blktrace patch which I recently posted for comments although
    it is, of course, still useful in its own right.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Since we have a UUID, we ought to expose it to the user via sysfs
    and uevents. We already have the fs name in both of these places
    (a combination of the lock proto and lock table name) so if we add
    the UUID as well, we have a full set.

    For older filesystems (i.e. those created before mkfs.gfs2 was writing
    UUIDs by default) the sysfs file will appear zero length, and no UUID
    env var will be added to the uevents.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This is the big patch that I've been working on for some time
    now. There are many reasons for wanting to make this change
    such as:
    o Reducing overhead by eliminating duplicated fields between structures
    o Simplification of the code (reduces the code size by a fair bit)
    o The locking interface is now the DLM interface itself as proposed
    some time ago.
    o Fewer lookups of glocks when processing replies from the DLM
    o Fewer memory allocations/deallocations for each glock
    o Scope to do further optimisations in the future (but this patch is
    more than big enough for now!)

    Please note that (a) this patch relates to the lock_dlm module and
    not the DLM itself, that is still a separate module; and (b) that
    we retain the ability to build GFS2 as a standalone single node
    filesystem without requiring the DLM.

    This patch needs a lot of testing, hence my keeping it until I restarted
    my -git tree after the last merge window. That way, this has the maximum
    exposure before it's merged. This is (modulo a few minor bug fixes) the
    same patch that I've been posting on and off for the last three months
    and it's passed a number of different tests so far.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Deallocation of gfs2_quota_data objects now happens on-demand through a
    shrinker instead of routinely deallocating through the quotad daemon.

    Signed-off-by: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
     

05 Jan, 2009

4 commits

  • Remove code that used to have something to do with initrd
    but has been unused for a long time.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • We ought to inform the user of the locktable and lockproto for each
    uevent we generate.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removes the two daemons, gfs2_scand and gfs2_glockd
    and replaces them with a shrinker which is called from the VM.

    The net result is that GFS2 responds better when there is memory
    pressure, since it shrinks the glock cache at the same rate
    as the VFS shrinks the dcache and icache. There are no longer
    any time based criteria for shrinking glocks, they are kept
    until such time as the VM asks for more memory and then we
    demote just as many glocks as required.
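
    The "demote just as many glocks as required" behaviour can be sketched as a scan over a list of cached glocks that stops once the requested count is reached. This is a simplified userspace model; `struct demo_glock` and `shrink_glocks` are hypothetical and the real shrinker works on the kernel's own LRU list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cached glock: may be in use, and may still hold a cached lock. */
struct demo_glock {
    int in_use;
    int cached;
};

/*
 * Demote at most nr_wanted idle cached glocks, oldest first, and
 * return how many were actually demoted. Glocks that are in use are
 * skipped; nothing is demoted on a timer.
 */
size_t shrink_glocks(struct demo_glock *lru, size_t n, size_t nr_wanted)
{
    size_t freed = 0;

    assert(lru != NULL || n == 0);

    for (size_t i = 0; i < n && freed < nr_wanted; i++) {
        if (lru[i].cached && !lru[i].in_use) {
            lru[i].cached = 0;  /* stands in for demoting the glock */
            freed++;
        }
    }
    return freed;
}
```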

    There are potential future changes to this code, including the
    possibility of sorting the glocks which are to be written back
    into inode number order, to get a better I/O ordering. It would
    be very useful to have an elevator based workqueue implementation
    for this, as that would automatically deal with the read I/O cases
    at the same time.

    This patch is my answer to Andrew Morton's remark, made during
    the initial review of GFS2, asking why GFS2 needs so many kernel
    threads, the answer being that it doesn't :-) This patch is a
    net loss of about 200 lines of code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch is a clean up of gfs2_quotad prior to giving it an
    extra job to do in addition to the current portfolio of updating
    the quota and statfs information from time to time.

    As a result it has been moved into quota.c allowing one of the
    functions it calls to be made static. Also the clean up allows
    the two existing functions to have separate timeouts and also
    to coexist with its future role of dealing with the "truncate in
    progress" inode flag.

    The (pointless) setting of gfs2_quotad_secs is removed since we
    arrange to only wake up quotad when one of the two timers expires.

    In addition the struct gfs2_quota_data is moved into a slab cache,
    mainly for easier debugging. It should also be possible to use
    a shrinker in the future, rather than the current scheme of scanning
    the quota data entries from time to time.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

18 Sep, 2008

1 commit

  • Until now, we've used the same scheme as GFS1 for atime. This has failed
    since atime is a per vfsmnt flag, not a per fs flag and as such the
    "noatime" flag was not getting passed down to the filesystems. This
    patch removes all the "special casing" around atime updates and we
    simply use the VFS's atime code.

    The net result is that GFS2 will now support all the same atime related
    mount options of any other filesystem on a per-vfsmnt basis. We do lose
    the "lazy atime" updates, but we gain "relatime". We could add lazy
    atime to the VFS at a later date, if there is a requirement for that
    variant still - I suspect relatime will be enough.

    Also we lose about 100 lines of code after this patch has been applied,
    and I have a suspicion that it will speed things up a bit, even when
    atime is "on". So it seems like a nice clean up as well.

    From a user perspective, everything stays the same except the loss of
    the per-fs atime quantum tweakable (ought to be per-vfsmnt at the very
    least, and to be honest I don't think anybody ever used it) and that a
    number of options which were ignored before now work correctly.

    Please let me know if you've got any comments. I'm pushing this out
    early so that you can all see what my plans are.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

10 Jul, 2008

1 commit


27 Jun, 2008

1 commit

  • There are several reasons why this is undesirable:

    1. It never happens during normal operation anyway
    2. If it does happen it causes performance to be very, very poor
    3. It isn't likely to solve the original problem (memory shortage
    on remote DLM node) it was supposed to solve
    4. It uses a bunch of arbitrary constants which are unlikely to be
    correct for any particular situation and for which the tuning seems
    to be a black art.
    5. In an N node cluster, only 1/N of the dropped locks will actually
    contribute to solving the problem on average.

    So all in all we are better off without it. This also makes merging
    the lock_dlm module into GFS2 a bit easier.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse