03 Apr, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     

01 Apr, 2009

2 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and can also provide more information, e.g. virtual_address, to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix, and will also make it easier to
    merge page_mkwrite() with fault() in future.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
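
The move from errno-style returns to VM_FAULT_xxx flags can be illustrated with a small userspace sketch. The SKETCH_FAULT_* names and values are hypothetical stand-ins, not the kernel's VM_FAULT_* definitions, and the errno-to-flag mapping is an assumption chosen for illustration rather than something taken from this commit.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's VM_FAULT_* bit flags. */
#define SKETCH_FAULT_OOM    0x0001u
#define SKETCH_FAULT_SIGBUS 0x0002u

/*
 * Translate an errno-style result from a filesystem's page_mkwrite work
 * into a fault flag word. Flags can say *why* the fault failed (out of
 * memory versus an I/O or ENOSPC error) instead of a bare errno.
 */
static unsigned int mkwrite_errno_to_fault(int err)
{
    assert(err <= 0);             /* callers pass 0 or a negative errno */
    if (err == 0)
        return 0;                 /* success: no fault flags set */
    if (err == -ENOMEM)
        return SKETCH_FAULT_OOM;  /* report an out-of-memory condition */
    return SKETCH_FAULT_SIGBUS;   /* anything else: signal the faulting task */
}
```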
     
  • Reading current->fs->umask is what most fs_struct users are doing.
    Put that into a helper function.

    Signed-off-by: Al Viro

    Al Viro
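
A minimal userspace model of the new helper, assuming simplified stand-ins for struct fs_struct and the task structure. Here `current` is an ordinary global pointer, whereas in the kernel it is an architecture-specific macro; the struct layouts and apply_umask() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct fs_struct {
    int umask;                 /* file creation mask, e.g. 022 */
};

struct task {
    struct fs_struct *fs;      /* filesystem context of the task */
};

static struct task *current;   /* plain global here; a macro in the kernel */

/* The helper: callers no longer dereference current->fs->umask themselves. */
static int current_umask(void)
{
    assert(current != NULL && current->fs != NULL);
    return current->fs->umask;
}

/* Typical caller pattern: strip umask bits from a requested creation mode. */
static unsigned int apply_umask(unsigned int mode)
{
    return mode & ~(unsigned int)current_umask();
}
```

With a umask of 022, a requested mode of 0666 becomes 0644.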
     

28 Mar, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits)
    fs: avoid I_NEW inodes
    Merge code for single and multiple-instance mounts
    Remove get_init_pts_sb()
    Move common mknod_ptmx() calls into caller
    Parse mount options just once and copy them to super block
    Unroll essentials of do_remount_sb() into devpts
    vfs: simple_set_mnt() should return void
    fs: move bdev code out of buffer.c
    constify dentry_operations: rest
    constify dentry_operations: configfs
    constify dentry_operations: sysfs
    constify dentry_operations: JFS
    constify dentry_operations: OCFS2
    constify dentry_operations: GFS2
    constify dentry_operations: FAT
    constify dentry_operations: FUSE
    constify dentry_operations: procfs
    constify dentry_operations: ecryptfs
    constify dentry_operations: CIFS
    constify dentry_operations: AFS
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

24 Mar, 2009

20 commits

  • This removes some old code that was causing issues during
    filesystem freeze.

    Reported-by: Andrew Price
    Tested-by: Andrew Price
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The logic requires that we mark the glock dirty in page_mkwrite,
    otherwise we might not flush correctly in the case that no
    allocation was required in the process of dirtying the page.
    We also need to set the shared write flag early for the same
    reason.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This cleans up a number of bits of code mostly based in glops.c.
    A couple of simple functions have been merged into the callers
    to make it more obvious what is going on, the mysterious raising
    of i_writecount around the truncate_inode_pages() call has been
    removed. The meta_go_* operations have been renamed rgrp_go_*
    since that is the only lock type that they are used with.

    The unused argument of gfs2_read_sb has been removed. Also
    a bug has been fixed where a check for the rindex inode was
    in the wrong callback. More comments are added, and the
    debugging code is improved too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • After calling out to the dlm, GFS2 sets the new state of a glock to
    gl_target in gdlm_ast(). However, gl_target is not always the lock
    state that was requested. If a conversion from shared to exclusive
    fails, finish_xmote() will call do_xmote() with LM_ST_UNLOCKED, instead
    of gl->gl_target, so that it can reacquire the lock in exclusive mode the next
    time around. In this case, setting the lock to gl_target in gdlm_ast()
    will make GFS2 think that it has the glock in exclusive mode, when
    really, it doesn't have the glock locked at all. This patch adds a new
    field to the gfs2_glock structure, gl_req, to track the mode that was
    requested.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
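
A simplified model of the bookkeeping change, assuming a cut-down glock with only the three state fields involved. The field names follow the commit text (gl_state, gl_target, gl_req), but the state enum and both functions are hypothetical: the point is that completion records the mode that was actually requested, not the eventual target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock states mirroring unlocked / shared / exclusive. */
enum lm_state { ST_UNLOCKED, ST_SHARED, ST_EXCLUSIVE };

struct glock {
    enum lm_state gl_state;   /* what we actually hold */
    enum lm_state gl_target;  /* what we ultimately want */
    enum lm_state gl_req;     /* what the current DLM request asked for */
};

/* Issue a request: remember exactly which mode was asked for. */
static void do_xmote(struct glock *gl, enum lm_state req)
{
    assert(gl != NULL);
    gl->gl_req = req;
}

/* Completion callback: adopt the requested mode, not gl_target. */
static void lock_complete(struct glock *gl)
{
    assert(gl != NULL);
    gl->gl_state = gl->gl_req;
}
```

In the failed shared-to-exclusive scenario, the unlock request completes with gl_state == ST_UNLOCKED while gl_target is still ST_EXCLUSIVE; copying gl_target instead would wrongly report an exclusive lock.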
     
  • I introduced the "is_partially_uptodate" aop for GFS2.

    A page can have multiple buffers, and even if a page is not uptodate,
    some of its buffers can be uptodate when pagesize != blocksize.
    This aop checks whether all buffers corresponding to the part of the file
    that we want to read are uptodate. If so, we do not have to issue actual
    read I/O to the disk even though the page is not uptodate, because the
    portion we want to read is uptodate.
    The "block_is_partially_uptodate" function is already used by ext2/3/4.
    With this patch, mixed random read/write workloads and random reads after
    random writes can be optimized, giving a performance improvement.

    I did a performance test using sysbench:

    # sysbench --num-threads=16 --max-requests=200000 --test=fileio --file-num=1 \
      --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 \
      --file-rw-ratio=1 run

    -2.6.29-rc6
    Test execution summary:
    total time: 202.6389s
    total number of events: 200000
    total time taken by event execution: 2580.0480
    per-request statistics:
    min: 0.0000s
    avg: 0.0129s
    max: 49.5852s
    approx. 95 percentile: 0.0462s

    -2.6.29-rc6-patched
    Test execution summary:
    total time: 177.8639s
    total number of events: 200000
    total time taken by event execution: 2419.0199
    per-request statistics:
    min: 0.0000s
    avg: 0.0121s
    max: 52.4306s
    approx. 95 percentile: 0.0444s

    arch: ia64
    pagesize: 16k
    blocksize: 4k

    Signed-off-by: Hisashi Hifumi
    Signed-off-by: Steven Whitehouse

    Hisashi Hifumi
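
The check can be sketched in userspace as below, assuming the 16k page / 4k block geometry from the test above and a plain array of per-buffer uptodate flags in place of the real buffer_head list. The function name is hypothetical; it illustrates the idea behind block_is_partially_uptodate rather than reproducing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE  16384u  /* ia64 page size from the test above */
#define SKETCH_BLOCK_SIZE 4096u   /* filesystem block size */
#define BUFS_PER_PAGE (SKETCH_PAGE_SIZE / SKETCH_BLOCK_SIZE)

/* True if every buffer overlapping [from, from + count) is uptodate. */
static bool range_is_uptodate(const bool uptodate[BUFS_PER_PAGE],
                              size_t from, size_t count)
{
    assert(count > 0 && from + count <= SKETCH_PAGE_SIZE);
    size_t first = from / SKETCH_BLOCK_SIZE;
    size_t last = (from + count - 1) / SKETCH_BLOCK_SIZE;

    for (size_t i = first; i <= last; i++)
        if (!uptodate[i])
            return false;   /* at least one buffer still needs read I/O */
    return true;            /* serve the read without touching the disk */
}
```

With buffers 0, 2 and 3 uptodate, a read of the last 8k of the page can be served from cache, while any read touching buffer 1 still needs I/O.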
     
  • Impact: Make symbol static.

    Fix this sparse warning:
    fs/gfs2/rgrp.c:188:5: warning: symbol 'gfs2_bitfit' was not declared. Should it be static?

    Signed-off-by: Hannes Eder
    Signed-off-by: Steven Whitehouse

    Hannes Eder
     
  • Fix these sparse warnings:
    fs/gfs2/rgrp.c:156:23: warning: constant 0xffffffffffffffff is so big it is unsigned long long
    fs/gfs2/rgrp.c:157:23: warning: constant 0xaaaaaaaaaaaaaaaa is so big it is unsigned long long
    fs/gfs2/rgrp.c:158:23: warning: constant 0x5555555555555555 is so big it is long long
    fs/gfs2/rgrp.c:194:20: warning: constant 0x5555555555555555 is so big it is long long
    fs/gfs2/rgrp.c:204:44: warning: constant 0x5555555555555555 is so big it is long long

    Signed-off-by: Hannes Eder
    Signed-off-by: Steven Whitehouse

    Hannes Eder
     
  • This adds support for "quota" and "noquota" mount options in addition to the
    existing "quota=on/off/account" so that we are compatible with the names by
    which these options are more generally known.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
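
The intended equivalences can be sketched as a small option parser. It assumes "quota" means the same as "quota=on" and "noquota" the same as "quota=off", which is how these names are generally understood; the enum and function names are hypothetical, not GFS2's own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical quota modes. */
enum quota_mode { QUOTA_OFF, QUOTA_ACCOUNT, QUOTA_ON, QUOTA_INVALID };

/* Map one mount option string to a quota mode. */
static enum quota_mode parse_quota_opt(const char *opt)
{
    assert(opt != NULL);
    if (strcmp(opt, "quota") == 0 || strcmp(opt, "quota=on") == 0)
        return QUOTA_ON;           /* generic name and GFS2 name agree */
    if (strcmp(opt, "noquota") == 0 || strcmp(opt, "quota=off") == 0)
        return QUOTA_OFF;
    if (strcmp(opt, "quota=account") == 0)
        return QUOTA_ACCOUNT;      /* no generic spelling for this one */
    return QUOTA_INVALID;
}
```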
     
  • An alignment issue with the existing bitfit algorithm was reported
    on IA64. This patch attempts to fix that, and also to tidy up the
    code a bit. There is now more documentation about how this works
    and it has survived a number of different tests.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
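
A word-at-a-time bitmap search of the kind bitfit performs can be sketched as follows. This assumes two bits per block, with block i stored in bits 2i and 2i+1 of a 64-bit word, and uses the 0xffff..., 0xaaaa... and 0x5555... patterns named in the sparse warnings earlier in this log. The state names and functions are hypothetical, and a real implementation also has to handle unaligned starts and scans across many words, which this sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-bit block states; block i lives in bits 2i+1..2i. */
enum blk_state { BLK_FREE = 0, BLK_USED = 1, BLK_UNLINKED = 2, BLK_DINODE = 3 };

/* XOR patterns that turn every 2-bit field equal to the index into 0b11. */
static const uint64_t search[4] = {
    0xffffffffffffffffULL, /* matches 0b00 */
    0xaaaaaaaaaaaaaaaaULL, /* matches 0b01 */
    0x5555555555555555ULL, /* matches 0b10 */
    0x0000000000000000ULL, /* matches 0b11 */
};

/* Index (0..31) of the first block in `word` that is in `state`, or -1. */
static int first_block_in_state(uint64_t word, enum blk_state state)
{
    assert((unsigned)state < 4);
    uint64_t tmp = word ^ search[state];
    tmp &= tmp >> 1;                /* low bit of a field set iff both bits were 1 */
    tmp &= 0x5555555555555555ULL;   /* keep only the low bit of each field */
    if (tmp == 0)
        return -1;
    int bit = 0;
    while (!(tmp & 1)) {            /* find the lowest set bit */
        tmp >>= 1;
        bit++;
    }
    return bit / 2;
}

/* Helper for building test words: set block `i` to `state`. */
static uint64_t set_block(uint64_t word, unsigned i, enum blk_state state)
{
    assert(i < 32);
    word &= ~(UINT64_C(3) << (2 * i));
    return word | ((uint64_t)state << (2 * i));
}
```

Each 64-bit word covers 32 blocks, so one XOR, one shift-and-AND and one mask test 32 blocks at once instead of looping over them.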
     
  • This adds a sysfs file called demote_rq to GFS2's
    per-filesystem directory. It's possible to use this
    file to demote arbitrary glocks in exactly the same
    way as if a request had come in from a remote node.

    This is intended for testing issues relating to caching
    of data under glocks. Despite that, the interface is
    generic enough to send requests to any type of glock,
    but be careful, as it's not always safe to send an
    arbitrary message to an arbitrary glock. For that reason
    and to prevent DoS, this interface is restricted to root
    only.

    The messages look like this:

    <type>:<number> <mode>

    Example:

    echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq

    Which means "please demote inode glock (type 2) number 13324 so that
    I can get an EX (exclusive) lock". The lock modes are those which
    would normally be sent by a remote node in its callback, so if you
    want to unlock a glock, use EX; to demote it to shared, use SH or PR
    (depending on whether you like GFS2 or DLM lock modes better!).

    If the glock doesn't exist, you'll get -ENOENT returned. If the
    arguments don't make sense, you'll get -EINVAL returned.

    The plan is that this interface will be used in combination with
    the blktrace patch which I recently posted for comments although
    it is, of course, still useful in its own right.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
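
A userspace sketch of parsing the request format, based on the example above: a numeric glock type, a colon, a glock number, a space and a lock mode. The struct, enum and function names are hypothetical; only the EX, SH and PR modes mentioned in the commit are accepted, and the -ENOENT case (glock not found) is outside the scope of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical outcome, following the text: EX means drop the lock,
 * SH or PR means demote to shared. */
enum demote_to { DEMOTE_TO_UNLOCKED, DEMOTE_TO_SHARED };

struct demote_req {
    unsigned int type;          /* glock type, e.g. 2 for an inode glock */
    unsigned long long number;  /* glock number, e.g. 13324 */
    enum demote_to target;
};

/* Parse "<type>:<number> <mode>", tolerating one trailing newline.
 * Returns 0 on success or -EINVAL for anything malformed. */
static int parse_demote_rq(const char *buf, struct demote_req *req)
{
    char mode[3];               /* two-letter mode plus NUL */
    int consumed = 0;

    assert(buf != NULL && req != NULL);
    if (sscanf(buf, "%u:%llu %2s%n", &req->type, &req->number, mode,
               &consumed) != 3)
        return -EINVAL;
    if (buf[consumed] != '\0' && strcmp(buf + consumed, "\n") != 0)
        return -EINVAL;         /* trailing garbage, e.g. "EXX" */

    if (strcmp(mode, "EX") == 0)
        req->target = DEMOTE_TO_UNLOCKED;
    else if (strcmp(mode, "SH") == 0 || strcmp(mode, "PR") == 0)
        req->target = DEMOTE_TO_SHARED;
    else
        return -EINVAL;
    return 0;
}
```

Tolerating a trailing newline means a plain `echo` (without `-n`) would also be accepted; that leniency is a design choice of this sketch.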
     
  • Since we have a UUID, we ought to expose it to the user via sysfs
    and uevents. We already have the fs name in both of these places
    (a combination of the lock proto and lock table name) so if we add
    the UUID as well, we have a full set.

    For older filesystems (i.e. those created before mkfs.gfs2 was writing
    UUIDs by default) the sysfs file will appear zero length, and no UUID
    env var will be added to the uevents.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
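
One way to produce the "zero length for older filesystems" behaviour is sketched below, assuming that an all-zero UUID marks a filesystem created before mkfs.gfs2 wrote UUIDs. The function name and the exact text format (lower-case hex in 8-4-4-4-12 groups) are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a 16-byte UUID for a sysfs-style "show" buffer. An all-zero UUID
 * is assumed to mean "no UUID" (older mkfs.gfs2) and yields "". */
static int uuid_show(const unsigned char uuid[16], char *buf, size_t len)
{
    static const unsigned char none[16];   /* all zeroes */

    assert(buf != NULL && len >= 37);      /* 36 characters plus NUL */
    if (memcmp(uuid, none, sizeof none) == 0) {
        buf[0] = '\0';
        return 0;                          /* zero-length file */
    }
    return snprintf(buf, len,
                    "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
                    "%02x%02x-%02x%02x%02x%02x%02x%02x",
                    uuid[0], uuid[1], uuid[2], uuid[3], uuid[4], uuid[5],
                    uuid[6], uuid[7], uuid[8], uuid[9], uuid[10], uuid[11],
                    uuid[12], uuid[13], uuid[14], uuid[15]);
}
```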
     
  • This patch allows GFS2 to generate discard requests for blocks which are
    no longer useful to the filesystem (i.e. those which have been freed as
    the result of an unlink operation). The requests are generated at the
    time at which those blocks become available for reuse in the filesystem.

    In order to use this new feature, you have to specify the "discard"
    mount option. The code coalesces adjacent blocks into a single extent
    when generating the discard requests, thus generating the minimum
    number.

    If an error occurs when the request has been sent to the block device,
    then it will print a message and turn off the requests for that
    filesystem. If the problem is temporary, then you can use remount to
    turn the option back on again. There is also a nodiscard mount option
    so that you can use remount to turn discard requests off, if required.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
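
The coalescing step can be sketched as a single pass over a sorted list of freed block numbers, merging each block into the current extent when it is adjacent. This assumes the freed blocks are already sorted and unique; the struct and function names are hypothetical, and issuing the resulting discard requests to the block device is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One discard request: `len` blocks starting at `start`. */
struct extent {
    uint64_t start;
    uint64_t len;
};

/* Coalesce sorted, unique freed block numbers into extents.
 * `out` must have room for `n` entries; returns the number written. */
static size_t coalesce_blocks(const uint64_t *blocks, size_t n,
                              struct extent *out)
{
    size_t nr = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || blocks[i] > blocks[i - 1]);  /* sorted input */
        if (nr > 0 && out[nr - 1].start + out[nr - 1].len == blocks[i]) {
            out[nr - 1].len++;          /* adjacent: grow the extent */
        } else {
            out[nr].start = blocks[i];  /* gap: start a new extent */
            out[nr].len = 1;
            nr++;
        }
    }
    return nr;
}
```

Freed blocks 10, 11, 12, 20, 21 and 30 become three requests, (10, 3), (20, 2) and (30, 1), instead of six.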
     
  • This patch fixes a deadlock when the journal is flushed and there
    are dirty inodes other than the one which caused the journal flush.
    Originally the journal flushing code was trying to obtain the
    transaction glock while running the flush code for an inode glock.
    We no longer require the transaction glock at this point in time
    since we know that any attempt to get the transaction glock from
    another node will result in a journal flush. So if we are flushing
    the journal, we can be sure that the transaction lock is still
    cached from when the transaction was started.

    By inlining a version of gfs2_trans_begin() (minus the bit which
    gets the transaction glock) we can avoid the deadlock problems
    caused if there is a demote request queued up on the transaction
    glock.

    In addition I've also moved the umount rwsem so that it covers
    the glock workqueue, since all demotions are done by this
    workqueue now. That fixes a bug on umount which I came across
    while fixing the original problem.

    Reported-by: David Teigland
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • We were keeping hold of an extra ref to the root inode in one
    of the error paths, which resulted in a hang.

    Reported-by: Nate Straz
    Signed-off-by: Steven Whitehouse
    Tested-by: Robert Peterson

    Steven Whitehouse
     
  • The time stamp field is unused in the glock now that we are
    using a shrinker, so that we can remove it and save sizeof(unsigned long)
    bytes in each glock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This is the big patch that I've been working on for some time
    now. There are many reasons for wanting to make this change
    such as:
    o Reducing overhead by eliminating duplicated fields between structures
    o Simplification of the code (reduces the code size by a fair bit)
    o The locking interface is now the DLM interface itself as proposed
    some time ago.
    o Fewer lookups of glocks when processing replies from the DLM
    o Fewer memory allocations/deallocations for each glock
    o Scope to do further optimisations in the future (but this patch is
    more than big enough for now!)

    Please note that (a) this patch relates to the lock_dlm module and
    not the DLM itself, that is still a separate module; and (b) that
    we retain the ability to build GFS2 as a standalone single node
    filesystem without requiring the DLM.

    This patch needs a lot of testing, hence my holding on to it until I
    restarted my -git tree after the last merge window. That way, it has the
    maximum exposure before it's merged. This is (modulo a few minor bug
    fixes) the same patch that I've been posting on and off for the last
    three months, and it has passed a number of different tests so far.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • We only really need a single spin lock for the quota data, so
    let's just use the lru lock for now.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Deallocation of gfs2_quota_data objects now happens on-demand through a
    shrinker instead of routinely deallocating through the quotad daemon.

    Signed-off-by: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
     
  • The quota code uses lvbs and this is currently not implemented in
    lock_nolock, thereby causing panics when quota is enabled with
    lock_nolock. This patch adds the relevant bits.

    Signed-off-by: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
     
  • The following patch fixes an issue relating to remount and argument
    parsing. After this fix is applied, remount becomes atomic in that
    it either succeeds changing the mount to the new state, or it fails
    and leaves it in the old state. Previously it was possible for the
    parsing of options to fail partway through and for the fs to be left
    in a state where some of the new arguments had been applied, but some
    had not.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
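
The parse-into-a-copy approach behind an atomic remount can be sketched as follows, assuming a hypothetical mount_args struct with just two boolean options (the discard and quota option names are borrowed from elsewhere in this log). Options are applied to a scratch copy, and the live arguments are only overwritten once every option has parsed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down set of mount arguments. */
struct mount_args {
    bool discard;
    bool quota;
};

/* Apply one option to `args`; returns 0 or -EINVAL for an unknown option. */
static int apply_option(struct mount_args *args, const char *opt)
{
    if (strcmp(opt, "discard") == 0)
        args->discard = true;
    else if (strcmp(opt, "nodiscard") == 0)
        args->discard = false;
    else if (strcmp(opt, "quota") == 0)
        args->quota = true;
    else if (strcmp(opt, "noquota") == 0)
        args->quota = false;
    else
        return -EINVAL;
    return 0;
}

/* All-or-nothing remount: parse into a scratch copy, commit only on success. */
static int remount_atomic(struct mount_args *cur, const char *const *opts,
                          size_t n)
{
    assert(cur != NULL && (n == 0 || opts != NULL));
    struct mount_args tmp = *cur;   /* start from the current state */

    for (size_t i = 0; i < n; i++)
        if (apply_option(&tmp, opts[i]) != 0)
            return -EINVAL;         /* *cur is left untouched */
    *cur = tmp;                     /* every option parsed: commit */
    return 0;
}
```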
     

10 Jan, 2009

1 commit

  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many cases, a commercial filesystem (e.g. VxFS) has the freeze feature
    and it is used to take a consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented ioctls for the freeze feature.
    I think we can take a consistent backup with the following steps:
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Renamed the super block operations write_super_lockfs and unlockfs
    to freeze_fs and unfreeze_fs to avoid confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep the current behavior.

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
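
Steps 1 to 3 of the backup procedure above can be driven from userspace with the FIFREEZE and FITHAW ioctls from <linux/fs.h>. The sketch assumes a file descriptor opened on the filesystem's mount point and a caller-supplied callback standing in for step 2 (the storage device's snapshot or replication split); the function and callback type are hypothetical. Freezing normally requires CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/fs.h>     /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>

/* Hypothetical callback standing in for step 2 (snapshot or replica split). */
typedef int (*snapshot_fn)(void *ctx);

/*
 * Freeze the filesystem behind `fd`, run the snapshot step, then thaw.
 * A NULL `snapshot` just freezes and thaws. Returns 0 on success or a
 * negative errno. Step 4, the backup itself, is taken afterwards from
 * the snapshot.
 */
static int frozen_snapshot(int fd, snapshot_fn snapshot, void *ctx)
{
    int ret = 0;

    assert(ctx == NULL || snapshot != NULL);  /* ctx without a callback is a bug */
    if (ioctl(fd, FIFREEZE, 0) == -1)
        return -errno;                        /* not frozen: nothing to undo */

    if (snapshot != NULL)
        ret = snapshot(ctx);                  /* writes are suspended here */

    if (ioctl(fd, FITHAW, 0) == -1 && ret == 0)
        ret = -errno;                         /* always attempt the thaw */
    return ret;
}
```

The thaw is attempted even when the snapshot step fails, so a failed snapshot never leaves the filesystem frozen.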
     

07 Jan, 2009

3 commits


05 Jan, 2009

11 commits

  • SPIN_LOCK_UNLOCKED is deprecated. The following makes the change suggested
    in Documentation/spinlocks.txt

    The semantic patch that makes this change is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @@
    declarer name DEFINE_SPINLOCK;
    identifier xxx_lock;
    @@

    - spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
    + DEFINE_SPINLOCK(xxx_lock);
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Steven Whitehouse

    Julia Lawall
     
  • This should solve the issue with the previous attempt at fixing this.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This reverts commit 78802499912f1ba31ce83a94c55b5a980f250a43.

    The original patch is causing problems with the order of
    operations at umount for jdata files. I need to fix
    this a different way.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removes some unused code, and makes the calculation
    of the number of blocks required conditional in order to reduce
    the number of times this (potentially expensive) calculation
    is done.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • In order to distinguish between two differing uevent messages
    and to avoid using the (racy) method of reading status from
    sysfs in future, this adds some status information to our
    uevent messages.

    Btw, before anybody says "sysfs isn't racy", I'm aware of that,
    but the way that GFS2 was using it (send an ambiguous uevent and
    then expect the receiver to read sysfs to find out the status
    of the reported operation) was.

    The additional benefit of using the new interface is that it
    should be possible for a node to recover multiple journals
    at the same time, since there is no longer any confusion as
    to which journal the status belongs to.

    At some future stage, when all the userland programs have been
    converted, I intend to remove the old interface.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There was a use-after-free with the GFS2 super block during
    umount. This patch moves almost all of the umount code from
    ->put_super into ->kill_sb, the only bit that cannot be moved
    being the glock hash clearing, which has to remain in ->put_super
    due to umount ordering requirements. As a result it's now obvious
    that the kfree is the final operation, whereas before it was
    hidden in ->put_super.

    Also gfs2_jindex_free is then only referenced from a single file
    so that's moved and marked static too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Remove code that used to have something to do with initrd
    but has been unused for a long time.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The functions which are being moved can all be marked
    static in their new locations, since they only have
    a single caller each. Their new locations are more
    logical than before and some of the functions are
    small enough that the compiler might well inline them.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • gfs2_lock_fs_check_clean() should not be calling gfs2_jindex_hold()
    since it doesn't work like rindex hold, despite the comment. That
    allows gfs2_jindex_hold() to be moved into ops_fstype.c where it
    can be made static.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • We ought to inform the user of the locktable and lockproto for each
    uevent we generate.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removes the two daemons, gfs2_scand and gfs2_glockd
    and replaces them with a shrinker which is called from the VM.

    The net result is that GFS2 responds better when there is memory
    pressure, since it shrinks the glock cache at the same rate
    as the VFS shrinks the dcache and icache. There are no longer
    any time based criteria for shrinking glocks, they are kept
    until such time as the VM asks for more memory and then we
    demote just as many glocks as required.

    There are potential future changes to this code, including the
    possibility of sorting the glocks which are to be written back
    into inode number order, to get a better I/O ordering. It would
    be very useful to have an elevator based workqueue implementation
    for this, as that would automatically deal with the read I/O cases
    at the same time.

    This patch is my answer to Andrew Morton's remark, made during
    the initial review of GFS2, asking why GFS2 needs so many kernel
    threads, the answer being that it doesn't :-) This patch is a
    net loss of about 200 lines of code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
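
The "demote just as many glocks as required" behaviour can be modelled with a toy LRU. This assumes the shrinker convention of this era, in which a call with nr_to_scan == 0 only asks how many objects are cached and any other call frees up to nr_to_scan of them, returning the number that remain. The structure and function names are hypothetical, and real demotion (writing back data and dropping the DLM lock) is reduced to removing an entry.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GLOCKS 64

/* Hypothetical LRU of unused glocks; numbers[0] is the least recently used. */
struct glock_lru {
    unsigned long long numbers[MAX_GLOCKS];
    size_t count;     /* glocks currently cached */
    size_t demoted;   /* running total of demoted glocks */
};

/*
 * Shrinker-style callback: nr_to_scan == 0 only reports the cache size;
 * otherwise demote up to nr_to_scan of the oldest glocks. Either way the
 * return value is the number of glocks left in the cache.
 */
static size_t glock_shrink(struct glock_lru *lru, size_t nr_to_scan)
{
    assert(lru != NULL && lru->count <= MAX_GLOCKS);
    size_t n = nr_to_scan < lru->count ? nr_to_scan : lru->count;

    /* "Demote" the n oldest entries by sliding the rest to the front. */
    for (size_t i = 0; i + n < lru->count; i++)
        lru->numbers[i] = lru->numbers[i + n];
    lru->count -= n;
    lru->demoted += n;
    return lru->count;
}
```

Nothing in the model ages glocks by time: they stay cached until memory pressure asks for a specific number of them.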