08 May, 2007

1 commit

  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG, otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).
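
    As a rough illustration of the "check the object state manually before
    the free" approach, here is a minimal userspace sketch. The struct, the
    helper names and the DEMO_DEBUG switch are all hypothetical, and plain
    free() merely stands in for the slab free call; none of this is code from
    the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object. Its "constructor state" is refs == 0 and a
 * next pointer that points back at the object itself. */
struct demo_obj {
	int refs;
	struct demo_obj *next;
};

/* Constructor: puts the object into its initial state. */
static void demo_ctor(struct demo_obj *o)
{
	o->refs = 0;
	o->next = o;
}

/* Returns 1 when the object is back in its constructor state. */
static int demo_is_pristine(const struct demo_obj *o)
{
	return o->refs == 0 && o->next == o;
}

/* The check sits next to the code that releases the object and is
 * compiled out entirely unless DEMO_DEBUG is defined. */
static void demo_release(struct demo_obj *o)
{
#ifdef DEMO_DEBUG
	assert(demo_is_pristine(o));
#endif
	free(o);
}
```

    Building with -DDEMO_DEBUG enables the check; without it the release path
    carries no extra cost.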

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Feb, 2007

1 commit


12 Feb, 2007

1 commit

  • Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or
    ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a
    kernel built without ACL support, then umask was ignored when creating
    inodes - though root or user has umask 022, touch creates files as 0666,
    and mkdir creates directories as 0777.

    This appears to have worked right until 2.6.11, when a fix to the default
    mode on symlinks (always 0777) assumed VFS applies umask: which it does,
    unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
    s_flags according to s_mount_opt set according to def_mount_opts.

    We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
    but other filesystems only set MS_POSIXACL when ACLs are configured. We
    could fix this at another level; but it seems most robust to avoid setting
    the s_mount_opt flag in the first place (at the expense of more ifdefs).

    Likewise don't set the XATTR_USER flag when built without XATTR support.
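
    The idea of not setting the mount option flag unless support is built in
    can be sketched as follows. This is only an illustration: the flag values,
    the DEMO_CONFIG_* switches and the function name are hypothetical and do
    not match the real ext[234] definitions.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_DEFM_XATTR_USER  0x1u	/* default option stored on disk */
#define DEMO_DEFM_ACL         0x2u
#define DEMO_MOUNT_XATTR_USER 0x1u	/* in-memory mount option */
#define DEMO_MOUNT_POSIX_ACL  0x2u

/* Define DEMO_CONFIG_XATTR and/or DEMO_CONFIG_POSIX_ACL at build time
 * to model a kernel built with the corresponding support. */
static unsigned int demo_default_mount_opts(unsigned int def_mount_opts)
{
	unsigned int opts = 0;

#ifdef DEMO_CONFIG_XATTR
	if (def_mount_opts & DEMO_DEFM_XATTR_USER)
		opts |= DEMO_MOUNT_XATTR_USER;
#endif
#ifdef DEMO_CONFIG_POSIX_ACL
	if (def_mount_opts & DEMO_DEFM_ACL)
		opts |= DEMO_MOUNT_POSIX_ACL;
#endif
	(void)def_mount_opts;	/* unused when both features are compiled out */
	return opts;
}
```

    With neither switch defined, the on-disk "acl" default never reaches the
    in-memory options, so nothing later can mistake the mount for an ACL mount.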

    Signed-off-by: Hugh Dickins
    Cc: Tigran Aivazian
    Cc:
    Cc: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Dec, 2006

1 commit

  • This facility provides three entry points:

    ilog2() Log base 2 of unsigned long
    ilog2_u32() Log base 2 of u32
    ilog2_u64() Log base 2 of u64

    These facilities can either be used inside functions on dynamic data:

    int do_something(long q)
    {
    ...;
    y = ilog2(q);
    ...;
    }

    Or can be used to statically initialise global variables with constant values:

    unsigned n = ilog2(27);

    When performing static initialisation, the compiler will report "error:
    initializer element is not constant" if asked to take a log of zero or of
    something not reducible to a constant. They treat negative numbers as
    unsigned.

    When not dealing with a constant, they fall back to using fls() which permits
    them to use arch-specific log calculation instructions - such as BSR on
    x86/x86_64 or SCAN on FRV - if available.
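
    The split between a constant-foldable form and a runtime fallback can be
    sketched in userspace C. The names below are hypothetical, the constant
    macro only covers 8-bit values, and the runtime path uses a plain shift
    loop rather than fls(). Unlike the real facility, this sketch yields -1
    for zero instead of a compile-time error.

```c
#include <assert.h>
#include <stdint.h>

/* Constant-expression form: a chain of conditionals that the compiler
 * can fold, so the result is usable in a static initialiser. */
#define DEMO_CONST_ILOG2_8(n)		\
	(((n) & 0x80) ? 7 :		\
	 ((n) & 0x40) ? 6 :		\
	 ((n) & 0x20) ? 5 :		\
	 ((n) & 0x10) ? 4 :		\
	 ((n) & 0x08) ? 3 :		\
	 ((n) & 0x04) ? 2 :		\
	 ((n) & 0x02) ? 1 :		\
	 ((n) & 0x01) ? 0 : -1)

/* Runtime form: position of the highest set bit, or -1 for zero. */
static int demo_ilog2_u32(uint32_t n)
{
	int log = -1;

	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/* Static initialisation with a constant argument. */
static const int demo_static_log = DEMO_CONST_ILOG2_8(27);
```

    Both forms round down, so 27 (binary 11011) gives 4.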

    [akpm@osdl.org: MMC fix]
    Signed-off-by: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Herbert Xu
    Cc: David Howells
    Cc: Wojtek Kaniewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

08 Dec, 2006

3 commits

  • Update ext2_statfs to return an FSID that is a 64 bit XOR of the 128 bit
    filesystem UUID as suggested by Andreas Dilger. See the following Bugzilla
    entry for details:

    http://bugzilla.kernel.org/show_bug.cgi?id=136
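
    Folding a 128 bit UUID into a 64 bit value by XOR can be sketched as
    below. The helper names are hypothetical, and reading the two halves as
    little-endian is an assumption of this sketch, not something stated in
    the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Read 8 bytes as a little-endian 64 bit integer, independent of the
 * host byte order. */
static uint64_t demo_load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* XOR the low and high halves of the 16 byte UUID together. */
static uint64_t demo_uuid_to_fsid(const uint8_t uuid[16])
{
	return demo_load_le64(uuid) ^ demo_load_le64(uuid + 8);
}
```

    Every UUID bit contributes to the result, so two filesystems with
    different UUIDs are unlikely (though not guaranteed) to share an FSID.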

    Cc: Andreas Dilger
    Cc: Stephen Tweedie
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Oct, 2006

1 commit

  • Current error behaviour for ext2 and ext3 filesystems does not fully
    correspond to the documentation and should be fixed.

    According to man 8 mount, ext2 and ext3 file systems allow one of three
    different on-error behaviours to be set:

    ---- start of quote man 8 mount ----

    errors=continue / errors=remount-ro / errors=panic

    Define the behaviour when an error is encountered. (Either ignore
    errors and just mark the file system erroneous and continue, or remount
    the file system read-only, or panic and halt the system.) The default is
    set in the filesystem superblock, and can be changed using tune2fs(8).

    ---- end of quote ----

    However EXT3_ERRORS_CONTINUE is not read from the superblock, and thus
    ERRORS_CONT is not saved in sbi->s_mount_opt. This leads to incorrect
    handling of errors on ext3.

    We then checked the corresponding code in ext2 and discovered that it is
    buggy as well:

    - EXT2_ERRORS_CONTINUE is not read from the superblock (the same);

    - parse_option() does not clear the alternative values and thus something
    like (ERRORS_CONT|ERRORS_RO) can be set;

    - if options are omitted, parse_option() does not set any of these options.

    Therefore it is possible to set any combination of these options on ext2:

    - none of them may be set: EXT2_ERRORS_CONTINUE on superblock / empty mount
    options;

    - any one of them may be set using mount options;

    - any two options may be set: by using EXT2_ERRORS_RO/EXT2_ERRORS_PANIC on
    the superblock and another value in mount options;

    - and finally all three options may be set by adding the third option on
    remount.

    Currently ext2 uses these values only in ext2_error(), so this does not
    cause any noticeable trouble. However, somebody may be discouraged when
    they try to work around EXT2_ERRORS_PANIC on the superblock by using
    errors=continue in mount options.

    This patch:

    EXT2_ERRORS_CONTINUE should be read from the superblock as the default
    value for error behaviour. parse_option() should clear the alternative
    options and should not change the default value taken from the superblock.
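
    The intended behaviour can be sketched in userspace C. All names and
    values below are hypothetical, and treating an unrecognised on-disk value
    as "continue" is an arbitrary choice of this sketch. The two points it
    shows are that the default comes from the superblock, and that choosing
    one errors= value clears the other two.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory mount option bits. */
#define DEMO_ERRORS_CONT  0x1u
#define DEMO_ERRORS_RO    0x2u
#define DEMO_ERRORS_PANIC 0x4u
#define DEMO_ERRORS_MASK  (DEMO_ERRORS_CONT | DEMO_ERRORS_RO | DEMO_ERRORS_PANIC)

/* Hypothetical on-disk superblock values. */
enum demo_sb_errors { DEMO_SB_CONTINUE = 1, DEMO_SB_RO = 2, DEMO_SB_PANIC = 3 };

/* Take the default error behaviour from the superblock. */
static unsigned int demo_default_from_sb(enum demo_sb_errors e)
{
	switch (e) {
	case DEMO_SB_PANIC:
		return DEMO_ERRORS_PANIC;
	case DEMO_SB_RO:
		return DEMO_ERRORS_RO;
	case DEMO_SB_CONTINUE:
	default:
		return DEMO_ERRORS_CONT;
	}
}

/* Apply one mount option. An errors= option replaces whatever error
 * behaviour was set; any other option leaves the default untouched. */
static unsigned int demo_parse_option(unsigned int opts, const char *opt)
{
	unsigned int chosen;

	if (strcmp(opt, "errors=continue") == 0)
		chosen = DEMO_ERRORS_CONT;
	else if (strcmp(opt, "errors=remount-ro") == 0)
		chosen = DEMO_ERRORS_RO;
	else if (strcmp(opt, "errors=panic") == 0)
		chosen = DEMO_ERRORS_PANIC;
	else
		return opts;

	return (opts & ~DEMO_ERRORS_MASK) | chosen;
}
```

    With this shape exactly one of the three bits is set at any time, so
    errors=continue in the mount options does override a panic default from
    the superblock.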

    Signed-off-by: Vasily Averin
    Acked-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

27 Sep, 2006

3 commits


19 Sep, 2006

1 commit

  • Fix a performance degradation introduced in 2.6.17. (30% degradation
    running dbench with 16 threads)

    Commit 21730eed11de42f22afcbd43f450a1872a0b5ea1, which claims to make
    EXT2_DEBUG work again, moves the taking of the kernel lock out of
    debug-only code in ext2_count_free_inodes and ext2_count_free_blocks and
    into ext2_statfs.

    The same problem was fixed in ext3 by removing the lock completely (commit
    5b11687924e40790deb0d5f959247ade82196665)

    Signed-off-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Kleikamp
     

17 Sep, 2006

1 commit


28 Aug, 2006

1 commit


04 Jul, 2006

1 commit

  • The quota code plays interesting games with the lock ordering; to quote Jan:

    | i_mutex of inode containing quota file is acquired after all other
    | quota locks. i_mutex of all other inodes is acquired before quota
    | locks. Quota code makes sure (by resetting inode operations and
    | setting special flag on inode) that noone tries to enter quota code
    | while holding i_mutex on a quota file...

    The good news is that all of this special case i_mutex grabbing happens in the
    (per filesystem) low level quota write function. For this special case we
    need a new I_MUTEX_* nesting level, since this is just entirely outside any of
    the regular VFS locking rules for i_mutex. I trust Jan on his blue eyes that
    this is not ever going to deadlock; and based on that the patch below is what
    it takes to inform lockdep of these very interesting new locking rules.

    The new locking rule for the I_MUTEX_QUOTA nesting level is that this is the
    deepest possible level of nesting for i_mutex, and that this only should be
    used in quota write (and possibly read) function of filesystems. This makes
    the lock ordering of the I_MUTEX_* levels:

    I_MUTEX_PARENT -> I_MUTEX_CHILD -> I_MUTEX_NORMAL -> I_MUTEX_QUOTA

    Has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Acked-by: Ingo Molnar
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

01 Jul, 2006

1 commit


26 Jun, 2006

2 commits

  • This patch makes EXT2_DEBUG work again. Due to the lack of a proper include
    file, EXT2_DEBUG was undefined in bitmap.c and ext2_count_free() was left
    out. Moved it to balloc.c and removed bitmap.c entirely.

    Second, the debug versions of ext2_count_free_{inodes/blocks} reacquire the
    superblock lock. Moved the lock into the callers.

    Signed-off-by: Val Henson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valerie Henson
     
  • The variable i is guaranteed to be the same as db_count given the previous
    for loop. So get rid of it since it's dead code.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

23 Jun, 2006

3 commits

  • The percpu counter data type is changed in this set of patches to support
    more users like ext3 who need more than 32 bits to store the free blocks
    total in the filesystem.

    - Generic percpu counter data type changes. The sizes of the global counter
    and local counter are explicitly specified using s64 and s32. The global
    counter is changed from long to s64, while the local counter is changed from
    long to s32, so we can avoid doing 64 bit updates in most cases.

    - Users of the percpu counters are updated to make use of the new
    percpu_counter_init() routine now taking an additional parameter to allow
    users to pass the initial value of the global counter.
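
    A single-threaded sketch of the data layout described above: a 64 bit
    global count, 32 bit per-CPU deltas, and an init routine that takes the
    initial value. The names, the fixed CPU count and the batch threshold are
    hypothetical, and the locking a real implementation needs is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4
#define DEMO_BATCH   4	/* fold a local delta once it reaches this size */

struct demo_percpu_counter {
	int64_t count;			/* global counter, 64 bit */
	int32_t local[DEMO_NR_CPUS];	/* per-CPU deltas, 32 bit */
};

/* Initialise with a caller-supplied starting value. */
static void demo_counter_init(struct demo_percpu_counter *c, int64_t initial)
{
	c->count = initial;
	for (int i = 0; i < DEMO_NR_CPUS; i++)
		c->local[i] = 0;
}

/* Most updates only touch the 32 bit local delta; the 64 bit global
 * counter is updated when the delta grows past the batch size. */
static void demo_counter_add(struct demo_percpu_counter *c, int cpu,
			     int32_t amount)
{
	int32_t v = c->local[cpu] + amount;

	if (v >= DEMO_BATCH || v <= -DEMO_BATCH) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Exact value: global count plus all outstanding local deltas. */
static int64_t demo_counter_sum(const struct demo_percpu_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < DEMO_NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

    The global count may lag behind by up to the batch size per CPU, which is
    the price of keeping the common path to a 32 bit update.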

    Signed-off-by: Mingming Cao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.
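
    The default behaviour described above can be mimicked with mock
    structures. Everything here is a hypothetical stand-in: the demo_* types
    only carry the fields named in the text, and bumping a reference count on
    the root dentry is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Mock types carrying only the fields mentioned in the text. */
struct demo_dentry {
	int refcount;
};

struct demo_super_block {
	struct demo_dentry *s_root;
};

struct demo_vfsmount {
	struct demo_super_block *mnt_sb;
	struct demo_dentry *mnt_root;
};

/* Set the superblock pointer, then point the mount's root at the
 * superblock's s_root. Always returns 0. */
static int demo_simple_set_mnt(struct demo_vfsmount *mnt,
			       struct demo_super_block *sb)
{
	mnt->mnt_sb = sb;
	sb->s_root->refcount++;	/* assumed: the mount holds a reference */
	mnt->mnt_root = sb->s_root;
	return 0;
}

/* A get_sb-style operation now returns an integer and can simply
 * tail-call the helper once the superblock is ready. */
static int demo_get_sb(struct demo_super_block *sb, struct demo_vfsmount *mnt)
{
	return demo_simple_set_mnt(mnt, sb);
}
```

    A filesystem that shares one superblock across several mount points would
    skip the helper and fill in mnt_root and mnt_sb itself.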

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

26 Mar, 2006

1 commit

  • If I mount ext2 "rw", I want it to say "rw", not "rw,nogrpid".

    I caught this writing an automated regression test script for the busybox
    mount command. The symptom is
    /dev/loop0 on /images/ext2.dir type ext2 (rw,nogrpid)
    instead of:
    /dev/loop0 on /images/ext2.dir type ext2 (rw)

    The behavior was introduced by git commit
    8fc2751beb0941966d3a97b26544e8585e428c08.

    Signed-off-by: Rob Landley
    Cc: Mark Bellon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     

24 Mar, 2006

3 commits

  • Rewrap the overly long source code lines resulting from the previous
    patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch
    contains only formatting changes, and no function change.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
    memory spreading.

    If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
    in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
    from such a slab cache, the allocations are spread evenly over all the
    memory nodes (task->mems_allowed) allowed to that task, instead of favoring
    allocation on the node local to the current cpu.

    The following inode and similar caches are marked SLAB_MEM_SPREAD:

    file cache
    ==== =====
    fs/adfs/super.c adfs_inode_cache
    fs/affs/super.c affs_inode_cache
    fs/befs/linuxvfs.c befs_inode_cache
    fs/bfs/inode.c bfs_inode_cache
    fs/block_dev.c bdev_cache
    fs/cifs/cifsfs.c cifs_inode_cache
    fs/coda/inode.c coda_inode_cache
    fs/dquot.c dquot
    fs/efs/super.c efs_inode_cache
    fs/ext2/super.c ext2_inode_cache
    fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr
    fs/ext3/super.c ext3_inode_cache
    fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr
    fs/fat/cache.c fat_cache
    fs/fat/inode.c fat_inode_cache
    fs/freevxfs/vxfs_super.c vxfs_inode
    fs/hpfs/super.c hpfs_inode_cache
    fs/isofs/inode.c isofs_inode_cache
    fs/jffs/inode-v23.c jffs_fm
    fs/jffs2/super.c jffs2_i
    fs/jfs/super.c jfs_ip
    fs/minix/inode.c minix_inode_cache
    fs/ncpfs/inode.c ncp_inode_cache
    fs/nfs/direct.c nfs_direct_cache
    fs/nfs/inode.c nfs_inode_cache
    fs/ntfs/super.c ntfs_big_inode_cache_name
    fs/ntfs/super.c ntfs_inode_cache
    fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache
    fs/ocfs2/super.c ocfs2_inode_cache
    fs/proc/inode.c proc_inode_cache
    fs/qnx4/inode.c qnx4_inode_cache
    fs/reiserfs/super.c reiser_inode_cache
    fs/romfs/inode.c romfs_inode_cache
    fs/smbfs/inode.c smb_inode_cache
    fs/sysv/inode.c sysv_inode_cache
    fs/udf/super.c udf_inode_cache
    fs/ufs/super.c ufs_inode_cache
    net/socket.c sock_inode_cache
    net/sunrpc/rpc_pipe.c rpc_inode_cache

    The choice of which slab caches to so mark was quite simple. I marked
    those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
    inode_cache, and buffer_head, which were marked in a previous patch. Even
    though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
    potentially large file system i/o related slab caches as we need for memory
    spreading.

    Given that the rule now becomes "wherever you would have used a
    SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
    the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
    Future file system writers will just copy one of the existing file system
    slab cache setups and tend to get it right without thinking.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add a proper prototype for ext2_get_parent().

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

04 Feb, 2006

1 commit


10 Jan, 2006

1 commit


14 Nov, 2005

1 commit


09 Nov, 2005

1 commit


08 Sep, 2005

1 commit

  • If /etc/mtab is a regular file all of the mount options (of a file system)
    are written to /etc/mtab by the mount command. The quota tools look there
    for the quota strings for their operation. If, however, /etc/mtab is a
    symlink to /proc/mounts (a "good thing" in some environments) the tools
    don't write anything - they assume the kernel will take care of things.

    While the quota options are sent down to the kernel via the mount system
    call and the file system codes handle them properly, unfortunately there is
    no code to echo the quota strings into /proc/mounts, and the quota tools
    fail in the symlink case.

    The attached patches modify the EXT[2|3] and JFS codes to add the necessary
    hooks. The show_options function of each file system in these patches
    currently deals with only those things that seemed related to quotas;
    especially in the EXT3 case more can be done (later?).

    Jan Kara also noted the difficulty in moving these changes above the FS
    codes, responding similarly to me to Andrew's comment about a possible
    VFS migration. Issue summary:

    - FS codes have to process the entire string of options anyway.

    - Only FS codes that use quotas must have a show_options function (for
    quotas to work properly); however, quotas are only used in a small number
    of FS.

    - Since most of the quota-using FS support other options, these FS codes
    should have a show_options function to show those options - and the
    quota echoing becomes virtually negligible.

    Based on feedback I have modified my patches from the original:

    JFS a missing patch has been restored to the posting

    EXT[2|3] and JFS always use the show_options function
      - Each FS has at least one FS specific option displayed
      - QUOTA output is under a CONFIG_QUOTA ifdef
      - a follow-on patch will add a multitude of options for each FS

    EXT[2|3] and JFS "quota" is treated as "usrquota"

    EXT3 journalled data check for journalled quota removed

    EXT[2|3] mount when quota specified but not compiled in
      - no changes from my original patch. I tested the patch and the codes
        warn but still mount. With all due respect I believe the comments
        otherwise were a misread of the patch. Please reread/test and
        comment.

    XFS patch removed - the XFS team already made the necessary changes

    EXT3 mixing old and new quotas are handled differently (not purely
    exclusive)
      - if old and new quotas for the same type are used together the old
        type is silently deprecated for compatibility (e.g. usrquota and
        usrjquota)
      - mixing of old and new quotas is an error (e.g. usrjquota and
        grpquota)
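
    What echoing the quota options amounts to can be sketched in userspace C.
    The flag values, the DEMO_CONFIG_QUOTA switch and the function name are
    hypothetical, and the sketch appends to a caller-supplied buffer instead
    of writing to /proc/mounts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mount option bits. */
#define DEMO_MOUNT_USRQUOTA 0x1u
#define DEMO_MOUNT_GRPQUOTA 0x2u

/* Set to 0 to model a build without quota support. */
#define DEMO_CONFIG_QUOTA 1

/* Write the quota-related option strings for a mount into buf, which
 * holds len bytes. Output is truncated if buf is too small. */
static void demo_show_options(unsigned int mount_opt, char *buf, size_t len)
{
	if (len == 0)
		return;
	buf[0] = '\0';
#if DEMO_CONFIG_QUOTA
	if (mount_opt & DEMO_MOUNT_USRQUOTA)
		strncat(buf, ",usrquota", len - strlen(buf) - 1);
	if (mount_opt & DEMO_MOUNT_GRPQUOTA)
		strncat(buf, ",grpquota", len - strlen(buf) - 1);
#endif
}
```

    A tool reading the mount table through a symlinked /etc/mtab would then
    see the same quota strings it would have found in a regular mtab file.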

    Signed-off-by: Mark Bellon
    Acked-by: Dave Kleikamp
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Bellon
     

13 Jul, 2005

1 commit


24 Jun, 2005

1 commit


17 Apr, 2005

2 commits

  • Whilst trying to stress test a Promise SX8 card, we stumbled across
    some nasty filesystem corruption in ext2. Our tests involved
    creating an ext2 partition, mounting, running several concurrent
    fsx's over it, umounting, and fsck'ing, all scripted[1]. The fsck
    would always return with errors.

    This regression was traced back to a change between 2.6.9 and
    2.6.10, which moves the functionality of ext2_put_inode into
    ext2_clear_inode. The attached patch reverses this change, and
    eliminates the source of corruption.

    Mingming Cao said:

    I think his patch for ext2 is correct. The corruption on ext3 is not the same
    issue he saw on ext2. I believe that's the race between discard reservation
    and reservation in-use that we already fixed in 2.6.12-rc1.

    For the problem related to ext2: at the time when we designed reservation for
    ext3, we decided we only need to discard the reservation at the last file
    close, so we have ext3_discard_reservation on iput_final->ext3_clear_inode.

    ext2 handled discarding preallocation differently at that time: it discarded
    the preallocation at each iput(), not in iput_final(). We thought it was
    unnecessary to thrash it so frequently, and that the right thing to do, as we
    did for ext3 reservation, was to discard preallocation on the last iput(). So
    we moved ext2_discard_preallocation from ext2_put_inode() to
    ext2_clear_inode.

    Since ext2 preallocation is done on disk, it is possible that at unmount
    time someone still holds a reference to the inode, so the preallocation for
    a file is not discarded yet. We then still mark those blocks allocated on
    disk while they are not actually in the inode's block map, so fsck will
    catch/fix that error later.

    This is not an issue for ext3, as ext3 reservation (pre-allocation) is done
    in memory.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernard Blackham
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds