09 Jan, 2009

5 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     
  • Use the new generic implementation.

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • At the moment there are few restrictions on which flags may be set on
    which inodes. Specifically DIRSYNC may only be set on directories and
    IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
    TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
    on non-regular file, non-directories.

    Introduces a flags masking function which masks flags based on mode and
    use it during inode creation and when flags are set via the ioctl to
    facilitate future consistency.

    Signed-off-by: Duane Griffin
    Acked-by: Andreas Dilger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • At present INDEX is the only flag that new ext3 inodes do NOT inherit from
    their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and
    TOPDIR from being inherited. List inheritable flags explicitly to prevent
    future flags from accidentally being inherited.

    This fixes the TOPDIR flag inheritance bug reported at
    http://bugzilla.kernel.org/show_bug.cgi?id=9866.

    Signed-off-by: Duane Griffin
    Acked-by: Andreas Dilger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
    which makes it a very bad fit for SLAB allocators. The culprit of the
    wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
    NR_CPUS >= 32.

    To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
    page in the worst case, separately. This shinks down struct ext3_sb_info
    enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
    32 KB saving 15 KB of memory.

    Acked-by: Andreas Dilger
    Signed-off-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

06 Jan, 2009

3 commits


05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in ints write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Jan, 2009

2 commits


07 Dec, 2008

1 commit

  • This fixes a gcc warning but it doesn't appear able to result in a
    failure, since the primary way the loop is exited is the first
    conditional in the for loop, and at least for a consistent filesystem,
    the signed/unsigned should in practice never be exposed.

    Signed-off-by: Roel Kluin
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

14 Nov, 2008

2 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Stephen Tweedie
    Cc: Andrew Morton
    Cc: adilger@sun.com
    Cc: linux-ext4@vger.kernel.org
    Signed-off-by: James Morris

    David Howells
     

13 Nov, 2008

1 commit


07 Nov, 2008

1 commit

  • In ext3_sync_fs, we only wait for a commit to finish if we started it, but
    there may be one already in progress which will not be synced.

    In the case of a data=ordered umount with pending long symlinks which are
    delayed due to a long list of other I/O on the backing block device, this
    causes the buffer associated with the long symlinks to not be moved to the
    inode dirty list in the second phase of fsync_super. Then, before they
    can be dirtied again, kjournald exits, seeing the UMOUNT flag and the
    dirty pages are never written to the backing block device, causing long
    symlink corruption and exposing new or previously freed block data to
    userspace.

    This can be reproduced with a script created
    by Eric Sandeen :

    #!/bin/bash

    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    rm -f /mnt/test2/*
    dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
    touch
    /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    ln -s
    /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    /mnt/test2/link
    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    ls /mnt/test2/
    umount /mnt/test2

    To ensure all commits are synced, we flush all journal commits now when
    sync_fs'ing ext3.

    Signed-off-by: Arthur Jones
    Cc: Eric Sandeen
    Cc: Theodore Ts'o
    Cc:
    Cc: [2.6.everything]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arthur Jones
     

29 Oct, 2008

1 commit

  • The original ext3 hash algorithms assumed that variables of type char
    were signed, as God and K&R intended. Unfortunately, this assumption
    is not true on some architectures. Userspace support for marking
    filesystems with non-native signed/unsigned chars was added two years
    ago, but the kernel-side support was never added (until now).

    Signed-off-by: "Theodore Ts'o"
    Cc: akpm@linux-foundation.org
    Cc: linux-kernel@vger.kernel.org

    Theodore Ts'o
     

28 Oct, 2008

1 commit

  • Vegard Nossum reported a bug which accesses freed memory (found via
    kmemcheck). When journal has been aborted, ext3_put_super() calls
    ext3_abort() after freeing the journal_t object, and then ext3_abort()
    accesses it. This patch fix it.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Hidehiro Kawai
     

26 Oct, 2008

1 commit

  • Fix a regression caused by commit 6a897cf4, "ext3: fix ext3_dx_readdir
    hash collision handling", where deleting files in a large directory
    (requiring more than one getdents system call), results in some
    filenames being returned twice. This was caused by a failure to
    update info->curr_hash and info->curr_minor_hash, so that if the
    directory had gotten modified since the last getdents() system call
    (as would be the case if the user is running "rm -r" or "git clean"),
    a directory entry would get returned twice to the userspace.

    This patch fixes the bug reported by Markus Trippelsdorf at:
    http://bugzilla.kernel.org/show_bug.cgi?id=11844

    Signed-off-by: "Theodore Ts'o"
    Tested-by: Markus Trippelsdorf

    Theodore Ts'o
     

24 Oct, 2008

3 commits

  • This one was due to a merge error: we added a use of nd.path in commit
    2d7c820e56ce83b23daee9eb5343730fb309418e ("ext3: add checks for errors
    from jbd"), and concurrently we got rid of 'nd' and used a naked 'path'
    in commit 8264613def2e5c4f12bc3167713090fd172e6055 ("[PATCH] switch
    quota_on-related stuff to kern_path()").

    That all merged cleanly, but it didn't actually _work_. This should fix
    it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
    [PATCH] kill the rest of struct file propagation in block ioctls
    [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
    [PATCH] get rid of blkdev_locked_ioctl()
    [PATCH] get rid of blkdev_driver_ioctl()
    [PATCH] sanitize blkdev_get() and friends
    [PATCH] remember mode of reiserfs journal
    [PATCH] propagate mode through swsusp_close()
    [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
    [PATCH] pass fmode_t to blkdev_put()
    [PATCH] kill the unused bsize on the send side of /dev/loop
    [PATCH] trim file propagation in block/compat_ioctl.c
    [PATCH] end of methods switch: remove the old ones
    [PATCH] switch sr
    [PATCH] switch sd
    [PATCH] switch ide-scsi
    [PATCH] switch tape_block
    [PATCH] switch dcssblk
    [PATCH] switch dasd
    [PATCH] switch mtd_blkdevs
    [PATCH] switch mmc
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits)
    [PATCH] fs: add a sanity check in d_free
    [PATCH] i_version: remount support
    [patch] vfs: make security_inode_setattr() calling consistent
    [patch 1/3] FS_MBCACHE: don't needlessly make it built-in
    [PATCH] move executable checking into ->permission()
    [PATCH] fs/dcache.c: update comment of d_validate()
    [RFC PATCH] touch_mnt_namespace when the mount flags change
    [PATCH] reiserfs: add missing llseek method
    [PATCH] fix ->llseek for more directories
    [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent
    [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup
    [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate()
    [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper
    [PATCH vfs-2.6 2/6] vfs: add d_ancestor()
    [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT()
    [PATCH] get rid of on-stack dentry in udf
    [PATCH 2/2] anondev: switch to IDA
    [PATCH 1/2] anondev: init IDR statically
    [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup()
    [PATCH] Optimise NFS readdir hack slightly.
    ...

    Linus Torvalds
     

23 Oct, 2008

4 commits


21 Oct, 2008

2 commits


20 Oct, 2008

6 commits

  • A very large directory with many read failures (either due to storage
    problems, or due to invalid size & blocks from corruption) will generate a
    printk storm as the filesystem continues to try to read all the blocks.
    This flood of messages can tie up the box until it is complete - which may
    be a very long time, especially for very large corrupted values.

    This is fixed by only reporting the corruption once each time we try to
    read the directory.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Cc: Eugene Teo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • For blocksize < pagesize we need to remove blocks that got allocated in
    block_write_begin() if we fail with ENOSPC for later blocks.
    block_write_begin() internally does this if it allocated page locally.
    This makes sure we don't have blocks outside inode.i_size during ENOSPC.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • This fixes a bug where readdir() would return a directory entry twice
    if there was a hash collision in an hash tree indexed directory.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eugene Dashevsky
    Signed-off-by: Mike Snitzer
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eugene Dashevsky
     
  • If the journal doesn't abort when it gets an IO error in file data blocks,
    the file data corruption will spread silently. Because most of
    applications and commands do buffered writes without fsync(), they don't
    notice the IO error. It's scary for mission critical systems. On the
    other hand, if the journal aborts whenever it gets an IO error in file
    data blocks, the system will easily become inoperable. So this patch
    introduces a filesystem option to determine whether it aborts the journal
    or just call printk() when it gets an IO error in file data.

    If you mount a ext3 fs with data_err=abort option, it aborts on file data
    write error. If you mount it with data_err=ignore, it doesn't abort, just
    call printk(). data_err=ignore is the default.

    Signed-off-by: Hidehiro Kawai
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • We could run into ENOSPC error on ext3, even when there is free blocks on
    the filesystem.

    The problem is triggered in the case the goal block group has 0 free
    blocks , and the rest block groups are skipped due to the check of
    "free_blocks < windowsz/2". Current code could fall back to non
    reservation allocation to prevent early ENOSPC after examing all the block
    groups with reservation on , but this code was bypassed if the reservation
    window is turned off already, which is true in this case.

    This patch fixed two issues:
    1) We don't need to turn off block reservation if the goal block group has
    0 free blocks left and continue search for the rest of block groups.

    Current code the intention is to turn off the block reservation if the
    goal allocation group has a few (some) free blocks left (not enough for
    make the desired reservation window),to try to allocation in the goal
    block group, to get better locality. But if the goal blocks have 0 free
    blocks, it should leave the block reservation on, and continues search for
    the next block groups,rather than turn off block reservation completely.

    2) we don't need to check the window size if the block reservation is off.

    The problem was originally found and fixed in ext4.

    Signed-off-by: Mingming Cao
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • When trying to resize a ext3 fs and you run out of reserved gdt blocks,
    you get an error that doesn't actually tell you what went wrong, it just
    says that the gdb it picked is not correct, which is the case since you
    don't have any reserved gdt blocks left. This patch adds a check to make
    sure you have reserved gdt blocks to use, and if not prints out a more
    relevant error.

    Signed-off-by: Josef Bacik
    Cc:
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

14 Oct, 2008

1 commit

  • This is a much better version of a previous patch to make the parser
    tables constant. Rather than changing the typedef, we put the "const" in
    all the various places where its required, allowing the __initconst
    exception for nfsroot which was the cause of the previous trouble.

    This was posted for review some time ago and I believe its been in -mm
    since then.

    Signed-off-by: Steven Whitehouse
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Steven Whitehouse
     

04 Oct, 2008

1 commit

  • Any block based fs (this patch includes ext3) just has to declare its own
    fiemap() function and then call this generic function with its own
    get_block_t. This works well for block based filesystems that will map
    multiple contiguous blocks at one time, but will work for filesystems that
    only map one block at a time, you will just end up with an "extent" for each
    block. One gotcha is this will not play nicely where there is hole+data
    after the EOF. This function will assume its hit the end of the data as soon
    as it hits a hole after the EOF, so if there is any data past that it will
    not pick that up. AFAIK no block based fs does this anyway, but its in the
    comments of the function anyway just in case.

    Signed-off-by: Josef Bacik
    Signed-off-by: Mark Fasheh
    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org

    Josef Bacik
     

01 Aug, 2008

1 commit

  • * new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the
    pathname resolution.
    * callers of vfs_quota_on() that do their own pathname resolution and
    checks based on it are switched to vfs_quota_on_path(); that way we
    avoid the races.
    * reiserfs leaked dentry/vfsmount references on several failure exits.

    Signed-off-by: Al Viro

    Al Viro
     

29 Jul, 2008

1 commit

  • When we read some part of a file through pagecache, if there is a
    pagecache of corresponding index but this page is not uptodate, read IO
    is issued and this page will be uptodate.

    I think this is good for pagesize == blocksize environment but there is
    room for improvement on pagesize != blocksize environment. Because in
    this case a page can have multiple buffers and even if a page is not
    uptodate, some buffers can be uptodate.

    So I suggest that when all buffers which correspond to a part of a file
    that we want to read are uptodate, use this pagecache and copy data from
    this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    I wrote a benchmark program and got result number with this program.

    This benchmark do:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close a file and umount.

    4: mount and again open a test file.

    5: pwrite randomly 300000 times on a test file. offset is aligned
    by IO size(1024bytes).

    6: measure time of preading randomly 100000 times on a test file.

    The result was:
    2.6.26
    330 sec

    2.6.26-patched
    226 sec

    Arch:i386
    Filesystem:ext3
    Blocksize:1024 bytes
    Memory: 1GB

    On ext3/4, a file is written through buffer/block. So random read/write
    mixed workloads or random read after random write workloads are optimized
    with this patch under pagesize != blocksize environment. This test result
    showed this.

    The benchmark program is as follows:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define LEN 1024
    #define LOOP 1024*512 /* 512MB */

    main(void)
    {
    unsigned long i, offset, filesize;
    int fd;
    char buf[LEN];
    time_t t1, t2;

    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    memset(buf, 0, LEN);
    fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }
    for (i = 0; i < LOOP; i++)
    write(fd, buf, LEN);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    fd = open("/root/test1/testfile", O_RDWR);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }

    filesize = LEN * LOOP;
    for (i = 0; i < 300000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pwrite(fd, buf, LEN, offset);
    }
    printf("start test\n");
    time(&t1);
    for (i = 0; i < 100000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pread(fd, buf, LEN, offset);
    }
    time(&t2);
    printf("%ld sec\n", t2-t1);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

27 Jul, 2008

2 commits

  • * kill nameidata * argument; map the 3 bits in ->flags anybody cares
    about to new MAY_... ones and pass with the mask.
    * kill redundant gfs2_iop_permission()
    * sanitize ecryptfs_permission()
    * fix remaining places where ->permission() instances might barf on new
    MAY_... found in mask.

    The obvious next target in that direction is permission(9)

    folded fix for nfs_permission() breakage from Miklos Szeredi

    Signed-off-by: Al Viro

    Al Viro
     
  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
    passed kmem cache in non-trivial way, so pass only pointer to object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan