24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

23 Sep, 2009

1 commit

  • Unlike on most other architectures ino_t is an unsigned int on s390. So
    add an explicit cast to avoid this compile warning:

    fs/ext2/namei.c: In function 'ext2_lookup':
    fs/ext2/namei.c:73: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'ino_t'

    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
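
The cast pattern described above can be sketched in user space as follows. This is only an illustration: `format_ino()` and its message text are hypothetical and are not the code in fs/ext2/namei.c.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not the ext2 code: ino_t may be unsigned int
 * (as on s390) or unsigned long, so cast explicitly to the type that
 * the %lu conversion expects.
 */
static int format_ino(char *buf, size_t len, ino_t ino)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "ino=%lu", (unsigned long)ino);
}
```

The cast is harmless where ino_t is already unsigned long and silences the format warning where it is narrower.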

    Heiko Carstens
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • Enable removing of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs.
    These should cover most server needs.

    I chose the set of migration aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
     

14 Sep, 2009

1 commit


09 Sep, 2009

1 commit


06 Sep, 2009

1 commit

  • In ext2_rename(), dir_page is acquired through ext2_dotdot(). It is
    then released through ext2_set_link() but only if old_dir != new_dir.
    Failing that, the pkmap reference count is never decremented and the
    page remains pinned forever. Repeat that a couple of times with highmem
    pages and all pkmap slots get exhausted, and every further kmap() call
    ends up stalling on the pkmap_map_wait queue, at which point the whole
    system comes to a halt.

    Signed-off-by: Nicolas Pitre
    Acked-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
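
The leak described above can be modelled with a toy slot pool. This is a sketch under stated assumptions, not kernel code: `slot_pool`, `pool_get()`, `pool_put()` and `toy_rename()` are hypothetical names, and the pool merely stands in for the limited set of pkmap slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the limited pool of pkmap slots. */
struct slot_pool {
    int in_use;
    int capacity;
};

static bool pool_get(struct slot_pool *p)
{
    if (p->in_use == p->capacity)
        return false;           /* the kernel would stall here instead */
    p->in_use++;
    return true;
}

static void pool_put(struct slot_pool *p)
{
    assert(p->in_use > 0);
    p->in_use--;
}

/*
 * Hypothetical rename path: the slot is always taken, but the original
 * code only released it when the directories differed.
 */
static bool toy_rename(struct slot_pool *p, bool same_dir,
                       bool release_on_same_dir)
{
    if (!pool_get(p))
        return false;
    if (!same_dir)
        pool_put(p);            /* released as part of the link update */
    else if (release_on_same_dir)
        pool_put(p);            /* the release that was missing */
    return true;
}
```

With `release_on_same_dir` set to false, each same-directory rename pins one more slot until `pool_get()` fails; with it set to true, the pool returns to empty after every call.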

    Nicolas Pitre
     

13 Jul, 2009

1 commit

  • * Remove smp_lock.h from files which don't need it (including some headers!)
    * Add smp_lock.h to files which do need it
    * Make smp_lock.h include conditional in hardirq.h
    It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

    This will make hardirq.h inclusion cheaper for every PREEMPT=n config
    (which includes allmodconfig/allyesconfig, BTW)

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

01 Jul, 2009

1 commit

  • ext2_iget() returns -ESTALE if invoked on a deleted inode, in order to
    report errors to NFS properly. However, in ext[234]_lookup(), this
    -ESTALE can be propagated to userspace if the filesystem is corrupted such
    that a directory entry references a deleted inode. This leads to a
    misleading error message - "Stale NFS file handle" - and confusion on the
    part of the admin.

    The bug can be easily reproduced by creating a new filesystem, making a
    link to an unused inode using debugfs, then mounting and attempting to ls
    -l said link.

    This patch thus changes ext2_lookup to return -EIO if it receives -ESTALE
    from ext2_iget(), as ext2 does for other filesystem metadata corruption;
    and also invokes the appropriate ext*_error functions when this case is
    detected.

    Signed-off-by: Bryan Donlan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
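
The error translation described above amounts to a small mapping. A minimal sketch, assuming a hypothetical helper `lookup_errno()` that receives the negative errno returned by an iget-style call (the real change also reports the corruption through the ext*_error functions, which is omitted here):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: on the lookup path a "stale" inode means the
 * directory entry points at a deleted inode, i.e. on-disk corruption,
 * so report -EIO instead of the NFS-specific -ESTALE.
 */
static int lookup_errno(int iget_err)
{
    assert(iget_err <= 0);
    if (iget_err == -ESTALE)
        return -EIO;
    return iget_err;
}
```

All other error codes pass through unchanged, so only the misleading "Stale NFS file handle" case is affected.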

    Bryan Donlan
     

24 Jun, 2009

2 commits


19 Jun, 2009

1 commit

  • One of our users is complaining that his backup tool is upset on ext2
    (while it's happy on ext3, xfs, ...) because of the mtime change.

    The problem is:

    mkdir foo
    mkdir bar
    mkdir foo/a

    Now under ext2:
    mv foo/a foo/b

    changes mtime of 'foo/a' (foo/b after the move). That does not really
    make sense and it does not happen under any other filesystem I've seen.

    More complicated is:
    mv foo/a bar/a

    This changes mtime of foo/a (bar/a after the move) and it makes some
    sense since we had to update parent directory pointer of foo/a. But
    again, no other filesystem does this. So after some thought I'd vote
    for consistency and change ext2 to behave the same as other filesystems.

    Do not update mtime of a moved directory. Specs don't say anything
    about it (neither that it should, nor that it should not be updated) and
    other common filesystems (ext3, ext4, xfs, reiserfs, fat, ...) don't do
    it. So let's become more consistent.

    Spotted by ronny.pretzsch@dfs.de, initial fix by Jörn Engel.

    Reported-by:
    Cc:
    Cc: Jörn Engel
    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
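
The first scenario above (renaming a directory within the same parent) can also be checked from C. This is a sketch only: `moved_dir_mtime_changed()` is a hypothetical helper, the /tmp location is an assumption (point it at the filesystem under test), and the one-second sleep is there because timestamps may only have one-second granularity.

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if the mtime of the moved directory changed across the
 * rename, 0 if it did not, and -1 if the test could not be set up.
 */
static int moved_dir_mtime_changed(void)
{
    char base[] = "/tmp/mvtestXXXXXX";   /* assumed scratch location */
    char a[PATH_MAX], b[PATH_MAX];
    struct stat before, after;
    int rc = -1;

    if (!mkdtemp(base))
        return -1;
    snprintf(a, sizeof(a), "%s/a", base);
    snprintf(b, sizeof(b), "%s/b", base);

    if (mkdir(a, 0700) == 0 && stat(a, &before) == 0) {
        sleep(1);                        /* make a timestamp change visible */
        if (rename(a, b) == 0 && stat(b, &after) == 0)
            rc = (before.st_mtime != after.st_mtime);
    }
    rmdir(b);                            /* clean up whichever name exists */
    rmdir(a);
    rmdir(base);
    return rc;
}
```

A return value of 0 corresponds to the behaviour the commit message argues for; 1 corresponds to the old ext2 behaviour.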

    Jan Kara
     

13 Jun, 2009

1 commit


12 Jun, 2009

5 commits

  • Add a ->sync_fs method for data integrity syncs, and reimplement
    ->write_super on top of it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • kill ext2_sync_file() (along with ext2/fsync.c), get rid of
    ext2_update_inode() - it's an alias of ext2_write_inode().

    Signed-off-by: Al Viro

    Al Viro
     
  • [xfs, btrfs, capifs, shmem don't need BKL, exempt]

    Signed-off-by: Alessio Igor Bogani
    Signed-off-by: Al Viro

    Alessio Igor Bogani
     
  • Move BKL into ->put_super from the only caller. A couple of
    filesystems had trivial enough ->put_super (only kfree and NULLing of
    s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
    hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
    of them probably don't need it, but I'd rather sort that out individually.
    Preferably after all the other BKL pushdowns in that area.

    [AV: original used to move lock_super() down as well; these changes are
    removed since we don't do lock_super() at all in generic_shutdown_super()
    now]
    [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • We just did a full fs writeout using sync_filesystem before, and if
    that's not enough for the filesystem it can perform its own writeout
    in ->put_super, which many filesystems already do.

    Move a call to foofs_write_super into every foofs_put_super for now to
    guarantee identical behaviour until it's cleaned up by the individual
    filesystem maintainers.

    Exceptions:

    - affs already has identical copy & pasted code at the beginning of
    affs_put_super so no need to do it twice.
    - xfs does the right thing without it and I have changes pending for
    the xfs tree touching this area so I don't really need conflicts
    here.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

18 May, 2009

1 commit


27 Apr, 2009

1 commit


14 Apr, 2009

1 commit

  • If two writers allocating blocks to a file race with each other (e.g.
    because writepages races with ordinary write or two writepages race with
    each other), ext2_getblock() can be called on the same inode in parallel.
    Before we are going to allocate new blocks, we have to recheck the block
    chain we have obtained so far without holding truncate_mutex. Otherwise
    we could overwrite the indirect block pointer set by the other writer
    leading to data loss.

    The test program below, by Ying, is able to reproduce the data loss with
    ext2 on BRD in a few minutes if the machine is under memory pressure:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    long kMemSize = 50 << 20;
    int kPageSize = 4096;

    int main(int argc, char **argv) {
        int status;
        int count = 0;
        int i;
        char *fname = "/mnt/test.mmap";
        char *mem;

        /* Start from a fresh file of kMemSize bytes and map it shared. */
        unlink(fname);
        int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
        status = ftruncate(fd, kMemSize);
        mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        // Fill the memory with 1s.
        memset(mem, 1, kMemSize);
        sleep(2);
        /* Count pages whose first byte has reverted to zero. */
        for (i = 0; i < kMemSize; i++) {
            int byte_good = mem[i] != 0;
            if (!byte_good && ((i % kPageSize) == 0)) {
                //printf("%d ", i / kPageSize);
                count++;
            }
        }
        munmap(mem, kMemSize);
        close(fd);
        unlink(fname);

        if (count > 0) {
            printf("Running %d bad page\n", count);
            return 1;
        }
        return 0;
    }

    Cc: Ying Han
    Cc: Nick Piggin
    Signed-off-by: Jan Kara
    Cc: Mingming Cao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

01 Apr, 2009

1 commit


26 Mar, 2009

2 commits


12 Feb, 2009

1 commit

  • For a reason that I was unable to understand in three months of debugging,
    mount ext2 -o remount stopped working properly when remounting from
    regular operation to xip, or the other way around. According to a git
    bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
    rework in the vm:

    commit 70688e4dd1647f0ceb502bbd5964fa344c5eb411
    Author: Nick Piggin
    Date: Mon Apr 28 02:13:02 2008 -0700

    xip: support non-struct page backed memory

    In the failing scenario, the filesystem is mounted read only via root=
    kernel parameter on s390x. During remount (in rc.sysinit), the inodes of
    the bash binary and its libraries are busy and cannot be invalidated (the
    bash which is running rc.sysinit resides on subject filesystem).
    Afterwards, another bash process (running ifup-eth) recurses into a
    subshell, runs dup_mm (via fork). Some of the mappings in this bash
    process were created from inodes that could not be invalidated during
    remount.

    Both parent and child process crash some time later due to inconsistencies
    in their address spaces. The issue seems to be timing sensitive; various
    attempts to recreate it have failed.

    This patch refuses to change the xip flag during remount in case some
    inodes cannot be invalidated. This patch keeps users from running into
    that issue.

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Carsten Otte
    Cc: Nick Piggin
    Cc: Jared Hulbert
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     

16 Jan, 2009

1 commit

  • We used to just write the changed page for IS_DIRSYNC inodes. But we also
    have to update the directory inode itself in case we've
    allocated a new block and changed i_size.

    [akpm@linux-foundation.org: still sync the data page]
    Signed-off-by: Jan Kara
    Tested-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Jan, 2009

4 commits

  • At the moment there are few restrictions on which flags may be set on
    which inodes. Specifically, DIRSYNC may only be set on directories, and
    IMMUTABLE and APPEND may not be set on links. Tighten that to disallow
    TOPDIR being set on non-directories and to allow only NODUMP and NOATIME
    to be set on inodes that are neither regular files nor directories.

    Introduce a flags masking function which masks flags based on mode, and
    use it during inode creation and when flags are set via the ioctl to
    facilitate future consistency.

    Signed-off-by: Duane Griffin
    Acked-by: Andreas Dilger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • At present BTREE/INDEX is the only flag that new ext2 inodes do NOT
    inherit from their parent. In addition prevent the flags DIRTY, ECOMPR,
    INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags
    explicitly to prevent future flags from accidentally being inherited.

    This fixes the TOPDIR flag inheritance bug reported at
    http://bugzilla.kernel.org/show_bug.cgi?id=9866.

    Signed-off-by: Duane Griffin
    Acked-by: Andreas Dilger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
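
Listing inheritable flags explicitly, as described above, is an allow-list rather than a deny-list. A minimal sketch, assuming hypothetical `TOY_*` names, assumed bit values, and a deliberately shortened allow-list:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag bits, for illustration only. */
#define TOY_SYNC_FL    0x00000008u
#define TOY_NOATIME_FL 0x00000080u
#define TOY_DIRTY_FL   0x00000100u
#define TOY_ECOMPR_FL  0x00000800u
#define TOY_INDEX_FL   0x00001000u
#define TOY_IMAGIC_FL  0x00002000u
#define TOY_TOPDIR_FL  0x00020000u

/* Explicit allow-list (shortened here): new flags are excluded by default. */
#define TOY_FL_INHERITED (TOY_SYNC_FL | TOY_NOATIME_FL)

/* Flags a newly created inode takes over from its parent directory. */
static uint32_t toy_inherit_flags(uint32_t parent_flags)
{
    /* The allow-list must never contain TOPDIR. */
    assert((TOY_FL_INHERITED & TOY_TOPDIR_FL) == 0);
    return parent_flags & TOY_FL_INHERITED;
}
```

Because the mask names what is inherited, a flag added later is not inherited unless someone adds it to the list deliberately.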

    Duane Griffin
     
  • As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit
    which makes it a very bad fit for SLAB allocators. The culprit of the
    wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
    NR_CPUS >= 32.

    To fix that, allocate ->s_blockgroup_lock, which fits nicely in an order-2
    page in the worst case, separately. This shrinks struct ext2_sb_info
    enough to fit a 1 KB slab cache, so now we allocate 16 KB + 1 KB instead of
    32 KB, saving 15 KB of memory.

    Acked-by: Andreas Dilger
    Signed-off-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
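
The arithmetic above can be reproduced with a short sketch. It assumes that allocation sizes are rounded up to the next power of two and that the lock array is exactly 16 KB; `round_up_pow2()` and `bytes_saved()` are hypothetical helpers, not allocator APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: allocation sizes are rounded up to the next power of two. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes saved by allocating the lock array separately from the rest. */
static size_t bytes_saved(size_t struct_size, size_t lock_size)
{
    size_t before, after;

    assert(lock_size < struct_size);
    before = round_up_pow2(struct_size);
    after = round_up_pow2(lock_size) +
            round_up_pow2(struct_size - lock_size);
    return before - after;
}
```

With a 17024-byte structure, the single allocation rounds up to 32 KB; split into 16384 bytes of locks plus the remaining 640 bytes, it needs 16 KB + 1 KB, which matches the 15 KB saving quoted above.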

    Pekka J Enberg
     
  • There is no argument named @chain in ext2_splice_branch, remove references
    to it.

    Signed-off-by: Qinghuang Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qinghuang Feng
     

01 Jan, 2009

2 commits

  • * make ext2_new_inode() put the inode into icache in locked state
    * do not unlock until the inode is fully set up; otherwise nfsd
    might pick it in half-baked state.
    * make sure that ext2_new_inode() does *not* lead to two inodes with the
    same inumber hashed at the same time; otherwise a bogus fhandle coming
    from nfsd might race with inode creation:

    nfsd: iget_locked() creates inode
    nfsd: try to read from disk, block on that.
    ext2_new_inode(): allocate inode with that inumber
    ext2_new_inode(): insert it into icache, set it up and dirty
    ext2_write_inode(): get the relevant part of inode table in cache,
    set the entry for our inode (and start writing to disk)
    nfsd: get CPU again, look into inode table, see nice and sane on-disk
    inode, set the in-core inode from it

    oops - we have two in-core inodes with the same inumber live in icache,
    both used for IO. Welcome to fs corruption...

    Signed-off-by: Al Viro

    Al Viro
     
  • Ensure fast symlink targets are NUL-terminated, even if corrupted
    on-disk.

    Cc: Andrew Morton
    Signed-off-by: Duane Griffin
    Signed-off-by: Al Viro
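
Forcing termination regardless of what is on disk can be sketched as follows. `toy_terminate_link()` is a hypothetical helper; the caller is assumed to own a buffer of at least `maxlen + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write a NUL at min(len, maxlen), so that a corrupted on-disk length
 * can never leave the in-memory target unterminated.
 * Assumption: name points to at least maxlen + 1 bytes.
 */
static void toy_terminate_link(char *name, size_t len, size_t maxlen)
{
    assert(name != NULL);
    name[len < maxlen ? len : maxlen] = '\0';
}
```

Clamping to `maxlen` matters when the recorded length is larger than the space the target is stored in.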

    Duane Griffin
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: linux-ext4@vger.kernel.org
    Signed-off-by: James Morris

    David Howells
     

24 Oct, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
    [PATCH] kill the rest of struct file propagation in block ioctls
    [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
    [PATCH] get rid of blkdev_locked_ioctl()
    [PATCH] get rid of blkdev_driver_ioctl()
    [PATCH] sanitize blkdev_get() and friends
    [PATCH] remember mode of reiserfs journal
    [PATCH] propagate mode through swsusp_close()
    [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
    [PATCH] pass fmode_t to blkdev_put()
    [PATCH] kill the unused bsize on the send side of /dev/loop
    [PATCH] trim file propagation in block/compat_ioctl.c
    [PATCH] end of methods switch: remove the old ones
    [PATCH] switch sr
    [PATCH] switch sd
    [PATCH] switch ide-scsi
    [PATCH] switch tape_block
    [PATCH] switch dcssblk
    [PATCH] switch dasd
    [PATCH] switch mtd_blkdevs
    [PATCH] switch mmc
    ...

    Linus Torvalds
     

23 Oct, 2008

2 commits


21 Oct, 2008

2 commits


17 Oct, 2008

2 commits

  • A very large directory with many read failures (either due to storage
    problems, or due to invalid size & blocks from corruption) will generate a
    printk storm as the filesystem continues to try to read all the blocks.
    This flood of messages can tie up the box until it is complete - which may
    be a very long time, especially for very large corrupted values.

    This is fixed by only reporting the corruption once each time we try to
    read the directory.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Cc: Eugene Teo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • We could run into an ENOSPC error on ext2 even when there are free blocks
    on the filesystem.

    The problem is triggered when the goal block group has 0 free blocks and
    the rest of the block groups are skipped due to the check of
    "free_blocks < windowsz/2". Current code could fall back to non-reservation
    allocation to prevent early ENOSPC after examining all the block groups
    with reservation on, but this code was bypassed if the reservation window
    was already turned off, which is true in this case.

    This patch fixes two issues:
    1) We don't need to turn off block reservation if the goal block group has
    0 free blocks left; we should continue searching the rest of the block
    groups.

    The intention of the current code is to turn off block reservation if the
    goal allocation group has a few (some) free blocks left (not enough to
    make the desired reservation window), in order to try to allocate in the
    goal block group and get better locality. But if the goal block group has
    0 free blocks, it should leave block reservation on and continue searching
    the next block groups, rather than turning off block reservation completely.

    2) We don't need to check the window size if block reservation is off.

    The problem was originally found and fixed in ext4.

    Signed-off-by: Mingming Cao
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
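
The two fixes above can be traced in a simplified model of the group search. This is a sketch under heavy assumptions, not the ext2 allocator: `toy_pick_group()` is hypothetical, it picks the first acceptable group instead of allocating from bitmaps, and the `fixed` switch selects between the old and new behaviour as described in the message.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the index of the group an allocation would come from, or -1
 * for ENOSPC. free_blocks[g] is the free block count of group g.
 */
static int toy_pick_group(const unsigned int *free_blocks, int ngroups,
                          int goal, unsigned int windowsz, bool fixed)
{
    bool rsv = true;                /* reservation starts switched on */

    assert(ngroups > 0 && goal >= 0 && goal < ngroups);
    for (;;) {
        unsigned int goal_free = free_blocks[goal];

        /*
         * Too few blocks for a window in the goal group: drop the
         * reservation. Fix 1: not when the goal group is empty.
         */
        if (rsv && goal_free < windowsz && (!fixed || goal_free > 0))
            rsv = false;
        if (goal_free > 0)
            return goal;

        for (int i = 1; i < ngroups; i++) {
            int g = (goal + i) % ngroups;

            if (free_blocks[g] == 0)
                continue;
            /* Fix 2: only apply the window check with reservation on. */
            if ((!fixed || rsv) && free_blocks[g] < windowsz / 2)
                continue;
            return g;
        }

        if (!rsv)
            return -1;              /* fallback bypassed: early ENOSPC */
        rsv = false;                /* retry without reservations */
        windowsz = 0;
    }
}
```

For group free counts {0, 3, 2}, goal group 0 and a window of 8, the old behaviour returns -1 although five blocks are free, while the fixed behaviour falls back and returns group 1.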

    Mingming Cao