13 Oct, 2009

1 commit

  • This avoids updating the superblock write time when we are mounting
    the root file system read/only but we need to replay the journal. At
    that point, for people who are east of GMT and who make their clock
    tick in localtime for Windows bug-for-bug compatibility, updating the
    write time will cause e2fsck to complain and force a full file system
    check.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    Theodore Ts'o
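
The rule described above can be sketched in a few lines of C. This is a minimal model, not the kernel patch: the struct `sb_model` and the function `commit_super_model` are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the superblock fields that matter here. */
struct sb_model {
    bool     read_only;   /* mounted read/only (journal replay may still run) */
    uint32_t write_time;  /* last write time, seconds since the epoch */
};

/*
 * Record the write time only for writable mounts. Returns true when the
 * timestamp was updated and false when it was left alone.
 */
bool commit_super_model(struct sb_model *sb, uint32_t now)
{
    if (sb->read_only)
        return false;
    sb->write_time = now;
    return true;
}
```

With a read/only mount the stored time stays untouched, so a clock that is still wrong during early boot never reaches the superblock.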
     

24 Sep, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     

22 Sep, 2009

2 commits


16 Sep, 2009

4 commits

  • In case we fsync() a file and the inode is not dirty, we don't force a transaction
    to disk and hence don't flush disk caches. Thus file data could be just in disk
    caches and not on persistent storage. Fix the problem by flushing disk caches
    if we didn't force a transaction commit.

    Signed-off-by: Jan Kara

    Jan Kara
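
The decision this fix introduces can be modelled as a two-way choice. The sketch below is an assumption-laden illustration: `fsync_model` and the enum values are hypothetical, and it assumes a forced commit takes care of the disk cache on its own.

```c
#include <stdbool.h>

enum fsync_step {
    FSYNC_FORCE_COMMIT,  /* dirty inode: force the transaction to disk */
    FSYNC_FLUSH_CACHE    /* clean inode: no commit, so flush the disk cache */
};

/* Pick what fsync() must do so that file data reaches persistent storage. */
enum fsync_step fsync_model(bool inode_dirty)
{
    if (inode_dirty)
        return FSYNC_FORCE_COMMIT;
    /* No commit was forced, so data may still sit in the disk cache. */
    return FSYNC_FLUSH_CACHE;
}
```

Before the fix, the second branch would have been a plain return with no flush.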
     
  • I've been struggling with this off and on while I've been testing the
    data=guarded work. The symptom is corrupted orphan lists and inodes
    with the wrong i_size stored on disk. I was convinced the
    data=guarded code was just missing a call to ext3_mark_inode_dirty, but
    tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
    wasn't actually making it to the drive.

    ext3_mark_inode_dirty can be called without locks held (atime updates
    and a few others), so the data=guarded code uses locks while updating
    the in-memory inode, and then calls ext3_mark_inode_dirty
    without any locks held.

    But, ext3_mark_inode_dirty has no internal locking to make sure that
    only one CPU is updating the buffer head at a time. Generally this
    works out ok because everyone that changes the inode then calls
    ext3_mark_inode_dirty themselves. Even though it races, eventually
    someone updates the buffer heads and things move on.

    But there is still a risk of the wrong values getting in, and the
    data=guarded code seems to hit the race very often.

    Since everyone that changes the inode also logs it, it should be
    possible to fix this with some memory barriers. I'll leave that as an
    exercise to the reader and lock the buffer head instead.

    It is probably a good idea to have a different patch series for lockless
    bit flipping on the ext3 i_state field. ext3_do_update_inode clears
    EXT3_STATE_NEW with &= without any locks held.

    Signed-off-by: Chris Mason
    Signed-off-by: Jan Kara

    Chris Mason
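
One possible interleaving behind the symptom described above can be replayed deterministically. This is a single-threaded model under stated assumptions, not the real code path: `inode_model`, `buffer_model` and the three helpers are hypothetical, and the kernel's `lock_buffer()`/`unlock_buffer()` calls appear only as comments.

```c
/* Hypothetical miniature of an in-memory inode and its on-disk copy. */
struct inode_model  { long i_disksize; };
struct buffer_model { long raw_size; };

/* Step 1 of an unlocked update: read the in-memory value. */
long snapshot_size(const struct inode_model *inode)
{
    return inode->i_disksize;
}

/* Step 2 of an unlocked update: write a previously read value to the buffer. */
void store_size(struct buffer_model *bh, long size)
{
    bh->raw_size = size;
}

/*
 * Locked update: read and write happen while the buffer is held, so no
 * other updater can slip in between the two steps.
 */
void update_locked(struct buffer_model *bh, const struct inode_model *inode)
{
    /* lock_buffer(bh) would be taken here in the kernel */
    store_size(bh, snapshot_size(inode));
    /* unlock_buffer(bh) would be released here */
}
```

If one CPU calls `snapshot_size()`, a second CPU then changes the inode and stores the new value, and the first CPU finally stores its old snapshot, the buffer ends up with the stale size. Running both steps under the buffer lock rules that ordering out.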
     
  • During truncate we are sometimes forced to start a new transaction as the
    number of blocks to be journaled is both quite large and hard to predict. So
    far we restarted a transaction while holding truncate_mutex and that violates
    lock ordering because truncate_mutex ranks below transaction start (and it
    can lead to a real deadlock with ext3_get_blocks() allocating new blocks
    from ext3_writepage()).

    Luckily, the problem is easy to fix: We just drop the truncate_mutex before
    restarting the transaction and acquire it afterwards. We are safe to do this as
    by the time ext3_truncate() is called, all the page cache for the truncated
    part of the file is dropped and so writepage() cannot come and allocate new
    blocks in the part of the file we are truncating. The remaining writers are
    stopped by us holding i_mutex.

    Signed-off-by: Jan Kara

    Jan Kara
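
The lock-ordering argument above can be expressed as a tiny rank check. This is a sketch with invented names: the rank numbers and `may_acquire` are hypothetical, and only the relative order (transaction start outside, truncate_mutex inside) comes from the commit message.

```c
#include <stdbool.h>

/* Smaller number = outer lock; it must be taken first. */
enum lock_rank {
    RANK_NONE = 0,            /* nothing held */
    RANK_TRANSACTION = 1,     /* starting or restarting a transaction */
    RANK_TRUNCATE_MUTEX = 2   /* truncate_mutex ranks below transaction start */
};

/* A lock may be taken only if it ranks below everything already held. */
bool may_acquire(enum lock_rank innermost_held, enum lock_rank wanted)
{
    return wanted > innermost_held;
}
```

Restarting a transaction while holding truncate_mutex corresponds to `may_acquire(RANK_TRUNCATE_MUTEX, RANK_TRANSACTION)`, which is false. Dropping the mutex first makes the restart legal, and re-taking the mutex afterwards follows the normal order.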
     
  • Enable removing of corrupted pages through truncation
    for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
    These should cover most server needs.

    I chose the set of migration aware file systems for this
    for now, assuming they have been especially audited.
    But in general it should be safe for all file systems
    on the data area that support read/write and truncate.

    Caveat: the hardware error handler does not take i_mutex
    for now before calling the truncate function. Is that ok?

    Cc: tytso@mit.edu
    Cc: hch@infradead.org
    Cc: mfasheh@suse.com
    Cc: aia21@cantab.net
    Cc: hugh.dickins@tiscali.co.uk
    Cc: swhiteho@redhat.com
    Signed-off-by: Andi Kleen

    Andi Kleen
     

14 Sep, 2009

1 commit


09 Sep, 2009

1 commit


24 Aug, 2009

2 commits

  • This patch makes the error message about changing journaling mode on remount
    more descriptive. Some people are going to hit this error now due to commit
    bbae8bcc49bc4d002221dab52c79a50a82e7cd1f if they configure a kernel to default
    to data=writeback mode. The problem happens if they have data=ordered set for
    the root filesystem in /etc/fstab but not in the kernel command line (and they
    don't use initrd). Their filesystem then gets mounted as data=writeback by
    the kernel but then their boot fails because init scripts won't be able to
    remount the filesystem rw. A better error message will hopefully make it
    easier for them to find the error in their setup and bother us less with
    error reports :).

    Signed-off-by: Jan Kara

    Jan Kara
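
The failure scenario above boils down to a mismatch check at remount time. The following is a minimal sketch: `remount_check_model` and the enum are hypothetical, and the `-EINVAL` return value is an assumption rather than something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/*
 * mounted:       mode the filesystem is currently using
 * has_request:   whether the remount options name a data= mode at all
 * requested:     that mode, if has_request is true
 * Returns 0 when the remount may proceed, a negative errno otherwise.
 */
int remount_check_model(enum data_mode mounted, bool has_request,
                        enum data_mode requested)
{
    if (has_request && requested != mounted)
        return -EINVAL;  /* journaling mode cannot change on remount */
    return 0;
}
```

In the scenario from the message, the kernel mounted the root filesystem as data=writeback and /etc/fstab asks for data=ordered, so the check fails and the remount to read/write is refused.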
     
  • The old description for this configuration option was perhaps not
    completely balanced in terms of describing the tradeoffs of using a
    default of data=writeback vs. data=ordered. Despite the fact that the old
    description very strongly recommended disabling this feature, all of
    the major distributions have elected to preserve the existing 'legacy'
    default, which is a strong hint that it perhaps wasn't telling the
    whole story.

    This revised description has been vetted by a number of ext3
    developers as being better at informing the user about the tradeoffs
    of enabling or disabling this configuration feature.

    Cc: linux-ext4@vger.kernel.org
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    Theodore Ts'o
     

16 Jul, 2009

2 commits

  • Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
    be a relic from some old days and setting disksize in this function does not
    make much sense. Currently it was set only by ext3_getblk(). Since the
    parameter has some effect only if create == 1, it is easy to check that the
    three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
    ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • The contents of long symlinks are written via standard write methods. So when the
    write fails, we add inode to orphan list. But symlinks don't have .truncate
    method defined so nobody properly removes them from the orphan list (both on
    disk and in memory).

    Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
    (which is saner anyway since we don't need anything vmtruncate() does except
    for calling .truncate in these paths). We also add inode to orphan list only
    if ext3_can_truncate() is true (currently, it can be false for symlinks when
    there are no blocks allocated) - otherwise orphan list processing will complain
    and ext3_truncate() will not remove inode from on-disk orphan list.

    Signed-off-by: Jan Kara

    Jan Kara
     

24 Jun, 2009

2 commits


20 Jun, 2009

1 commit


19 Jun, 2009

3 commits

  • Follow-up to "block: enable by default support for large devices
    and files on 32-bit archs".

    Rename CONFIG_LBD to CONFIG_LBDAF to:
    - allow update of existing [def]configs for "default y" change
    - reflect that it is used also for large files support nowadays

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Jens Axboe

    Bartlomiej Zolnierkiewicz
     
  • As Ted pointed out, it can happen that ext3_truncate() returns without
    removing inode from orphan list. This way we could in some rare cases
    (like when we get ENOMEM from an allocation in ext3_truncate called
    because of failed ext3_write_begin) leave the inode on orphan list and
    that triggers assertion failure on umount.

    So make ext3_truncate() always remove inode from in-memory orphan list.

    Cc: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Chain verification in ext3_get_blocks() has been hosed since it called
    verify_chain(chain, NULL) which always returns success. As a result
    readers could in theory race with truncate. On the other hand the race
    probably cannot happen with the current locking scheme, since by the
    time ext3_truncate() is called all the pages are already removed and
    hence get_block() shouldn't be called on such pages...

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
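
Why a NULL end pointer makes the check always succeed is easiest to see in a small model. The sketch below is an assumption: it uses array indices instead of pointers, with `last = -1` standing in for the NULL argument, and `indirect_model` and `verify_chain_model` are hypothetical names, not the kernel's definitions.

```c
/* One step of a block-lookup chain: the value read and where it was read from. */
struct indirect_model {
    unsigned        key;  /* block number seen when the chain was built */
    const unsigned *p;    /* slot it was read from */
};

/*
 * Returns 1 if every entry in chain[first..last] still matches its slot,
 * 0 if something (for example truncate) changed a slot in the meantime.
 */
int verify_chain_model(const struct indirect_model *chain, int first, int last)
{
    int i = first;

    while (i <= last && chain[i].key == *chain[i].p)
        i++;
    return i > last;
}
```

With `last = -1` the loop body never runs and the function returns 1 regardless of the chain contents, which mirrors the "always returns success" behaviour described in the message.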
     

17 Jun, 2009

1 commit

  • If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
    to do POSIX ACL checks on any files not owned by the caller, and it does
    this for every single pathname component that it looks up.

    That obviously can be pretty expensive if the filesystem isn't careful
    about it, especially with locking. That's doubly sad, since the common
    case tends to be that there are no ACL's associated with the files in
    question.

    ext3 already caches the ACL data so that it doesn't have to look it up
    over and over again, but it does so by taking the inode->i_lock spinlock
    on every lookup. Which is a noticeable overhead even if it's a private
    lock, especially on CPU's where the serialization is expensive (eg Intel
    Netburst aka 'P4').

    For the special case of not actually having any ACL's, all that locking is
    unnecessary. Even if somebody else were to be changing the ACL's on
    another CPU, we simply don't care - if we've seen a NULL ACL, we might as
    well use it.

    So just load the ACL speculatively without any locking, and if it was
    NULL, just use it. If it's non-NULL (either because we had a cached
    entry, or because the cache hasn't been filled in at all), it means that
    we'll need to get the lock and re-load it properly.

    This is noticeable even on Nehalem, which does locking quite well (much
    better than P4). From lmbench:

    Processor, Processes - times in microseconds - smaller is better
    --------------------------------------------------------------------
    Host      OS             Mhz null null      open slct fork exec sh
                                 call  I/O stat clos TCP  proc proc proc
    --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
    - before:
    nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
    nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
    nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
    - after:
    nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
    nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
    nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148

    where you can see what appears to be a roughly 3% improvement in stat
    and open/close latencies from just the removal of the locking overhead.

    Of course, this only matters for files you don't own (the owner never
    needs to do the ACL checks), but that's the common case for libraries,
    header files, and executables. As well as for the base components of any
    absolute pathname, even if you are the owner of the final file.

    [ At some point we probably want to move this ACL caching logic entirely
    into the VFS layer (and only call down to the filesystem when
    uncached), but in the meantime this improves ext3 a bit.

    A similar fix to btrfs makes a much bigger difference (15x improvement
    in lmbench) due to broken caching. ]

    Signed-off-by: Linus Torvalds
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Al Viro

    Linus Torvalds
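
The speculative lookup described above can be sketched as follows. This is a single-threaded model under stated assumptions: `acl_model`, `inode_model`, `ACL_NOT_CACHED` and `get_cached_acl_model` are hypothetical names, the spinlock is reduced to a counter, and a real implementation would need a properly annotated lock-free read.

```c
#include <stddef.h>

struct acl_model { int entries; };

/* Sentinel meaning "cache not filled in yet"; only its address matters. */
static struct acl_model acl_not_cached_marker;
#define ACL_NOT_CACHED (&acl_not_cached_marker)

struct inode_model {
    struct acl_model *cached_acl;  /* NULL, ACL_NOT_CACHED, or a real ACL */
    int               lock_taken;  /* counts simulated i_lock acquisitions */
};

struct acl_model *get_cached_acl_model(struct inode_model *inode)
{
    /* Speculative read without taking the lock. */
    struct acl_model *acl = inode->cached_acl;

    if (acl == NULL)
        return NULL;  /* "no ACL" can be used as-is, no locking needed */

    /* Non-NULL: take the lock and re-load the value properly. */
    inode->lock_taken++;
    acl = inode->cached_acl;
    /* the lock would be dropped here */
    return acl;
}
```

For an inode without ACLs, `lock_taken` stays at zero, which is the common case the commit optimizes. A caller that gets `ACL_NOT_CACHED` back still has to read the ACL from disk.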
     

12 Jun, 2009

5 commits

  • [xfs, btrfs, capifs, shmem don't need BKL, exempt]

    Signed-off-by: Alessio Igor Bogani
    Signed-off-by: Al Viro

    Alessio Igor Bogani
     
  • Note that since we can't run into contention between remount_fs and write_super
    (due to exclusion on s_umount), we have to care only about filesystems that
    touch lock_super() on their own. Out of those ext3, ext4, hpfs, sysv and ufs
    do need it; fat doesn't since its ->remount_fs() only accesses assign-once
    data (basically, it's "we have no atime on directories and only have atime on
    files for vfat; force nodiratime and possibly noatime into *flags").

    [folded a build fix from hch]

    Signed-off-by: Al Viro

    Al Viro
     
  • Move BKL into ->put_super from the only caller. A couple of
    filesystems had trivial enough ->put_super (only kfree and NULLing of
    s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
    hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
    of them probably don't need it, but I'd rather sort that out individually.
    Preferably after all the other BKL pushdowns in that area.

    [AV: original used to move lock_super() down as well; these changes are
    removed since we don't do lock_super() at all in generic_shutdown_super()
    now]
    [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
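
The distinction between the two sizes can be made concrete with a little arithmetic. This is a sketch with invented names (`block_sizes_model` and its helpers are hypothetical), and it assumes logical block 0 starts exactly on a physical block boundary.

```c
#include <stdbool.h>

struct block_sizes_model {
    unsigned logical_block_size;   /* unit used to address the device */
    unsigned physical_block_size;  /* unit the medium actually writes */
};

/* How many logical blocks make up one physical block. */
unsigned logical_per_physical(const struct block_sizes_model *q)
{
    return q->physical_block_size / q->logical_block_size;
}

/* Logical block address that contains a given byte offset. */
unsigned long long byte_offset_to_lba(const struct block_sizes_model *q,
                                      unsigned long long offset)
{
    return offset / q->logical_block_size;
}

/* True if a logical block address is the first one of a physical block. */
bool starts_physical_block(const struct block_sizes_model *q,
                           unsigned long long lba)
{
    return lba % logical_per_physical(q) == 0;
}
```

For the drive described in the message (512-byte logical blocks on 4KB physical blocks), eight logical blocks share one physical block, so only every eighth logical address starts a new physical block.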
     

18 May, 2009

1 commit


09 Apr, 2009

1 commit


07 Apr, 2009

1 commit

  • This makes the default ext3 data ordering mode (when no explicit
    ordering is set) configurable, so as to allow people to default to
    'data=writeback' and get the resulting latency improvements.

    This is a non-issue if a filesystem has been explicitly set to some
    ordering (with 'tune2fs').

    Signed-off-by: Linus Torvalds

    Linus Torvalds
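
How a configurable default interacts with explicit settings can be sketched as a precedence chain. The names below (`journal_mode`, `resolve_journal_mode`) are hypothetical, and the order shown, mount option first, then the setting stored with tune2fs, then the build-time default, is an assumption consistent with the message rather than a quote from the code.

```c
enum journal_mode {
    JM_UNSET = 0,   /* nothing specified at this level */
    JM_JOURNAL,
    JM_ORDERED,
    JM_WRITEBACK
};

/* Pick the effective data ordering mode from most to least specific source. */
enum journal_mode resolve_journal_mode(enum journal_mode mount_option,
                                       enum journal_mode superblock_default,
                                       enum journal_mode config_default)
{
    if (mount_option != JM_UNSET)
        return mount_option;
    if (superblock_default != JM_UNSET)
        return superblock_default;
    return config_default;  /* only reached when nothing explicit is set */
}
```

The new configuration option only changes the last argument, which is why a filesystem with an explicitly stored ordering is unaffected.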
     

04 Apr, 2009

1 commit


03 Apr, 2009

7 commits

  • In data=writeback mode, start an asynchronous flush when renaming a
    file on top of an already-existing file. This lowers the probability
    of data loss in the case of applications that attempt to replace a
    file via using rename().

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • In data=writeback mode, start an asynchronous flush when closing a
    file which had been previously truncated down to zero. This lowers
    the probability of data loss in the case of applications that attempt
    to replace a file using truncate.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
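
The truncate-then-close heuristic can be modelled as a flag that is set at truncate time and consumed at close time. This is a minimal sketch: `file_model`, `truncate_model` and `close_model` are hypothetical names, and the counter stands in for starting an asynchronous flush.

```c
#include <stdbool.h>

struct file_model {
    bool writeback_mode;   /* filesystem mounted with data=writeback */
    bool flush_on_close;   /* set when the file was truncated to zero */
    int  flushes_started;  /* counts simulated asynchronous flushes */
};

/* Remember that the file was truncated down to zero in writeback mode. */
void truncate_model(struct file_model *f, long new_size)
{
    if (new_size == 0 && f->writeback_mode)
        f->flush_on_close = true;
}

/* On close, start a flush once if the flag is set, then clear it. */
void close_model(struct file_model *f)
{
    if (f->flush_on_close) {
        f->flushes_started++;
        f->flush_on_close = false;
    }
}
```

A file that is truncated to zero, rewritten and closed triggers one flush, while ordinary closes and other journaling modes are left alone.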
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • Sometimes block_write_begin() can map buffers in a page but later we
    fail to copy data into those buffers (because the source page has been
    paged out in the mean time). We then end up with !uptodate mapped
    buffers. To add a bit more to the confusion, block_write_end() does
    not commit any data (and thus does not mark any buffers as uptodate) if
    we didn't succeed with copying all the data.

    Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new
    aops) missed these cases and thus we were inserting non-uptodate
    buffers to transaction's list which confuses JBD code and it reports IO
    errors, aborts a transaction and generally makes users afraid about
    their data ;-P.

    This patch fixes the problem by reorganizing ext3_..._write_end() code
    to first call block_write_end() to mark buffers with valid data
    uptodate and after that we file only uptodate buffers to transaction's
    lists.

    We also fix a problem where we could leave blocks allocated beyond i_size
    (i_disksize in fact) because of failed write. We now add inode to orphan
    list when write fails (to be safe in case we crash) and then truncate blocks
    beyond i_size in a separate transaction.

    Signed-off-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V
    Cc: Nick Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
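
The reordering described above, mark buffers uptodate first and only then file the uptodate ones, can be sketched as a filter over a page's buffers. This is a model with invented names (`buffer_model`, `file_uptodate_buffers`), not the ext3 write_end code.

```c
#include <stdbool.h>
#include <stddef.h>

struct buffer_model {
    bool mapped;               /* block_write_begin() mapped this buffer */
    bool uptodate;             /* data was actually copied in */
    bool on_transaction_list;  /* handed to the journal */
};

/*
 * File only buffers that hold valid data. Returns how many were filed.
 * Mapped buffers that are not uptodate are skipped so the journal
 * never sees them.
 */
size_t file_uptodate_buffers(struct buffer_model *bufs, size_t count)
{
    size_t filed = 0;

    for (size_t i = 0; i < count; i++) {
        if (bufs[i].mapped && bufs[i].uptodate) {
            bufs[i].on_transaction_list = true;
            filed++;
        }
    }
    return filed;
}
```

A buffer that was mapped but never filled because the copy failed stays off the transaction's list in this model.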
     
  • ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
    report errors to NFS properly. However, in ext[234]_lookup(), this
    -ESTALE can be propagated to userspace if the filesystem is corrupted such
    that a directory entry references a deleted inode. This leads to a
    misleading error message - "Stale NFS file handle" - and confusion on the
    part of the admin.

    The bug can be easily reproduced by creating a new filesystem, making a
    link to an unused inode using debugfs, then mounting and attempting to ls
    -l said link.

    This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
    from ext3_iget(), as ext3 does for other filesystem metadata corruption;
    and also invokes the appropriate ext*_error functions when this case is
    detected.

    Signed-off-by: Bryan Donlan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bryan Donlan
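
The error translation this patch describes is small enough to show directly. The function name `lookup_errno_model` is hypothetical, and the sketch leaves out the error reporting that the real change also performs.

```c
#include <errno.h>

/*
 * Translate the result of fetching an inode during lookup. A deleted
 * inode reached through a directory entry indicates on-disk corruption,
 * so it is reported as an I/O error instead of a stale NFS handle.
 */
int lookup_errno_model(int iget_err)
{
    if (iget_err == -ESTALE)
        return -EIO;
    return iget_err;  /* every other result passes through unchanged */
}
```

Userspace then sees "Input/output error" for the corrupted entry, while other errors keep their original meaning.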
     
  • Use unsigned instead of int for the parameter which carries a blocksize.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Wei Yongjun
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     
  • Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove
    the BKL around ext3_ioctl().

    Signed-off-by: Cyrus Massoumi
    Cc:
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrus Massoumi
     

01 Apr, 2009

1 commit


28 Mar, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
    ext2: Zero our b_size in ext2_quota_read()
    trivial: fix typos/grammar errors in fs/Kconfig
    quota: Coding style fixes
    quota: Remove superfluous inlines
    quota: Remove uppercase aliases for quota functions.
    nfsd: Use lowercase names of quota functions
    jfs: Use lowercase names of quota functions
    udf: Use lowercase names of quota functions
    ufs: Use lowercase names of quota functions
    reiserfs: Use lowercase names of quota functions
    ext4: Use lowercase names of quota functions
    ext3: Use lowercase names of quota functions
    ext2: Use lowercase names of quota functions
    ramfs: Remove quota call
    vfs: Use lowercase names of quota functions
    quota: Remove dqbuf_t and other cleanups
    quota: Remove NODQUOT macro
    quota: Make global quota locks cacheline aligned
    quota: Move quota files into separate directory
    ext4: quota reservation for delayed allocation
    ...

    Linus Torvalds