11 Feb, 2008

1 commit

  • Commit 8811930dc74a503415b35c4a79d14fb0b408a361 ("splice: missing user
    pointer access verification") added the proper access_ok() calls to
    copy_from_user_mmap_sem() which ensures we can copy the struct iovecs
    from userspace to the kernel.

    But we also must check whether we can access the actual memory region
    pointed to by the struct iovec to fix the access checks properly.

    Signed-off-by: Bastian Blank
    Acked-by: Oliver Pinter
    Cc: Jens Axboe
    Cc: Andrew Morton
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Bastian Blank
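
The second check described above (validating the memory each struct iovec points to, not only the iovec array itself) can be modelled in user space roughly as follows. This is only a sketch: the `demo_` names and the address limit are hypothetical stand-ins, and the kernel's access_ok() is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iovec, using integer addresses. */
struct demo_iovec {
    uintptr_t base;
    size_t len;
};

/* Hypothetical upper bound of the "user" address range. */
#define DEMO_USER_LIMIT ((uintptr_t)0x10000)

/* Overflow-safe check that [addr, addr + len) lies below the limit. */
static bool demo_range_ok(uintptr_t addr, size_t len)
{
    return addr <= DEMO_USER_LIMIT && len <= DEMO_USER_LIMIT - addr;
}

/* Check the region described by every iovec, not just the array. */
static bool demo_iovecs_ok(const struct demo_iovec *iov, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!demo_range_ok(iov[i].base, iov[i].len))
            return false;
    return true;
}
```

The subtraction form of the length test avoids the wraparound that a naive `addr + len <= limit` comparison would allow.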
     

10 Feb, 2008

9 commits

  • This flag is simply a generic "this is a crash/burn test filesystem"
    marker. If it is set, then filesystem code which is "in development"
    will be allowed to mount the filesystem. Filesystem code which is not
    considered ready for prime-time will check for this flag, and if it is
    not set, it will refuse to touch the filesystem.

    As we start rolling ext4 out to distros like Fedora, et al., this makes
    it less likely that a user might accidentally start using ext4 on a
    production filesystem; a bad thing, since that will essentially make it
    unfsckable until e2fsprogs catches up.

    Signed-off-by: Theodore Tso
    Signed-off-by: Mingming Cao

    Theodore Tso
     
  • The multiblock allocator calls BUG_ON in many cases if the free and used
    block counts obtained by looking at the bitmap differ from what
    the allocator internally accounted for. Use ext4_error in such cases
    and don't panic the system.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
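
The idea of reporting an accounting mismatch instead of crashing can be illustrated with a small user-space model. It is a sketch under assumptions: the `demo_` functions are hypothetical, the bitmap layout is simplified, and returning -EIO merely stands in for the ext4_error path.

```c
#include <errno.h>

/* Count clear (free) bits among the first nbits of a bitmap. */
static unsigned demo_count_free(const unsigned char *bitmap, unsigned nbits)
{
    unsigned free_bits = 0;

    for (unsigned i = 0; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            free_bits++;
    return free_bits;
}

/*
 * Compare the bitmap against the separately accounted free count.
 * On mismatch, return an error so only this operation fails,
 * rather than aborting the whole program.
 */
static int demo_check_group(const unsigned char *bitmap, unsigned nbits,
                            unsigned accounted_free)
{
    if (demo_count_free(bitmap, nbits) != accounted_free)
        return -EIO;
    return 0;
}
```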
     
  • struct ext4_allocation_context is rather large, and this bloats
    the stack of many functions which use it. Allocating it from
    a named slab cache will alleviate this.

    For example, with this change (on top of the noinline patch sent earlier):

    -ext4_mb_new_blocks 200
    +ext4_mb_new_blocks 40

    -ext4_mb_free_blocks 344
    +ext4_mb_free_blocks 168

    -ext4_mb_release_inode_pa 216
    +ext4_mb_release_inode_pa 40

    -ext4_mb_release_group_pa 192
    +ext4_mb_release_group_pa 24

    Most of these stack-allocated structs are actually used only for
    mballoc history; and in those cases often a smaller struct would do.
    So changing that may be another way around it, at least for those
    functions, if preferred. For now, in those cases where the ac
    is only for history, an allocation failure simply skips the history
    recording, and does not cause any other failures.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • In JBD2 jbd2_journal_write_commit_record(), clear the buffer_ordered
    flag for the bh after barrier IO has succeeded. Otherwise, if the same
    buffer head were later submitted to the underlying device after it had
    been reconfigured to not support barrier requests, the write would
    fail. With the flag cleared, the JBD2 commit code can treat it as a
    normal IO (without barrier).

    This is a port from JBD/ext3 fix from Neil Brown.

    More details from Neil:

    Some devices - notably dm and md - can change their behaviour in
    response to BIO_RW_BARRIER requests. They might start out accepting
    such requests but on reconfiguration, they find out that they cannot
    any more. JBD2 deals with this by always testing if BIO_RW_BARRIER
    requests fail with EOPNOTSUPP, and retrying the write
    requests without the barrier (probably after waiting for any pending
    writes to complete).

    However there is a bug in the handling of this in JBD2 for ext4.

    When ext4/JBD2 submits a BIO_RW_BARRIER request,
    it sets the buffer_ordered flag on the buffer head.
    If the request completes successfully, the flag STAYS SET.

    Other code might then write the same buffer_head after the device has
    been reconfigured to not accept barriers. This write will then fail,
    but the "other code" is not ready to handle EOPNOTSUPP errors and the
    error will be treated as fatal.

    Cc: Neil Brown
    Signed-off-by: Dave Kleikamp
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Dave Kleikamp
     
  • We cannot start a transaction in ext4_direct_IO() and just let it last
    during the whole write because dio_get_page() acquires mmap_sem which
    ranks above transaction start (e.g. because we have dependency chain
    mmap_sem->PageLock->journal_start, or because we update atime while
    holding mmap_sem) and thus deadlocks could happen. We solve the problem
    by starting a transaction separately for each ext4_get_block() call.

    We *could* have a problem that we allocate a block and before its data
    are written out the machine crashes and thus we expose stale data. But
    that does not happen because for hole-filling the generic code falls back
    to buffered writes, and for file extension we add the inode to the orphan
    list, so in case of a crash, journal replay will truncate the inode back
    to the original size.

    Signed-off-by: Jan Kara
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • In order to prevent a circular locking dependency when an unlink
    operation is racing with an ext4 migration, we delay taking i_data_sem
    until just before switching the inode format, and use i_mutex to prevent
    writes and truncates during the first part of the migration operation.

    Acked-by: Jan Kara
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Reported by Adrian Bunk:

    The Coverity checker spotted the following NULL dereference:

    static int ext4_mb_mark_diskspace_used
    {
    ...
        if (!bitmap_bh)
            goto out_err;
    ...
    out_err:
        sb->s_dirt = 1;
        put_bh(bitmap_bh);
    ...

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao

    Aneesh Kumar K.V
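
The pattern Coverity flagged, a shared error label that releases a pointer which may still be NULL, can be sketched in user space as below. The `demo_` types and functions are hypothetical; the snippet shows one common remedy (a NULL-tolerant release helper) and is not necessarily the exact change made in this commit.

```c
#include <errno.h>
#include <stddef.h>

struct demo_bh {
    int refcount;
};

/* Like a raw "put": dereferences its argument, so it must not be NULL. */
static void demo_put(struct demo_bh *bh)
{
    bh->refcount--;
}

/* NULL-tolerant release, safe to call from a shared error path. */
static void demo_release(struct demo_bh *bh)
{
    if (bh)
        demo_put(bh);
}

static int demo_mark_used(struct demo_bh *bitmap_bh)
{
    int err = -EIO;

    if (!bitmap_bh)
        goto out_err;       /* bitmap_bh is NULL on this path */
    err = 0;
out_err:
    demo_release(bitmap_bh); /* calling demo_put() here would crash */
    return err;
}
```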
     
  • - remove non-standard in/out markers
    - use tabs for formatting

    Signed-off-by: Christoph Hellwig
    Cc: "Randy.Dunlap"
    Cc: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/hostfs/hostfs_kern.c: In function 'hostfs_show_options':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/hostfs/hostfs_kern.c:328: error: dereferencing pointer to incomplete type

    We need to include mount.h to get vfsmount.

    Signed-off-by: Jiri Kosina
    Reported-by: Adrian Bunk
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

09 Feb, 2008

24 commits

  • Revert commit c6caeb7c4544608e8ae62731334661fc396c7f85 ("proc: fix the
    threaded /proc/self"), since Eric says "The patch really is wrong.
    There is at least one corner case in procps that cares."

    Cc: Eric W. Biederman
    Cc: Ingo Molnar
    Cc: "Guillaume Chazarain"
    Cc: "Pavel Emelyanov"
    Cc: "Rafael J. Wysocki"
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    Enhanced partition statistics: documentation update
    Enhanced partition statistics: remove old partition statistics
    Enhanced partition statistics: procfs
    Enhanced partition statistics: sysfs
    Enhanced partition statistics: aoe fix
    Enhanced partition statistics: update partition statitics
    Enhanced partition statistics: core statistics
    block: fixup rq_init() a bit

    Manually fixed conflict in drivers/block/aoe/aoecmd.c due to statistics
    support.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
    dlm: add __init and __exit marks to init and exit functions
    dlm: eliminate astparam type casting
    dlm: proper types for asts and basts
    dlm: dlm/user.c input validation fixes
    dlm: fix dlm_dir_lookup() handling of too long names
    dlm: fix overflows when copying from ->m_extra to lvb
    dlm: make find_rsb() fail gracefully when namelen is too large
    dlm: receive_rcom_lock_args() overflow check
    dlm: verify that places expecting rcom_lock have packet long enough
    dlm: validate data in dlm_recover_directory()
    dlm: missing length check in check_config()
    dlm: use proper type for ->ls_recover_buf
    dlm: do not byteswap rcom_config
    dlm: do not byteswap rcom_lock
    dlm: dlm_process_incoming_buffer() fixes
    dlm: use proper C for dlm/requestqueue stuff (and fix alignment bug)

    Linus Torvalds
     
  • vmsplice_to_user() must always check the user pointer and length
    with access_ok() before copying. Likewise, for the slow path of
    copy_from_user_mmap_sem() we need to check that we may read from
    the user region.

    Signed-off-by: Jens Axboe
    Cc: Wojciech Purczynski
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
  • Turn off quotas before filesystem is remounted read only. Otherwise quota
    will try to write to read-only filesystem which does no good... We could
    also just refuse to remount ro when quota is enabled but turning quota off
    is consistent with what we do on umount.

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Some devices - notably dm and md - can change their behaviour in response
    to BIO_RW_BARRIER requests. They might start out accepting such requests
    but on reconfiguration, they find out that they cannot any more.

    ext3 (and other filesystems) deal with this by always testing if
    BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
    requests without the barrier (probably after waiting for any pending writes
    to complete).

    However there is a bug in the handling for this for ext3.

    When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request, it
    sets the buffer_ordered flag on the buffer head. If the request completes
    successfully, the flag STAYS SET.

    Other code might then write the same buffer_head after the device has been
    reconfigured to not accept barriers. This write will then fail, but the
    "other code" is not ready to handle EOPNOTSUPP errors and the error will be
    treated as fatal.

    This can be seen without having to reconfigure a device at exactly the
    wrong time by putting:

    if (buffer_ordered(bh))
        printk("OH DEAR, and ordered buffer\n");

    in the while loop in "commit phase 5" of journal_commit_transaction.

    If it ever prints the "OH DEAR ..." message (as it does sometimes for
    me), then that request could (in different circumstances) have failed
    with EOPNOTSUPP, but that isn't tested for.

    My proposed fix is to clear the buffer_ordered flag after it has been
    used, as in the following patch.

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
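
The stale-flag problem described above can be modelled in a few lines of user-space C. This is only a sketch of the idea, not the patch itself: the `demo_` structures and functions are hypothetical, and a boolean stands in for the buffer_ordered bit.

```c
#include <errno.h>
#include <stdbool.h>

struct demo_buffer {
    bool ordered;           /* stands in for the buffer_ordered flag */
};

struct demo_device {
    bool supports_barrier;  /* may change when the device is reconfigured */
};

/* A flagged buffer is sent as a barrier request, which may be rejected. */
static int demo_submit(const struct demo_device *dev,
                       const struct demo_buffer *bh)
{
    if (bh->ordered && !dev->supports_barrier)
        return -EOPNOTSUPP;
    return 0;
}

/* Commit path: try a barrier write, fall back to a plain write. */
static int demo_commit_write(const struct demo_device *dev,
                             struct demo_buffer *bh)
{
    int ret;

    bh->ordered = true;
    ret = demo_submit(dev, bh);
    /* Clear the flag once the barrier attempt is over, success or not. */
    bh->ordered = false;
    if (ret == -EOPNOTSUPP)
        ret = demo_submit(dev, bh);
    return ret;
}

/* "Other code" that writes the buffer and cannot handle EOPNOTSUPP. */
static int demo_plain_write(const struct demo_device *dev,
                            const struct demo_buffer *bh)
{
    return demo_submit(dev, bh);
}
```

If the flag were left set after a successful barrier write, `demo_plain_write()` would start failing as soon as the device stopped accepting barriers.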
     
  • do_mount() uses a whopping 616 bytes of stack on x86_64 in 2.6.24-mm1,
    largely thanks to gcc inlining the various helper functions.

    noinlining these can slim it down a lot; on my box this patch gets it down
    to 168, which is mostly the struct nameidata nd; left on the stack.

    These functions are called only as do_mount() helpers; none of them should
    be in any path that would see a performance benefit from inlining...

    Signed-off-by: Eric Sandeen
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
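
As a rough illustration of the technique, a helper with a large local can be kept out of its caller's stack frame by marking it noinline. The sketch below assumes GCC or Clang (the attribute is compiler-specific), and the `demo_` names and the 256-byte scratch buffer are made up for the example.

```c
/* Compiler-specific attribute (GCC/Clang); an assumption of this sketch. */
#define DEMO_NOINLINE __attribute__((noinline))

struct demo_big {
    unsigned char scratch[256];
};

/*
 * The large local lives only in this helper's frame. Because the helper
 * is never inlined, the caller's frame does not grow to hold it.
 */
static DEMO_NOINLINE int demo_helper(int x)
{
    struct demo_big tmp;
    int sum = 0;

    for (unsigned i = 0; i < sizeof(tmp.scratch); i++) {
        tmp.scratch[i] = (unsigned char)x;
        sum += tmp.scratch[i];
    }
    return sum;
}

static int demo_caller(int x)
{
    return demo_helper(x) + 1;
}
```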
     
  • An outdated comment in serial_core.c is also fixed.

    Signed-off-by: Denis Cheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Cheng
     
  • There are two possible races in handling of private_list in buffer cache.

    1) When fsync_buffers_list() processes a private_list, it clears
    b_assoc_map and moves the buffer to its private list. Now
    drop_buffers() comes, sees that the buffer is on a list, and calls
    __remove_assoc_queue(), which complains about b_assoc_map being
    cleared (as it cannot propagate a possible IO error). This race has
    actually been observed in the wild.

    2) When fsync_buffers_list() processes a private_list,
    mark_buffer_dirty_inode() can be called on bh which is already on the
    private list of fsync_buffers_list(). As buffer is on some list (note
    that the check is performed without private_lock), it is not readded to
    the mapping's private_list and after fsync_buffers_list() finishes, we
    have a dirty buffer which should be on private_list but it isn't. This
    race has not been reported, probably because most (but not all) callers
    of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
    fsync().

    Fix these issues by not clearing b_assoc_map when fsync_buffers_list()
    moves buffer to a dedicated list and by reinserting buffer in private_list
    when it is found dirty after we have submitted buffer for IO. We also
    change the tests whether a buffer is on a private list from
    !list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map so that they are
    single word reads and hence lockless checks are safe.

    Signed-off-by: Jan Kara
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Following the deprecation schedule, a.out ELF interpreter support
    is removed with this patch. a.out ELF interpreters were a transitional
    feature for moving a.out systems to ELF, but they're unlikely to be still
    needed. Pure a.out systems will still work of course. This allows
    simplifying the hairy ELF loader.

    Signed-off-by: Andi Kleen
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add a .show_options super operation to udf.

    Signed-off-by: Miklos Szeredi
    Acked-by: Cyrill Gorcunov
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to reiserfs.

    Use generic_show_options() and save the complete option string in
    reiserfs_fill_super() and reiserfs_remount().

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
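
The "save the complete option string, then print it back" approach used by this and several of the following commits can be sketched in user space as below. The `demo_` names are hypothetical and the leading-comma output format is an assumption of the sketch; the kernel's generic_show_options() is not reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct demo_super {
    char *saved_opts;   /* copy of the option string given at mount time */
};

/* Called from the fill_super/remount step: remember the raw options. */
static int demo_save_options(struct demo_super *sb, const char *opts)
{
    char *copy = NULL;

    if (opts) {
        size_t n = strlen(opts) + 1;

        copy = malloc(n);
        if (!copy)
            return -1;
        memcpy(copy, opts, n);
    }
    free(sb->saved_opts);
    sb->saved_opts = copy;
    return 0;
}

/* Called when listing mounts: emit ",<options>" or nothing at all. */
static int demo_show_options(const struct demo_super *sb,
                             char *buf, size_t len)
{
    if (!sb->saved_opts || !sb->saved_opts[0]) {
        if (len)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, ",%s", sb->saved_opts);
}
```

Saving the string verbatim keeps each filesystem's change small, at the cost of echoing options exactly as the user typed them rather than in a normalised form.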
     
  • Add a .show_options super operation to ncpfs.

    Small fix: add FS_BINARY_MOUNTDATA to the filesystem type flags, since
    it can take binary data, as well as text (similarly to NFS).

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to isofs.

    Use generic_show_options() and save the complete option string in
    isofs_fill_super().

    Signed-off-by: Miklos Szeredi
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to hugetlbfs.

    Use generic_show_options() and save the complete option string in
    hugetlbfs_fill_super().

    Signed-off-by: Miklos Szeredi
    Cc: Adam Litke
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: William Lee Irwin III
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to hpfs.

    Use generic_show_options() and save the complete option string in
    hpfs_fill_super() and hpfs_remount_fs().

    Also add a small fix: hpfs_remount_fs() should return -EINVAL on
    error, instead of 1, which is not an error value.

    Signed-off-by: Miklos Szeredi
    Cc: Mikulas Patocka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add the "host path" option to /proc/mounts for UML hostfs filesystems.

    The mount source (mnt_devname) should really be used for this, but that
    is not easy to change now in a backward compatible way.

    Signed-off-by: Miklos Szeredi
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add blksize= option to /proc/mounts for fuseblk filesystems.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add flush option to /proc/mounts for msdos and vfat filesystems.

    Signed-off-by: Miklos Szeredi
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add noreservation option to /proc/mounts for ext2 filesystems.

    Signed-off-by: Miklos Szeredi
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to devpts.

    Small cleanup: when parsing the "mode" option, mask with S_IALLUGO
    instead of ~S_IFMT.

    Signed-off-by: Miklos Szeredi
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
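
The difference between the two masks in the cleanup above shows up when a parsed value has bits set above the file-type field. A small sketch, with the constants redefined locally under hypothetical `DEMO_` names so it builds outside the kernel:

```c
/* Local copies of the usual octal values; DEMO_ names are hypothetical. */
#define DEMO_S_IFMT    0170000u  /* file-type bits */
#define DEMO_S_IALLUGO 0007777u  /* setuid, setgid, sticky, rwx for u/g/o */

/* Old behaviour: drop only the file-type bits. */
static unsigned demo_mask_old(unsigned value)
{
    return value & ~DEMO_S_IFMT;
}

/* New behaviour: keep only the permission bits. */
static unsigned demo_mask_new(unsigned value)
{
    return value & DEMO_S_IALLUGO;
}
```

For an ordinary value such as 0100755 both masks give 0755; they differ only when stray high bits are present, which the second mask discards.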
     
  • Add a .show_options super operation to befs.

    Use generic_show_options() and save the complete option string in
    befs_fill_super().

    Signed-off-by: Miklos Szeredi
    Cc: Sergey S. Kostyliov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a .show_options super operation to autofs.

    Use generic_show_options() and save the complete option string in
    autofs_fill_super().

    Signed-off-by: Miklos Szeredi
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add uid= and gid= options to /proc/mounts for autofs4 filesystems.

    Signed-off-by: Miklos Szeredi
    Acked-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

06 Feb, 2008

1 commit

  • The ext3 root inode was treated specially with respect
    to in-inode extended attributes, for reasons detailed
    in the removed comment below. The first mkfs-created
    inodes would not get extra_i_size or the EXT3_STATE_XATTR
    flag set in ext3_read_inode, which disallowed reading or
    setting in-inode EAs on the root.

    However, in ext4, ext4_mark_inode_dirty calls
    ext4_expand_extra_isize for all inodes; once this is done
    EAs may be placed in the root ext4 inode body.

    But for the reasons above, it won't be found after a reboot.

    testcase:

    setfattr -n user.name -v value mntpt/
    setfattr -n user.name2 -v value2 mntpt/
    umount mntpt/; remount mntpt/
    getfattr -d mntpt/

    name2/value2 has gone missing; debugfs shows it in the
    inode body, but it is not found there by getfattr.

    The following fixes it up; newer mkfs appears to properly
    zero the inodes, so this workaround isn't needed for ext4.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o

    Eric Sandeen
     

05 Feb, 2008

4 commits

  • For fast symbolic links, the file content is stored in the i_block[]
    array, which is not compatible with the new file extents format.
    e2fsck reports error on such files because EXTENTS_FL is set.
    Don't set the EXTENTS_FL flag when creating fast symlinks.

    In the case of file migration, skip fast symbolic links.

    Signed-off-by: Valerie Clement
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Valerie Clement
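
The rule in this commit (only symlinks whose target does not fit in i_block[] should carry the extents flag) can be sketched as follows. The `demo_` names are hypothetical, and the 60-byte i_block size and the flag value are assumptions of the sketch rather than values taken from this log.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_I_BLOCK_BYTES 60u       /* assumed: 15 four-byte i_block slots */
#define DEMO_EXTENTS_FL    0x80000u  /* assumed value of the extents flag */

struct demo_inode {
    unsigned flags;
};

/* A "fast" symlink stores its target, including the NUL, in i_block[]. */
static bool demo_is_fast_symlink(const char *target)
{
    return strlen(target) + 1 <= DEMO_I_BLOCK_BYTES;
}

/* Set the extents flag only when i_block[] will really hold extents. */
static void demo_init_symlink(struct demo_inode *inode, const char *target,
                              bool fs_uses_extents)
{
    inode->flags = 0;
    if (fs_uses_extents && !demo_is_fast_symlink(target))
        inode->flags |= DEMO_EXTENTS_FL;
}
```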
     
  • JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT needs to be checked with
    JBD2_HAS_INCOMPAT_FEATURE

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
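
Why the choice of check matters: compat and incompat feature bits live in separate words, so testing an incompat bit against the compat word silently gives the wrong answer. A minimal sketch with hypothetical `demo_` names, an assumed bit value, and on-disk endianness ignored:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified superblock with separate feature words (host-endian here). */
struct demo_jsb {
    uint32_t feature_compat;
    uint32_t feature_incompat;
};

/* Assumed bit value, for illustration only. */
#define DEMO_INCOMPAT_ASYNC_COMMIT 0x4u

static bool demo_has_compat(const struct demo_jsb *sb, uint32_t mask)
{
    return (sb->feature_compat & mask) != 0;
}

static bool demo_has_incompat(const struct demo_jsb *sb, uint32_t mask)
{
    return (sb->feature_incompat & mask) != 0;
}
```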
     
  • With journal checksum patch we added asynchronous commits of journal
    commit headers, and accidentally dropped taking a reference on the
    buffer head.

    (Before the change, sync_dirty_buffer did the get_bh(). The matching
    put_bh() is done by journal_wait_on_commit_record().)

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • The buffer head pointer passed to journal_wait_on_commit_record() could
    be NULL if the previous journal_submit_commit_record() failed or the
    journal has already aborted.

    Looking at the jbd2 debug messages, before the oops happened, the
    journal was aborted due to trying to access the next log block beyond
    the end of the device. This might be caused by using a corrupted image.

    We need to check the error returns from journal_submit_commit_record()
    and avoid calling journal_wait_on_commit_record() in the failure case.

    This addresses Kernel Bugzilla #9849

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
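
The control flow of the fix, checking the result of the submit step and skipping the wait step when it failed, can be sketched in user space like this. All `demo_` names are hypothetical and the journal state is reduced to a single flag.

```c
#include <errno.h>
#include <stddef.h>

struct demo_bh {
    int done;
};

/* Submit step: on failure no buffer is handed back (*out stays NULL). */
static int demo_submit_commit(int journal_aborted, struct demo_bh *storage,
                              struct demo_bh **out)
{
    *out = NULL;
    if (journal_aborted)
        return -EIO;
    storage->done = 1;
    *out = storage;
    return 0;
}

/* Wait step: dereferences bh, so it must never be given NULL. */
static int demo_wait_commit(struct demo_bh *bh)
{
    return bh->done ? 0 : -EIO;
}

static int demo_commit(int journal_aborted)
{
    struct demo_bh storage = { 0 };
    struct demo_bh *cbh;
    int err;

    err = demo_submit_commit(journal_aborted, &storage, &cbh);
    if (err)
        return err;         /* skip the wait: cbh is NULL here */
    return demo_wait_commit(cbh);
}
```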
     

01 Feb, 2008

1 commit