03 Dec, 2011

1 commit


02 Dec, 2011

3 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (31 commits)
    ocfs2: avoid unaligned access to dqc_bitmap
    ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
    ocfs2: honor O_(D)SYNC flag in fallocate
    ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
    ocfs2: send correct UUID to cleancache initialization
    ocfs2: Commit transactions in error cases -v2
    ocfs2: make direntry invalid when deleting it
    fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
    ocfs2: Avoid livelock in ocfs2_readpage()
    ocfs2: serialize unaligned aio
    ocfs2: Implement llseek()
    ocfs2: Fix ocfs2_page_mkwrite()
    ocfs2: Add comment about orphan scanning
    ocfs2: Clean up messages in the fs
    ocfs2/cluster: Cluster up now includes network connections too
    ocfs2/cluster: Add new function o2net_fill_node_map()
    ocfs2/cluster: Fix output in file elapsed_time_in_ms
    ocfs2/dlm: dlmlock_remote() needs to account for remastery
    ocfs2/dlm: Take inflight reference count for remotely mastered resources too
    ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery()
    ...

    Linus Torvalds
     
  • The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
    but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(),
    ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These
    are wrapper macros for ext2_*_bit() which need to take an unsigned long
    aligned address (though some architectures are able to handle unaligned
    addresses correctly).

    So some 64bit architectures may not be able to access the dqc_bitmap
    correctly.

    This avoids such unaligned access by using other wrapper functions for
    ext2_*_bit(). The code is taken from fs/ext4/mballoc.c, which also needs to
    handle unaligned bitmap access.

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Joel Becker

    Akinobu Mita
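
The alignment fix described in this commit message can be illustrated with a small userspace sketch. It is not the kernel code: the `sketch_*` names are hypothetical, and the bit test is done byte by byte (little-endian bit numbering) so that it runs anywhere. The idea shown is only the address correction: round the address down to an unsigned long boundary and add the skipped bits to the bit number.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Round addr down to an unsigned long boundary and increase *bit by the
 * number of bits that were skipped, so that (returned address, *bit)
 * names the same bit as (addr, original bit) in little-endian numbering.
 */
static void *sketch_correct_addr_and_bit(int *bit, void *addr)
{
	uintptr_t misalign = (uintptr_t)addr & (sizeof(unsigned long) - 1);

	*bit += (int)(misalign * CHAR_BIT);
	return (void *)((uintptr_t)addr - misalign);
}

/* Hypothetical stand-in for an ext2-style test_bit: byte-wise, little-endian. */
static int sketch_test_bit_le(int bit, const void *addr)
{
	const unsigned char *p = addr;

	return (p[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
}

/* Wrapper that only ever hands an aligned address to the bit helper. */
static int sketch_test_bit_unaligned(int bit, void *addr)
{
	addr = sketch_correct_addr_and_bit(&bit, addr);
	assert(((uintptr_t)addr % sizeof(unsigned long)) == 0);
	return sketch_test_bit_le(bit, addr);
}
```

Calling `sketch_test_bit_unaligned(11, buf + 4)` on a buffer that is aligned for unsigned long inspects the same bit as an unaligned access would, but through an aligned base address.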
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix meta data raid-repair merge problem
    Btrfs: skip allocation attempt from empty cluster
    Btrfs: skip block groups without enough space for a cluster
    Btrfs: start search for new cluster at the beginning
    Btrfs: reset cluster's max_size when creating bitmap
    Btrfs: initialize new bitmaps' list
    Btrfs: fix oops when calling statfs on readonly device
    Btrfs: Don't error on resizing FS to same size
    Btrfs: fix deadlock on metadata reservation when evicting a inode
    Fix URL of btrfs-progs git repository in docs
    btrfs scrub: handle -ENOMEM from init_ipath()

    Linus Torvalds
     

01 Dec, 2011

10 commits

  • Commit 4a54c8c16 introduced raid-repair, killing the individual
    readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
    4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
    disk-io.c.

    The raid-repair commit had logic to disable raid-repair, if
    readpage_io_failed_hook is set. Thus, the readahead commit effectively
    disabled raid-repair for meta data.

    This commit changes the logic to always attempt raid-repair when needed and
    call the readpage_io_failed_hook in case raid-repair fails. This is much
    more straightforward and should have been like that from the beginning.

    Signed-off-by: Jan Schmidt
    Reported-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • If we don't have a cluster, don't bother trying to allocate from it,
    jumping right away to the attempt to allocate a new cluster.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • We test whether a block group has enough free space to hold the
    requested block, but when we're doing clustered allocation, we can
    save some cycles by testing whether it has enough room for the cluster
    upfront, otherwise we end up attempting to set up a cluster and
    failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
    allocation, and by then we'll have zeroed the cluster size, so this
    patch won't stop us from using the block group as a last resort.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • Instead of starting at zero (offset is always zero), request a cluster
    starting at search_start, that denotes the beginning of the current
    block group.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • The field that indicates the size of the largest contiguous chunk of
    free space in the cluster is not initialized when setting up bitmaps,
    it's only increased when we find a larger contiguous chunk. We end up
    retaining a larger value than appropriate for highly-fragmented
    clusters, which may cause pointless searches for large contiguous
    groups, and even cause clusters that do not meet the density
    requirements to be set up.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • We're failing to create clusters with bitmaps because
    setup_cluster_no_bitmap checks that the list is empty before inserting
    the bitmap entry in the list for setup_cluster_bitmap, but the list
    field is only initialized when it is restored from the on-disk free
    space cache, or when it is written out to disk.

    Besides a potential race condition due to the multiple use of the list
    field, filesystem performance severely degrades over time: as we use
    up all non-bitmap free extents, the try-to-set-up-cluster dance is
    done at every metadata block allocation. For every block group, we
    fail to set up a cluster, and after failing on them all up to twice,
    we fall back to the much slower unclustered allocation.

    To make matters worse, before the unclustered allocation, we try to
    create new block groups until we reach the 1% threshold, which
    introduces additional bitmaps and thus block groups that we'll iterate
    over at each metadata block request.

    Alexandre Oliva
     
  • To reproduce this bug:

    # dd if=/dev/zero of=img bs=1M count=256
    # mkfs.btrfs img
    # losetup -r /dev/loop1 img
    # mount /dev/loop1 /mnt
    OOPS!!

    It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

    To fix this, instead of checking write-only devices, we check all open
    devices:

    # df -h /dev/loop1
    Filesystem Size Used Avail Use% Mounted on
    /dev/loop1 250M 28K 238M 1% /mnt

    Signed-off-by: Li Zefan

    Li Zefan
     
  • It seems overly harsh to fail a resize of a btrfs file system to the
    same size when a shrink or grow would succeed. User app GParted trips
    over this error. Allow it by bypassing the shrink or grow operation.

    Signed-off-by: Mike Fleetwood

    Mike Fleetwood
     
  • When I ran the xfstests, I found that the test tasks were blocked on meta-data
    reservation.

    By debugging, I found the cause of this bug:
    start transaction
           |
           v
    reserve meta-data space
           |
           v
    flush delay allocation -> iput inode -> evict inode
           ^                                    |
           |                                    v
    wait for delay allocation flush

    Miao Xie
     
  • init_ipath() can return an ERR_PTR(-ENOMEM).

    Signed-off-by: Dan Carpenter

    Dan Carpenter
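
The error-pointer convention this commit message refers to can be sketched in userspace as follows. The helpers are simplified re-implementations of the idea behind the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), not the kernel definitions, and `sketch_init_ipath()` is a hypothetical stand-in for init_ipath(). The point is that the caller must test for an encoded error, because a NULL check would not catch it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *sketch_err_ptr(long error)
{
	assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

struct sketch_ipath {
	int unused;
};

/* Hypothetical allocator that reports failure as an error pointer. */
static struct sketch_ipath *sketch_init_ipath(int simulate_oom)
{
	struct sketch_ipath *ipath;

	if (simulate_oom)
		return sketch_err_ptr(-ENOMEM);
	ipath = calloc(1, sizeof(*ipath));
	if (!ipath)
		return sketch_err_ptr(-ENOMEM);
	return ipath;
}

/* Caller: check for an encoded error before using the result. */
static int sketch_scrub_step(int simulate_oom)
{
	struct sketch_ipath *ipath = sketch_init_ipath(simulate_oom);

	if (sketch_is_err(ipath))
		return (int)sketch_ptr_err(ipath);
	free(ipath);
	return 0;
}
```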
     

30 Nov, 2011

3 commits

  • With Dmitry's fsstress updates I've seen very reproducible crashes in
    xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims that
    the attributes would not fit inline into the inode after removing an
    attribute. It turns out that we were operating on an inode with lots
    of delalloc extents, and thus an if_bytes value for the data fork that
    is larger than the biggest possible on-disk storage for it, which utterly
    confuses the code near the end of xfs_attr_shortform_bytesfit.

    Fix this by always allowing the current attribute fork, like we already
    do for the attr1 format, given that delalloc conversion will take care
    of moving either the data or attribute area out of line if it doesn't
    fit at that point - or making the point moot by merging extents at this
    point.

    Also document the function better, and clean up some loose bits.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encounter a flush locked inode,
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can get hangs of up to 30 seconds during workloads hitting
    synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • * 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix racy use-after-free in ext4_end_io_dio()

    Linus Torvalds
     

29 Nov, 2011

2 commits


25 Nov, 2011

1 commit

  • ext4_end_io_dio() queues io_end->work and then clears iocb->private;
    however, io_end->work calls aio_complete() which frees the iocb
    object. If that slab object gets reallocated, then ext4_end_io_dio()
    can end up clearing someone else's iocb->private. This use-after-free
    can cause a leak of a struct ext4_io_end_t structure.

    Detected and tested with slab poisoning.

    [ Note: Can also reproduce using 12 fio's against 12 file systems with the
    following configuration file:

    [global]
    direct=1
    ioengine=libaio
    iodepth=1
    bs=4k
    ba=4k
    size=128m

    [create]
    filename=${TESTDIR}
    rw=write

    -- tytso ]

    Google-Bug-Id: 5354697
    Signed-off-by: Tejun Heo
    Signed-off-by: "Theodore Ts'o"
    Reported-by: Kent Overstreet
    Tested-by: Kent Overstreet
    Cc: stable@kernel.org

    Tejun Heo
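
The ordering problem described in this commit message can be sketched as follows. This is a userspace illustration with hypothetical `sketch_*` types, not the ext4 code, and it assumes the natural remedy of finishing every access to the iocb before the work item is queued. The work runs synchronously here to model the worst case, where the iocb is freed as soon as the work is handed off.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_iocb {
	void *private_data;		/* points at the pending io_end */
};

struct sketch_io_end {
	struct sketch_iocb *iocb;
	int completed;
};

/* Stand-in for the work item: completing the aio frees the iocb. */
static void sketch_end_io_work(struct sketch_io_end *io_end)
{
	free(io_end->iocb);
	io_end->iocb = NULL;
	io_end->completed = 1;
}

/* Runs the work immediately, i.e. the iocb may be gone on return. */
static void sketch_queue_work(struct sketch_io_end *io_end)
{
	sketch_end_io_work(io_end);
}

static void sketch_end_io_dio(struct sketch_iocb *iocb)
{
	struct sketch_io_end *io_end = iocb->private_data;

	assert(io_end != NULL);

	/* Clear the field first: after queueing, iocb may already be freed. */
	iocb->private_data = NULL;
	sketch_queue_work(io_end);
	/* iocb must not be touched past this point. */
}
```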
     

24 Nov, 2011

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: Extend array bounds for all filename chars
    eCryptfs: Flush file in vma close
    eCryptfs: Prevent file create race condition

    Linus Torvalds
     
  • From mhalcrow's original commit message:

    Characters with ASCII values greater than the size of
    filename_rev_map[] are valid filename characters.
    ecryptfs_decode_from_filename() will access kernel memory beyond
    that array, and ecryptfs_parse_tag_70_packet() will then decrypt
    those characters. The attacker, using the FNEK of the crafted file,
    can then re-encrypt the characters to reveal the kernel memory past
    the end of the filename_rev_map[] array. I expect low security
    impact since this array is statically allocated in the text area,
    and the amount of memory past the array that is accessible is
    limited by the largest possible ASCII filename character.

    This patch solves the issue reported by mhalcrow but with an
    implementation suggested by Linus to simply extend the length of
    filename_rev_map[] to 256. Characters greater than 0x7A are mapped to
    0x00, which is how invalid characters less than 0x7A were previously
    being handled.

    Signed-off-by: Tyler Hicks
    Reported-by: Michael Halcrow
    Cc: stable@kernel.org

    Tyler Hicks
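
The approach described in this commit message, a reverse map with one entry for every possible byte value, can be sketched as below. The alphabet and the `sketch_*` names are illustrative assumptions rather than the eCryptfs tables. Any byte that is not in the alphabet, including every byte above 0x7A, decodes to 0x00, and no lookup can index past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 64-character filename alphabet (an assumption of this sketch). */
static const char sketch_alphabet[] =
	"-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* One entry per possible byte value; static storage starts out zeroed. */
static unsigned char sketch_rev_map[256];

static void sketch_build_rev_map(void)
{
	size_t i;

	assert(sizeof(sketch_alphabet) - 1 == 64);
	for (i = 0; sketch_alphabet[i] != '\0'; i++)
		sketch_rev_map[(unsigned char)sketch_alphabet[i]] =
			(unsigned char)i;
}

static unsigned char sketch_decode_char(char c)
{
	/*
	 * Index through unsigned char so that bytes >= 0x80 select
	 * entries 128..255 instead of a negative offset.
	 */
	return sketch_rev_map[(unsigned char)c];
}
```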
     
  • Dirty pages weren't being written back when an mmap'ed eCryptfs file was
    closed before the mapping was unmapped. Since f_ops->flush() is not
    called by the munmap() path, the lower file was simply being released.
    This patch flushes the eCryptfs file in the vm_ops->close() path.

    https://launchpad.net/bugs/870326

    Signed-off-by: Tyler Hicks
    Cc: stable@kernel.org [2.6.39+]

    Tyler Hicks
     
  • The file creation path prematurely called d_instantiate() and
    unlock_new_inode() before the eCryptfs inode info was fully
    allocated and initialized and before the eCryptfs metadata was written
    to the lower file.

    This could result in race conditions in subsequent file and inode
    operations leading to unexpected error conditions or a null pointer
    dereference while attempting to use the unallocated memory.

    https://launchpad.net/bugs/813146

    Signed-off-by: Tyler Hicks
    Cc: stable@kernel.org

    Tyler Hicks
     

23 Nov, 2011

4 commits


22 Nov, 2011

4 commits


21 Nov, 2011

1 commit

  • To prevent an NFS server from being used to create a directory loop in an NFS
    superblock on the client, the following patch was committed:

    commit 1836750115f20b774e55c032a3893e8c5bdf41ed
    Author: Al Viro
    Date: Tue Jul 12 21:42:24 2011 -0400
    Subject: fix loop checks in d_materialise_unique()

    This causes ELOOP to be reported to anyone trying to access the dentry that
    would otherwise cause the kernel to complete the loop.

    However, no indication is given to the caller as to why an operation that ought
    to work doesn't. The fault is with the kernel, which doesn't want to try and
    solve the problem as it gets horrendously messy if there's another mountpoint
    somewhere in the trees being spliced that can't be moved[*].

    [*] The real problem is that we don't handle the excision of a subtree that
    gets moved _out_ of what we can see. This can happen on the server where a
    directory is merely moved between two other dirs on the same filesystem, but
    where destination dir is not accessible by the client.

    So, given the choice to return ELOOP rather than trying to reconfigure the
    dentry tree, we should give the caller some indication of why they aren't being
    allowed to make what should be a legitimate request and log a message.

    Signed-off-by: David Howells
    Acked-by: Sachin Prabhu
    Signed-off-by: Al Viro

    David Howells
     

20 Nov, 2011

7 commits

  • We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
    else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
    because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets.
    That adds mappings that are not sectorsize aligned, which in some cases causes
    problems for subsequent calls to btrfs_get_extent for similar areas that are
    sectorsize aligned. With this patch I ran xfstests 13 in a loop for a couple of
    hours and didn't hit the problem that I could previously hit in at most 20
    minutes. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When doing the io_ctl helpers to clean up the free space cache stuff I stopped
    using our normal prepare_pages stuff, which means I of course forgot to do
    things like set the pages extent mapped, which will cause us all sorts of
    wonderful problems. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been hitting panics when running xfstest 13 in a loop for long periods of
    time. And actually this problem has always existed so we've been hitting these
    things randomly for a while. Basically what happens is we get a thread coming
    into the allocator and reading the space cache off of disk and adding the
    entries to the free space cache as we go. Then we get another thread that comes
    in and tries to allocate from that block group. Since block_group->cached !=
    BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
    if we're doing the old slow way of caching we don't want to hold people up and
    wait for everything to finish. The problem with this is we could end up
    discarding the space cache at some arbitrary point in the future, which means we
    could very well end up allocating space that is either bad, or when the real
    caching happens it could end up thinking the space isn't in use when it really
    is and cause all sorts of other problems.

    The solution is to add a new flag to indicate we are loading the free space
    cache from disk, and always try to cache the block group if cache->cached !=
    BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
    who tries to allocate from the block group will have to wait until it's finished
    to make sure it completes successfully. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • For the user it is confusing to find something like:
    [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
    in kernel log, because it doesn't point directly to btrfs.

    This patch prefixes those messages with "btrfs:" like other btrfs
    related printks.

    Signed-off-by: Arnd Hannemann
    Signed-off-by: Chris Mason

    Arnd Hannemann
     
  • Round inode bytes and delalloc bytes up to real blocksize before
    converting to sector size. Otherwise, e.g., files smaller than 512
    bytes are reported with zero blocks due to incorrect rounding.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
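
The rounding issue in this commit message comes down to a little arithmetic, sketched here with hypothetical helper names. Shifting a byte count straight down to 512-byte sectors truncates, so anything under 512 bytes becomes zero. Rounding up to the filesystem block size first gives a whole number of blocks.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* stat reports blocks in 512-byte units */

/* Round n up to a multiple of align, which must be a power of two. */
static uint64_t sketch_round_up(uint64_t n, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/* Truncating conversion: small files end up with zero sectors. */
static uint64_t sketch_sectors_naive(uint64_t bytes)
{
	return bytes >> SKETCH_SECTOR_SHIFT;
}

/* Round up to the real block size before converting to sectors. */
static uint64_t sketch_sectors_rounded(uint64_t bytes, uint64_t blocksize)
{
	return sketch_round_up(bytes, blocksize) >> SKETCH_SECTOR_SHIFT;
}
```

For a 100-byte file and an assumed 4096-byte block size, the truncating version yields 0 sectors while the rounded version yields 8.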
     
  • setup_cluster_no_bitmap() searches all the extents and bitmaps starting
    from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
    from offset are in the bitmaps list, so it's sufficient to search from
    this list in setup_cluster_bitmap().

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • Suppose there are two bitmaps [0, 256], [256, 512] and one extent
    [100, 120] in the free space cache, and we want to set up a cluster
    with offset=100, bytes=50.

    In this case, there will be only one bitmap [256, 512] in the temporary
    bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

    The cause is that the list is constructed in setup_cluster_no_bitmap(),
    and only bitmaps with bitmap_entry->offset >= offset are added
    to the list, while the very bitmap that covers offset has
    bitmap_entry->offset < offset.

    Signed-off-by: Chris Mason

    Li Zefan
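
The example in this commit message can be checked with a small sketch. The struct and predicates are hypothetical, and each bitmap is modelled as a half-open range (start offset plus length), which is an assumption about how to read the [0, 256] notation. The first predicate is the list-membership condition the message describes. The second is the condition a bitmap must meet to contain the requested offset.

```c
#include <assert.h>

struct sketch_bitmap {
	unsigned long offset;	/* start of the range the bitmap describes */
	unsigned long bytes;	/* length of that range */
};

/* Condition under which a bitmap is added to the temporary list. */
static int sketch_in_temp_list(const struct sketch_bitmap *b,
			       unsigned long offset)
{
	return b->offset >= offset;
}

/* Condition for a bitmap to cover the requested offset. */
static int sketch_covers(const struct sketch_bitmap *b, unsigned long offset)
{
	assert(b->bytes > 0);
	return b->offset <= offset && offset < b->offset + b->bytes;
}
```

With offset=100, the bitmap starting at 0 covers the offset but fails the list condition, while the bitmap starting at 256 is listed but does not cover it.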