10 Dec, 2011

1 commit

  • * git://git.samba.org/sfrench/cifs-2.6:
    cifs: check for NULL last_entry before calling cifs_save_resume_key
    cifs: attempt to freeze while looping on a receive attempt
    cifs: Fix sparse warning when calling cifs_strtoUCS
    CIFS: Add descriptions to the brlock cache functions

    Linus Torvalds
     

09 Dec, 2011

7 commits

  • Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and
    iowait times") we are reporting idle/io_wait time also while a CPU is
    tickless. We rely on get_{idle,iowait}_time functions to retrieve
    proper data.

    These functions, however, use usecs_to_cputime to translate
    microseconds to cputime64_t. This is just an alias to usecs_to_jiffies,
    which reduces the data type from u64 to unsigned int and also checks
    whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET),
    returning MAX_JIFFY_OFFSET in that case.

    Where the overflow happens depends on CONFIG_HZ, but for CONFIG_HZ_300
    in particular the threshold is quite low (1431649781), so we keep
    returning MAX_JIFFY_OFFSET for more than 3000s until the unsigned int
    itself wraps around. Just for reference, CONFIG_HZ_100 has an overflow
    window of around 20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000 ~2s.

    This showed up as a bug where people saw [h]top going mad, reporting
    100% CPU usage even though there was basically no CPU load. The reason
    was simply that /proc/stat stopped reporting idle/io_wait changes (it
    reported MAX_JIFFY_OFFSET instead), so the only values still changing
    were user and system time.

    Let's use nsecs_to_jiffies64 instead, which doesn't reduce the
    precision to a 32-bit type and is much more appropriate for cumulative
    time values (unlike usecs_to_jiffies, which is intended for timeout
    calculations).
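
    A sketch of what the conversion looks like after the fix (the real
    patch routes this through a per-architecture usecs_to_cputime64()
    helper; abbreviated here):

        static cputime64_t get_idle_time(int cpu)
        {
                u64 idle_time_us = get_cpu_idle_time_us(cpu, NULL);

                if (idle_time_us == -1ULL)
                        /* !NO_HZ: fall back to the tick-based counter */
                        return kstat_cpu(cpu).cpustat.idle;

                /* was: usecs_to_cputime(idle_time_us), which truncates
                 * to unsigned int and clamps at MAX_JIFFY_OFFSET */
                return nsecs_to_jiffies64(idle_time_us * NSEC_PER_USEC);
        }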

    Signed-off-by: Michal Hocko
    Tested-by: Artem S. Tashkinov
    Cc: Dave Jones
    Cc: Arnd Bergmann
    Cc: Alexey Dobriyan
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Fix the build error "directives may not be used inside a macro
    argument" which appears when the kernel is compiled for the cris
    architecture.
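
    For context, gcc rejects preprocessor directives that appear inside a
    macro's argument list. A hypothetical reduction of the pattern (the
    macro and config symbol are made up for illustration):

        #define SETUP(x) do_setup(x)

        SETUP(
        #ifdef CONFIG_ETRAX_FOO   /* error: directives may not be used
                                   * inside a macro argument */
              foo_params
        #else
              default_params
        #endif
        );

        /* the usual fix: hoist the conditional around the invocation */
        #ifdef CONFIG_ETRAX_FOO
        SETUP(foo_params);
        #else
        SETUP(default_params);
        #endif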

    Signed-off-by: Claudio Scordino
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Scordino
     
  • Prior to commit eaf35b1, cifs_save_resume_key had some NULL pointer
    checks at the top. It turns out that at least one of those NULL
    pointer checks is needed after all.

    When the LastNameOffset in a FIND reply appears to be beyond the end of
    the buffer, CIFSFindFirst and CIFSFindNext will set srch_inf.last_entry
    to NULL. Since eaf35b1, the code will now oops in this situation.

    Fix this by having the callers check for a NULL last entry pointer
    before calling cifs_save_resume_key. No change is needed for the
    call site in cifs_readdir as it's not reachable with a NULL
    current_entry pointer.
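
    A minimal sketch of the caller-side guard (against the 3.2-era cifs
    structures; not the verbatim patch):

        /* only stash a resume key if the server gave us a usable
         * last_entry pointer; it is NULL when LastNameOffset is bogus */
        if (cfile->srch_inf.last_entry)
                cifs_save_resume_key(cfile->srch_inf.last_entry, cfile);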

    This should fix:

    https://bugzilla.redhat.com/show_bug.cgi?id=750247

    Cc: stable@vger.kernel.org
    Cc: Christoph Hellwig
    Reported-by: Adam G. Metzler
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • In the recent overhaul of the demultiplex thread receive path, I
    neglected to ensure that we attempt to freeze on each pass through the
    receive loop.
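
    The fix boils down to a try_to_freeze() at the top of each pass; a
    sketch (the receive helper name is illustrative):

        #include <linux/freezer.h>

        /* let the freezer catch us on every iteration, so suspend is
         * not blocked while we wait for data from the server */
        while (total_read < to_read) {
                try_to_freeze();
                length = read_from_socket(server, buf + total_read,
                                          to_read - total_read);
                if (length <= 0)
                        break;
                total_read += length;
        }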

    Reported-and-Tested-by: Woody Suwalski
    Reported-and-Tested-by: Adam Williamson
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • Fix sparse endian check warning while calling cifs_strtoUCS

    CHECK   fs/cifs/smbencrypt.c
    fs/cifs/smbencrypt.c:216:37: warning: incorrect type in argument 1 (different base types)
    fs/cifs/smbencrypt.c:216:37:    expected restricted __le16 [usertype] *
    fs/cifs/smbencrypt.c:216:37:    got unsigned short *
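
    The warning disappears once the buffer carries the endian-annotated
    type that cifs_strtoUCS() expects; a sketch (names approximate):

        /* declare the UTF-16LE buffer with its annotated type instead
         * of casting a plain unsigned short pointer */
        __le16 wpwd[129];    /* was: unsigned short wpwd[129] */

        cifs_strtoUCS(wpwd, password, 128, codepage);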

    Signed-off-by: Steve French
    Acked-by: Shirish Pargaonkar

    Steve French
     
  • Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French

    Pavel Shilovsky
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: drop spin lock when memory alloc fails
    Btrfs: check if the to-be-added device is writable
    Btrfs: try cluster but don't advance in search list
    Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE

    Linus Torvalds
     

07 Dec, 2011

3 commits

    The __d_path() API is asking for trouble, and in the case of apparmor,
    d_namespace_path() got just that. The root cause is that when __d_path()
    misses the root it had been told to look for, it stores the location of
    the most remote ancestor in *root, without grabbing references. Sure, at
    the moment of the call it had been pinned down by what we have in *path.
    But if we raced with umount -l, we could very well have stopped at a
    vfsmount/dentry that got freed as soon as prepend_path() dropped
    vfsmount_lock.

    It is safe to compare these pointers with pre-existing (and known to be
    still alive) vfsmounts and dentries, as long as all we are asking is "is
    it the same address?". Dereferencing is not safe, and apparmor ended up
    stepping into exactly that. d_namespace_path() really wants to examine
    the place where we stopped, even if it's not connected to our namespace,
    and as a result it looked at ->d_sb->s_magic of a dentry that might have
    already been freed by that point. All other callers had been careful
    enough to avoid that, but it's really a bad interface - it invites that
    kind of trouble.

    The fix is fairly straightforward, even though it's bigger than I'd
    like:
    * prepend_path()'s root argument becomes const.
    * __d_path() is never called with a NULL/NULL root. That was a kludge
    to start with. Instead, we have an explicit function - d_absolute_path()
    (usage sketch below). Same as __d_path(), except that it doesn't get
    root passed and stops where it stops. apparmor and tomoyo are using it.
    * __d_path() returns NULL on a path outside of root. The main
    caller is show_mountinfo(), and that's precisely what we pass root for -
    to skip those outside the chroot jail. Those who don't want that can
    (and do) use d_path().
    * __d_path()'s root argument becomes const. Everyone agrees, I hope.
    * apparmor does *NOT* try to use __d_path() or any of its variants
    when it sees that path->mnt is an internal vfsmount. In that case it's
    definitely not mounted anywhere and dentry_path() is exactly what we
    want there. Handling of sysctl()-triggered weirdness is moved to that
    place.
    * if apparmor is asked to do a pathname relative to the chroot jail
    and __d_path() tells it the path is not in that jail, the sucker just
    calls d_absolute_path() instead. That's the other remaining caller of
    __d_path(), BTW.
    * seq_path_root() does _NOT_ return -ENAMETOOLONG (which was stupid
    anyway - the normal seq_file logic will take care of growing the buffer
    and redoing the call of ->show() just fine). However, if it gets a path
    not reachable from root, it returns SEQ_SKIP. The only caller has been
    adjusted (i.e. it stopped ignoring the return value as it used to do).
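
    For reference, a usage sketch of the new d_absolute_path() helper
    (handle_error() is a placeholder):

        /* resolve a path even when it is not reachable from the
         * caller's root; d_absolute_path() stops where it stops */
        char *name, *page = (char *)__get_free_page(GFP_KERNEL);

        if (page) {
                name = d_absolute_path(&path, page, PAGE_SIZE);
                if (IS_ERR(name))   /* -EINVAL or -ENAMETOOLONG */
                        handle_error(PTR_ERR(name));
                free_page((unsigned long)page);
        }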

    Reviewed-by: John Johansen
    Acked-by: John Johansen
    Signed-off-by: Al Viro
    Cc: stable@vger.kernel.org

    Al Viro
     
    Apply the scheme used in log_regrant_write_log_space, where we wake up
    any other threads waiting for log space before the newly added one, to
    log_grant_log_space as well, and factor the code into readable helpers.
    For each of the two queues we add two helpers (see the sketch below):

    - one to try to wake up all waiting threads. This helper will also be
    usable by xfs_log_move_tail once we remove the current opportunistic
    wakeups in it.
    - one to sleep on t_wait until enough log space is available, loosely
    modelled after Linux waitqueues.

    These are then used to reimplement the guts of log_grant_log_space and
    log_regrant_write_log_space. The two functions now use one and the same
    algorithm for waiting on log space instead of the subtly different ones
    they had before, with an option to completely unify them in the near
    future.
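
    A loose sketch of the wake-side helper (names and fields are
    illustrative, not the actual XFS code):

        /* satisfy waiters strictly in queue order; stop at the first
         * ticket that no longer fits into the remaining free space */
        static bool grant_queue_wake(struct list_head *waiters,
                                     int *free_bytes)
        {
                struct xlog_ticket *tic;

                list_for_each_entry(tic, waiters, t_queue) {
                        if (*free_bytes < tic->t_unit_res)
                                return false;  /* keep sleeping */
                        *free_bytes -= tic->t_unit_res;
                        wake_up_process(tic->t_task);
                }
                return true;
        }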

    Also move the filesystem shutdown handling to the common caller given
    that we had to touch it anyway.

    Based on hard debugging and an earlier patch from
    Chandra Seetharaman.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chandra Seetharaman
    Tested-by: Chandra Seetharaman
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
    The i_ino field in the VFS inode is of type unsigned long and thus
    can't hold the full 64-bit inode number on 32-bit kernels. We have the
    full inode number in the XFS inode, so use that one for NFS exports.
    Note that I've also switched the 32-bit file handle types to it, just
    to make the code more consistent and copy & paste errors less likely.
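
    The gist of the change in the file handle encoding path (a sketch,
    not the verbatim patch):

        struct xfs_inode *ip = XFS_I(inode);

        /* the XFS inode carries the full 64-bit number (xfs_ino_t);
         * the VFS i_ino is an unsigned long, 32 bits on 32-bit kernels */
        fid->ino = ip->i_ino;        /* was: fid->ino = inode->i_ino; */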

    Reported-by: Guoquan Yang
    Reported-by: Hank Peng
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

03 Dec, 2011

2 commits

  • When testing the new xfstests --large-fs option that does very large
    file preallocations, this assert was tripped deep in
    xfs_alloc_vextent():

    XFS: Assertion failed: args->minlen <= args->maxlen, file: fs/xfs/xfs_alloc.c, line: 2239

    The allocation was trying to allocate a zero length extent because
    the lower 32 bits of the allocation length were zero. The remaining
    length of the allocation to be done was an exact multiple of 2^32 -
    the first case I saw was at 496TB remaining to be allocated.

    This turns out to be an overflow when converting the allocation
    length (a 64 bit quantity) into the extent length to allocate (a 32
    bit quantity), and it requires the length to be allocated an exact
    multiple of 2^32 blocks to trip the assert.

    Fix it by limiting the extent length to allocate to MAXEXTLEN.
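
    The one-line shape of the fix, in the allocation length setup (a
    sketch; variable names approximate):

        /* clamp the 64-bit remaining length to MAXEXTLEN before it is
         * narrowed to a 32-bit extent length, so an exact multiple of
         * 2^32 can no longer truncate to a zero-length allocation */
        alen = XFS_FILBLKS_MIN(len, MAXEXTLEN);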

    Signed-off-by: Dave Chinner
    Signed-off-by: Ben Myers
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: fix attr2 vs large data fork assert
    xfs: force buffer writeback before blocking on the ilock in inode reclaim
    xfs: validate acl count

    Linus Torvalds
     

02 Dec, 2011

3 commits

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (31 commits)
    ocfs2: avoid unaligned access to dqc_bitmap
    ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
    ocfs2: honor O_(D)SYNC flag in fallocate
    ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
    ocfs2: send correct UUID to cleancache initialization
    ocfs2: Commit transactions in error cases -v2
    ocfs2: make direntry invalid when deleting it
    fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
    ocfs2: Avoid livelock in ocfs2_readpage()
    ocfs2: serialize unaligned aio
    ocfs2: Implement llseek()
    ocfs2: Fix ocfs2_page_mkwrite()
    ocfs2: Add comment about orphan scanning
    ocfs2: Clean up messages in the fs
    ocfs2/cluster: Cluster up now includes network connections too
    ocfs2/cluster: Add new function o2net_fill_node_map()
    ocfs2/cluster: Fix output in file elapsed_time_in_ms
    ocfs2/dlm: dlmlock_remote() needs to account for remastery
    ocfs2/dlm: Take inflight reference count for remotely mastered resources too
    ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery()
    ...

    Linus Torvalds
     
    The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit
    aligned, but not 64-bit aligned. The dqc_bitmap is accessed by
    ocfs2_set_bit(), ocfs2_clear_bit(), ocfs2_test_bit(), and
    ocfs2_find_next_zero_bit(). These are wrapper macros for ext2_*_bit(),
    which need to take an unsigned-long-aligned address (though some
    architectures are able to handle unaligned accesses correctly).

    So some 64-bit architectures may not be able to access the dqc_bitmap
    correctly.

    This avoids such unaligned access by using another set of wrapper
    functions for ext2_*_bit(). The code is taken from fs/ext4/mballoc.c,
    which also needs to handle unaligned bitmap access.
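
    The borrowed ext4 helper aligns the pointer down to a long boundary
    and folds the dropped bytes into the bit offset; a sketch of the
    pattern (after mb_correct_addr_and_bit() in fs/ext4/mballoc.c):

        static inline void *correct_addr_and_bit(int *bit, void *addr)
        {
        #if BITS_PER_LONG == 64
                *bit += ((unsigned long)addr & 7UL) << 3;
                addr = (void *)((unsigned long)addr & ~7UL);
        #elif BITS_PER_LONG == 32
                *bit += ((unsigned long)addr & 3UL) << 3;
                addr = (void *)((unsigned long)addr & ~3UL);
        #endif
                return addr;
        }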

    Signed-off-by: Akinobu Mita
    Acked-by: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Joel Becker

    Akinobu Mita
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix meta data raid-repair merge problem
    Btrfs: skip allocation attempt from empty cluster
    Btrfs: skip block groups without enough space for a cluster
    Btrfs: start search for new cluster at the beginning
    Btrfs: reset cluster's max_size when creating bitmap
    Btrfs: initialize new bitmaps' list
    Btrfs: fix oops when calling statfs on readonly device
    Btrfs: Don't error on resizing FS to same size
    Btrfs: fix deadlock on metadata reservation when evicting a inode
    Fix URL of btrfs-progs git repository in docs
    btrfs scrub: handle -ENOMEM from init_ipath()

    Linus Torvalds
     

01 Dec, 2011

10 commits

  • Commit 4a54c8c16 introduced raid-repair, killing the individual
    readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
    4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
    disk-io.c.

    The raid-repair commit had logic to disable raid-repair if a
    readpage_io_failed_hook is set. Thus, the readahead commit effectively
    disabled raid-repair for metadata.

    This commit changes the logic to always attempt raid-repair when
    needed and to call the readpage_io_failed_hook only if raid-repair
    fails. This is much more straightforward and should have been like
    that from the beginning.
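
    In control-flow terms the change looks roughly like this (helper
    names and signatures are simplified, not the exact btrfs code):

        /* always try raid-repair first, for data and metadata alike;
         * fall back to the per-tree hook only when repair fails */
        if (read_failed) {
                if (try_repair_from_other_mirror(tree, page,
                                                 start, end) == 0)
                        return;   /* repair submitted a re-read */
                if (tree->ops && tree->ops->readpage_io_failed_hook)
                        tree->ops->readpage_io_failed_hook(page,
                                                           mirror_num);
        }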

    Signed-off-by: Jan Schmidt
    Reported-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • If we don't have a cluster, don't bother trying to allocate from it,
    jumping right away to the attempt to allocate a new cluster.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • We test whether a block group has enough free space to hold the
    requested block, but when we're doing clustered allocation, we can
    save some cycles by testing whether it has enough room for the cluster
    upfront, otherwise we end up attempting to set up a cluster and
    failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
    allocation, and by then we'll have zeroed the cluster size, so this
    patch won't stop us from using the block group as a last resort.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
    Instead of starting at zero (offset is always zero), request a cluster
    starting at search_start, which denotes the beginning of the current
    block group.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • The field that indicates the size of the largest contiguous chunk of
    free space in the cluster is not initialized when setting up bitmaps,
    it's only increased when we find a larger contiguous chunk. We end up
    retaining a larger value than appropriate for highly-fragmented
    clusters, which may cause pointless searches for large contiguous
    groups, and even cause clusters that do not meet the density
    requirements to be set up.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • We're failing to create clusters with bitmaps because
    setup_cluster_no_bitmap checks that the list is empty before inserting
    the bitmap entry in the list for setup_cluster_bitmap, but the list
    field is only initialized when it is restored from the on-disk free
    space cache, or when it is written out to disk.

    Besides a potential race condition due to the multiple use of the list
    field, filesystem performance severely degrades over time: as we use
    up all non-bitmap free extents, the try-to-set-up-cluster dance is
    done at every metadata block allocation. For every block group, we
    fail to set up a cluster, and after failing on them all up to twice,
    we fall back to the much slower unclustered allocation.

    To make matters worse, before the unclustered allocation, we try to
    create new block groups until we reach the 1% threshold, which
    introduces additional bitmaps and thus block groups that we'll iterate
    over at each metadata block request.
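
    The fix itself is presumably a one-liner at bitmap creation time; a
    sketch (surrounding function abbreviated from the 3.2-era free-space
    cache code):

        static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
                                   struct btrfs_free_space *info,
                                   u64 offset)
        {
                info->offset = offset_to_bitmap(ctl, offset);
                info->bytes = 0;
                INIT_LIST_HEAD(&info->list);  /* the missing init */
                link_free_space(ctl, info);
                ctl->total_bitmaps++;
                ctl->op->recalc_thresholds(ctl);
        }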

    Alexandre Oliva
     
  • To reproduce this bug:

    # dd if=/dev/zero of=img bs=1M count=256
    # mkfs.btrfs img
    # losetup -r /dev/loop1 img
    # mount /dev/loop1 /mnt
    OOPS!!

    It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

    To fix this, instead of checking write-only devices, we check all open
    devices:

    # df -h /dev/loop1
    Filesystem Size Used Avail Use% Mounted on
    /dev/loop1 250M 28K 238M 1% /mnt

    Signed-off-by: Li Zefan

    Li Zefan
     
  • It seems overly harsh to fail a resize of a btrfs file system to the
    same size when a shrink or grow would succeed. User app GParted trips
    over this error. Allow it by bypassing the shrink or grow operation.

    Signed-off-by: Mike Fleetwood

    Mike Fleetwood
     
    When I ran the xfstests, I found the test tasks were blocked on
    meta-data reservation.

    By debugging, I found the cause of this bug, a reservation cycle:

        start transaction
                |
                v
        reserve meta-data space
                |
                v
        flush delay allocation -> iput inode -> evict inode
                ^                                   |
                |                                   v
        wait for delay allocation flush <- reserve meta-data space

    Eviction reserves meta-data space again and then waits for the very
    delalloc flush it was invoked from, so the task deadlocks.

    Miao Xie
     
  • init_ipath() can return an ERR_PTR(-ENOMEM).
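
    So the call site needs an IS_ERR() check; a sketch:

        /* init_ipath() returns ERR_PTR(-ENOMEM) on allocation failure,
         * so check before using the result */
        ipath = init_ipath(4096, root, path);
        if (IS_ERR(ipath))
                return PTR_ERR(ipath);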

    Signed-off-by: Dan Carpenter

    Dan Carpenter
     

30 Nov, 2011

3 commits

    With Dmitry's fsstress updates I've seen very reproducible crashes in
    xfs_attr_shortform_remove because xfs_attr_shortform_bytesfit claims
    that the attributes would not fit inline into the inode after removing
    an attribute. It turns out that we were operating on an inode with lots
    of delalloc extents, and thus an if_bytes value for the data fork that
    is larger than the biggest possible on-disk storage for it, which
    utterly confuses the code near the end of xfs_attr_shortform_bytesfit.

    Fix this by always allowing the current attribute fork, like we
    already do for the attr1 format, given that delalloc conversion will
    take care of moving either the data or attribute area out of line if
    it doesn't fit at that point - or will make the point moot by merging
    extents.

    Also document the function better, and clean up some loose bits.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
    If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encounter a flush-locked inode,
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can see hangs of up to 30 seconds during workloads
    hitting synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • * 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix racy use-after-free in ext4_end_io_dio()

    Linus Torvalds
     

25 Nov, 2011

1 commit

    ext4_end_io_dio() queues io_end->work and then clears iocb->private;
    however, io_end->work calls aio_complete(), which frees the iocb
    object. If that slab object gets reallocated, ext4_end_io_dio() can
    end up clearing someone else's iocb->private; this use-after-free can
    also cause a leak of a struct ext4_io_end_t.

    Detected and tested with slab poisoning.
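
    The fix is to detach the iocb before the work item can possibly run;
    a sketch of the reordering (not the verbatim patch):

        /* clear iocb->private *before* queueing the work: once queued,
         * the work may call aio_complete() and free the iocb, so any
         * later access to iocb is a use-after-free */
        iocb->private = NULL;
        io_end->iocb = iocb;
        io_end->result = ret;
        queue_work(wq, &io_end->work);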

    [ Note: Can also reproduce using 12 fio's against 12 file systems with the
    following configuration file:

    [global]
    direct=1
    ioengine=libaio
    iodepth=1
    bs=4k
    ba=4k
    size=128m

    [create]
    filename=${TESTDIR}
    rw=write

    -- tytso ]

    Google-Bug-Id: 5354697
    Signed-off-by: Tejun Heo
    Signed-off-by: "Theodore Ts'o"
    Reported-by: Kent Overstreet
    Tested-by: Kent Overstreet
    Cc: stable@kernel.org

    Tejun Heo
     

24 Nov, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
    eCryptfs: Extend array bounds for all filename chars
    eCryptfs: Flush file in vma close
    eCryptfs: Prevent file create race condition

    Linus Torvalds
     
  • From mhalcrow's original commit message:

    Characters with ASCII values greater than the size of
    filename_rev_map[] are valid filename characters.
    ecryptfs_decode_from_filename() will access kernel memory beyond
    that array, and ecryptfs_parse_tag_70_packet() will then decrypt
    those characters. The attacker, using the FNEK of the crafted file,
    can then re-encrypt the characters to reveal the kernel memory past
    the end of the filename_rev_map[] array. I expect low security
    impact since this array is statically allocated in the text area,
    and the amount of memory past the array that is accessible is
    limited by the largest possible ASCII filename character.

    This patch solves the issue reported by mhalcrow but with an
    implementation suggested by Linus to simply extend the length of
    filename_rev_map[] to 256. Characters greater than 0x7A are mapped to
    0x00, which is how invalid characters less than 0x7A were previously
    being handled.
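
    The shape of the fix (table contents elided; in C, trailing entries
    of a partially initialized static array default to zero):

        /* cover all 256 possible byte values so decoding can never
         * index past the end; bytes above 0x7A stay 0x00 (invalid) */
        static const unsigned char filename_rev_map[256] = {
                0x00, 0x00, 0x00, 0x00, /* ... 0x00-0x7A as before ... */
                /* 0x7B..0xFF are implicitly 0x00 */
        };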

    Signed-off-by: Tyler Hicks
    Reported-by: Michael Halcrow
    Cc: stable@kernel.org

    Tyler Hicks