21 Apr, 2014

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "These are regression and bug fixes for ext4.

    We had a number of new features in ext4 during this merge window
    (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
    there were many more regression and bug fixes this time around. It
    didn't help that xfstests hadn't been fully updated to fully stress
    test COLLAPSE_RANGE until after -rc1"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
    ext4: disable COLLAPSE_RANGE for bigalloc
    ext4: fix COLLAPSE_RANGE failure with 1KB block size
    ext4: use EINVAL if not a regular file in ext4_collapse_range()
    ext4: enforce we are operating on a regular file in ext4_zero_range()
    ext4: fix extent merging in ext4_ext_shift_path_extents()
    ext4: discard preallocations after removing space
    ext4: no need to truncate pagecache twice in collapse range
    ext4: fix removing status extents in ext4_collapse_range()
    ext4: use filemap_write_and_wait_range() correctly in collapse range
    ext4: use truncate_pagecache() in collapse range
    ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
    ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
    ext4: always check ext4_ext_find_extent result
    ext4: fix error handling in ext4_ext_shift_extents
    ext4: silence sparse check warning for function ext4_trim_extent
    ext4: COLLAPSE_RANGE only works on extent-based files
    ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
    ext4: use i_size_read in ext4_unaligned_aio()
    fs: disallow all fallocate operation on active swapfile
    fs: move falloc collapse range check into the filesystem methods
    ...

    Linus Torvalds
     

20 Apr, 2014

3 commits

  • Disable COLLAPSE_RANGE for ext4 with the bigalloc feature until the root
    cause of the problem is found. It will be re-enabled once the regressions
    in xfstests generic/075 and 091 are fixed.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reviewed-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Namjae Jeon
     
  • When formatting with a 1KB or 2KB block size (not aligned with
    PAGE_SIZE), xfstests generic/075 and 091 fail. The offset supplied to
    truncate_pagecache_range() is block size aligned. In that function the
    start offset is re-aligned to PAGE_SIZE by rounding up to the next page
    boundary. Due to this rounding up, old data remains in the page cache
    when the block size is less than the page size and the start offset is
    not aligned with the page size. In the collapse range case, we need to
    align the start offset to a page boundary by rounding down instead of
    rounding up.
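
    The rounding difference described above can be sketched in userspace C.
    The helper names and the 4096-byte page size used in the usage note are
    assumptions for illustration; this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; page_size must be a non-zero power of two. */
static uint64_t page_round_down(uint64_t offset, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    /* Clear the low bits: start of the page containing offset. */
    return offset & ~(page_size - 1);
}

static uint64_t page_round_up(uint64_t offset, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    /* Next page boundary at or after offset. */
    return (offset + page_size - 1) & ~(page_size - 1);
}
```

    With 1KB blocks and an assumed 4096-byte page, a block-aligned offset of
    5120 rounds up to 8192 and down to 4096. Rounding up skips the cached
    bytes between 5120 and 8192, which is the stale data the message
    describes; rounding down covers them.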

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"

    Namjae Jeon
     
  • A va_list needs to be copied in case it needs to be used twice.

    Thanks to Hugh for debugging this issue, leading to various panics.

    Tested:

    lpq84:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern

    'produce_core' is simply : main() { *(int *)0 = 1;}

    lpq84:~# ./produce_core
    Segmentation fault (core dumped)
    lpq84:~# dmesg | tail -1
    [ 614.352947] Core dump to |/foobar12345 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 (null) pipe failed

    Notice the last argument was replaced by a NULL (we were lucky enough to
    not crash, but do not try this on your production machine !)

    After fix :

    lpq83:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern
    lpq83:~# ./produce_core
    Segmentation fault
    lpq83:~# dmesg | tail -1
    [ 740.800441] Core dump to |/foobar12345 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 pipe failed
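
    The general rule behind this fix can be sketched in portable C: a
    va_list that is consumed twice must be duplicated with va_copy() first.
    The function names below are hypothetical and this is not the kernel's
    cn_vprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated buffer. */
static char *format_alloc_v(const char *fmt, va_list ap)
{
    va_list ap_copy;
    int need;
    char *buf;

    assert(fmt != NULL);

    /* First pass measures the output and consumes only the copy. */
    va_copy(ap_copy, ap);
    need = vsnprintf(NULL, 0, fmt, ap_copy);
    va_end(ap_copy);
    if (need < 0)
        return NULL;

    buf = malloc((size_t)need + 1);
    if (!buf)
        return NULL;

    /* Second pass consumes the original, still untouched, list. */
    vsnprintf(buf, (size_t)need + 1, fmt, ap);
    return buf;
}

static char *format_alloc(const char *fmt, ...)
{
    va_list ap;
    char *s;

    va_start(ap, fmt);
    s = format_alloc_v(fmt, ap);
    va_end(ap);
    return s;
}
```

    Passing the same va_list to both vsnprintf() calls without the copy
    leaves its state indeterminate for the second call, which matches the
    kind of corrupted last argument shown in the output above.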

    Fixes: 5fe9d8ca21cc ("coredump: cn_vprintf() has no reason to call vsnprintf() twice")
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Hugh Dickins
    Acked-by: Oleg Nesterov
    Cc: Neil Horman
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org # 3.11+
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Apr, 2014

2 commits

  • Pull cifs fixes from Steve French:
    "A set of 5 small cifs fixes"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cif: fix dead code
    cifs: fix error handling cifs_user_readv
    fs: cifs: remove unused variable.
    Return correct error on query of xattr on file with empty xattrs
    cifs: Wait for writebacks to complete before attempting write.

    Linus Torvalds
     
  • Pull driver core fixes from Greg KH:
    "Here are some driver core fixes for 3.15-rc2. Also in here are some
    documentation updates, as well as an API removal that had to wait for
    after -rc1 due to the cleanups coming into you from multiple developer
    trees (this one and the PPC tree.)

    All have been in linux next successfully"

    * tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    drivers/base/dd.c incorrect pr_debug() parameters
    Documentation: Update stable address in Chinese and Japanese translations
    topology: Fix compilation warning when not in SMP
    Chinese: add translation of io_ordering.txt
    stable_kernel_rules: spelling/word usage
    sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
    kernfs: protect lazy kernfs_iattrs allocation with mutex
    fs: Don't return 0 from get_anon_bdev

    Linus Torvalds
     

18 Apr, 2014

8 commits

  • Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Signed-off-by: Jon Ernst
    Signed-off-by: "Theodore Ts'o"

    jon ernst
     
  • There is a bug in ext4_ext_shift_path_extents() where if we actually
    manage to merge an extent we would skip shifting the next extent. This
    will result in one extent in the extent tree not being properly
    shifted.

    This is causing failure in various xfstests tests using fsx or fsstress
    with collapse range support. It will also cause file system corruption
    which looks something like:

    e2fsck 1.42.9 (4-Feb-2014)
    Pass 1: Checking inodes, blocks, and sizes
    Inode 20 has out of order extents
    (invalid logical block 3, physical block 492938, len 2)
    Clear? yes
    ...

    when running e2fsck.

    It's also very easily reproducible just by running fsx without any
    parameters. I can usually hit the problem within a minute.

    Fix it by increasing ex_start only if we're not merging the extent.
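
    The loop shape described here can be sketched on a plain array. This is
    a simplified model with hypothetical names: it tracks only logical
    start and length, and merges whenever two extents become logically
    adjacent, whereas a real filesystem has further merge conditions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified extent: logical start block and length in blocks. */
struct extent {
    unsigned start;
    unsigned len;
};

/*
 * Shift extents [first, *count) left by `shift` blocks, merging an
 * extent into its left neighbour when the two become adjacent.
 */
static void shift_extents(struct extent *ex, size_t *count,
                          size_t first, unsigned shift)
{
    size_t i = first;

    while (i < *count) {
        assert(ex[i].start >= shift);
        ex[i].start -= shift;

        if (i > 0 && ex[i - 1].start + ex[i - 1].len == ex[i].start) {
            /* Merge into the left neighbour and close the gap. */
            ex[i - 1].len += ex[i].len;
            for (size_t j = i; j + 1 < *count; j++)
                ex[j] = ex[j + 1];
            (*count)--;
            /* Do not advance: slot i now holds an unshifted extent. */
        } else {
            i++;
        }
    }
}
```

    Advancing the index unconditionally after a merge would step over the
    extent that just moved into slot i, leaving it unshifted.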

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Namjae Jeon

    Lukas Czerner
     
  • Currently in ext4_collapse_range() and ext4_punch_hole() we're
    discarding preallocation twice. Once before we attempt to do any changes
    and second time after we're done with the changes.

    While the second call to ext4_discard_preallocations() in
    ext4_punch_hole() case is not needed, we need to discard preallocation
    right after ext4_ext_remove_space() in collapse range case because in
    the case we had to restart a transaction in the middle of removing space
    we might have new preallocations created.

    Remove the unneeded ext4_discard_preallocations() call from
    ext4_punch_hole() and move it to a better place in ext4_collapse_range().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • We're already calling truncate_pagecache() before we attempt to do any
    actual work, so there is no need to truncate the pagecache once more using
    truncate_setsize() after we're finished.

    Remove truncate_setsize() and replace it with just i_size_write(). Note
    that we're holding the appropriate locks.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Currently in ext4_collapse_range() when calling ext4_es_remove_extent() to
    remove status extents we're passing (EXT_MAX_BLOCKS - punch_start - 1)
    in order to remove all extents from start of the collapse range to the
    end of the file. However this is wrong because we might miss the
    possible extent covering the last block of the file.

    Fix it by removing the -1.
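
    The off-by-one can be checked with a little arithmetic. In the sketch
    below, MAX_BLOCKS is a stand-in for EXT_MAX_BLOCKS with an assumed
    value, and range_last_block() is a hypothetical helper for a range
    given as (start, length).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for EXT_MAX_BLOCKS; the value is an assumption. */
#define MAX_BLOCKS 0xffffffffu

/* Last block covered by a range of `len` blocks starting at `lblk`. */
static uint32_t range_last_block(uint32_t lblk, uint32_t len)
{
    assert(len > 0);
    return lblk + len - 1;
}
```

    For a start block of 100, a length of MAX_BLOCKS - 100 - 1 ends at
    block MAX_BLOCKS - 2, while a length of MAX_BLOCKS - 100 ends at
    MAX_BLOCKS - 1. The extra -1 therefore leaves one block uncovered.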

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Namjae Jeon

    Lukas Czerner
     
  • Currently we're passing -1 as the lend argument for
    filemap_write_and_wait_range(), which is wrong since lend is a signed type,
    so it would cause some confusion and we might not write_and_wait for the
    entire range we're expecting to write.

    Fix it by using LLONG_MAX instead.
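
    Why -1 is a poor "end of file" marker for a signed offset can be shown
    with a small check. range_is_empty() is a hypothetical stand-in for the
    sort of guard a range-based helper might apply; it is not taken from
    the kernel source.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical guard on an inclusive signed byte range [start, end]. */
static bool range_is_empty(long long start, long long end)
{
    assert(start >= 0);
    /* With a signed type, -1 sorts before every valid start offset. */
    return end < start;
}
```

    Under this assumption, range_is_empty(4096, -1) is true, so nothing
    would be processed, while range_is_empty(4096, LLONG_MAX) is false and
    the range extends to the largest representable offset.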

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • We should be using truncate_pagecache() instead of
    truncate_pagecache_range() in the collapse range because we're
    truncating page cache from offset to the end of file.
    truncate_pagecache() also gets rid of the private COWed pages from the
    range because we're going to shift the end of the file.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     

17 Apr, 2014

14 commits

  • This issue was found by Coverity (CID 1202536)

    This proposes a fix for a statement that creates dead code.
    The "rc < 0" statement is within code that is run
    with "rc > 0".

    It seems like "err < 0" was meant to be used here.
    This way, the error code is returned by the function.

    Signed-off-by: Michael Opdenacker
    Acked-by: Al Viro
    Signed-off-by: Steve French

    Michael Opdenacker
     
  • Coverity says:

    *** CID 1202537: Dereference after null check (FORWARD_NULL)
    /fs/cifs/file.c: 2873 in cifs_user_readv()
    2867 cur_len = min_t(const size_t, len - total_read, cifs_sb->rsize);
    2868 npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
    2869
    2870 /* allocate a readdata struct */
    2871 rdata = cifs_readdata_alloc(npages,
    2872 cifs_uncached_readv_complete);
    >>> CID 1202537: Dereference after null check (FORWARD_NULL)
    >>> Comparing "rdata" to null implies that "rdata" might be null.
    2873 if (!rdata) {
    2874 rc = -ENOMEM;
    2875 goto error;
    2876 }
    2877
    2878 rc = cifs_read_allocate_pages(rdata, npages);

    ...when we "goto error", rc will be non-zero, and then we end up trying
    to do a kref_put on the rdata (which is NULL). Fix this by replacing
    the "goto error" with a "break".

    Reported-by:
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French

    Jeff Layton
     
  • xfstests generic/004 reproduces an ilock deadlock using the tmpfile
    interface when selinux is enabled. This occurs because
    xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
    latter eventually calls into xfs_xattr_get() which attempts to get the
    lock again. E.g.:

    xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080
    ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
    00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
    ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
    Call Trace:
    [] schedule+0x29/0x70
    [] rwsem_down_read_failed+0xc5/0x120
    [] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] call_rwsem_down_read_failed+0x14/0x30
    [] ? down_read_nested+0x89/0xa0
    [] ? xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] xfs_attr_get+0x90/0xe0 [xfs]
    [] xfs_xattr_get+0x37/0x50 [xfs]
    [] generic_getxattr+0x4f/0x70
    [] inode_doinit_with_dentry+0x1ae/0x650
    [] selinux_d_instantiate+0x1c/0x20
    [] security_d_instantiate+0x1b/0x30
    [] d_instantiate+0x50/0x70
    [] d_tmpfile+0xb5/0xc0
    [] xfs_create_tmpfile+0x362/0x410 [xfs]
    [] xfs_vn_tmpfile+0x18/0x20 [xfs]
    [] path_openat+0x228/0x6a0
    [] ? sched_clock+0x9/0x10
    [] ? kvm_clock_read+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_filp_open+0x3a/0x90
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_sys_open+0x12e/0x210
    [] SyS_open+0x1e/0x20
    [] system_call_fastpath+0x16/0x1b

    xfs_vn_tmpfile() also fails to initialize security on the newly created
    inode.

    Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
    has been committed and the inode unlocked. Also, initialize security on
    the inode based on the parent directory provided via the tmpfile call.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • When testing exhaustion of dm snapshots, the following appeared
    with CONFIG_DEBUG_OBJECTS_FREE enabled:

    ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]

    indicating that we'd freed a buffer which still had a pending reference,
    down this path:

    [ 190.867975] [] debug_check_no_obj_freed+0x22b/0x270
    [ 190.880820] [] kmem_cache_free+0xd0/0x370
    [ 190.892615] [] xfs_buf_free+0xe4/0x210 [xfs]
    [ 190.905629] [] xfs_buf_rele+0xe7/0x270 [xfs]
    [ 190.911770] [] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]

    At issue is the fact that if IO fails in xfs_buf_iorequest,
    we'll queue completion unconditionally, and then call
    xfs_buf_rele; but if IO failed, there are no IOs remaining,
    and xfs_buf_rele will free the bp while work is still queued.

    Fix this by not scheduling completion if the buffer has
    an error on it; run it immediately. The rest is only comment
    changes.

    Thanks to dchinner for spotting the root cause.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • We negate the error value being returned from a generic function
    incorrectly. The code path that it is running in returned negative
    errors, so there is no need to negate it to get the correct error
    signs here.

    This was uncovered by generic/019.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • An interesting situation can occur if a log IO error occurs during
    the unmount of a filesystem. The cases reported have the same
    signature - the update of the superblock counters fails due to a log
    write IO error:

    XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1
    XFS (dm-16): Log I/O Error Detected. Shutting down filesystem
    XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
    XFS (dm-16): xfs_log_force: error 5 returned.
    XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)

    It can be seen that the last line of output contains a corrupt
    device name - this is because the log and xfs_mount structures have
    already been freed by the time this message is printed. A kernel
    oops closely follows.

    The issue is that the shutdown is occurring in a separate IO
    completion thread to the unmount. Once the shutdown processing has
    started and all the iclogs are marked with XLOG_STATE_IOERROR, the
    log shutdown code wakes anyone waiting on a log force so they can
    process the shutdown error. This wakes up the unmount code that
    is doing a synchronous transaction to update the superblock
    counters.

    The unmount path now sees all the iclogs are marked with
    XLOG_STATE_IOERROR and so never waits on them again, knowing that if
    it does, there will not be a wakeup trigger for it and we will hang
    the unmount if we do. Hence the unmount runs through all the
    remaining code and frees all the filesystem structures while the
    xlog_iodone() is still processing the shutdown. When the log
    shutdown processing completes, xfs_do_force_shutdown() emits the
    "Please umount the filesystem and rectify the problem(s)" message,
    and xlog_iodone() then aborts all the objects attached to the iclog.
    An iclog that has already been freed....

    The real issue here is that there is no serialisation point between
    the log IO and the unmount. We have serialisation points for log
    writes, log forces, reservations, etc, but we don't actually have
    any code that waits for log IO to fully complete. We do that for all
    other types of object, so why not iclogbufs?

    Well, it turns out that we can easily do this. We've got xfs_buf
    handles, and that's what everyone else uses for IO serialisation.
    i.e. bp->b_sema. So, let's hold iclogbufs locked over IO, and only
    release the lock in xlog_iodone() when we are finished with the
    buffer. That way before we tear down the iclog, we can lock and
    unlock the buffer to ensure IO completion has finished completely
    before we tear it down.

    Signed-off-by: Dave Chinner
    Tested-by: Mike Snitzer
    Tested-by: Bob Mastors
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • FSX has been detecting data corruption after collapse range
    calls. The key observation is that the offset of the last extent in
    the file was not being shifted, and hence when the file size was
    adjusted it was truncating away data because the extents hadn't
    been correctly shifted.

    Tracing indicated that before the collapse, the extent list looked
    like:

    ....
    ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

    and after the shift of 2 blocks:

    ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

    Note that the last extent did not change offset. After the changing
    of the file size:

    ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0

    You can see that the last extent had its length truncated,
    indicating that we've lost data.

    The reason for this is that the xfs_bmap_shift_extents() loop uses
    XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
    This, unfortunately, doesn't take into account delayed allocation
    extents - it's a count of physically allocated extents - and hence
    when the file being collapsed has a delalloc extent like this one
    does prior to the range being collapsed:

    ....
    ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0
    ....

    it gets the count wrong and terminates the shift loop early.

    Fix it by using the in-memory extent array size that includes
    delayed allocation extents to determine the number of extents on the
    inode.
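
    The early loop termination can be modelled on a small array. The types
    and function names are hypothetical; the offsets in the usage note
    reuse the ones from the trace above, with all other values simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified in-memory extent record. */
struct mem_extent {
    unsigned long long offset;
    unsigned long long count;
    bool delalloc;              /* no blocks allocated on disk yet */
};

/* Count only extents that already have blocks allocated on disk. */
static size_t allocated_extent_count(const struct mem_extent *ex,
                                     size_t total)
{
    size_t n = 0;

    for (size_t i = 0; i < total; i++)
        if (!ex[i].delalloc)
            n++;
    return n;
}

/* Visit the first `limit` records, shifting those at or past `from`. */
static void shift_first_n(struct mem_extent *ex, size_t limit,
                          unsigned long long from,
                          unsigned long long shift)
{
    for (size_t i = 0; i < limit; i++)
        if (ex[i].offset >= from)
            ex[i].offset -= shift;
}
```

    With five records of which one is delalloc, using the allocated count
    (4) as the loop limit never visits the last record, so its offset stays
    at 86. Using the full array size (5) shifts it to 84.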

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Al Viro tracked down the problem that has caused generic/263 to fail
    on XFS since the test was introduced. It is caused by
    xfs_get_blocks() mapping a single extent that spans EOF without
    marking it as buffer_new() so that the direct IO code does not zero
    the tail of the block at the new EOF. This is a long standing bug
    that has been around for many, many years.

    Because xfs_get_blocks() starts the map before EOF, it can't set
    buffer_new(), because that causes the direct IO code to also zero
    unaligned sectors at the head of the IO. This would overwrite valid
    data with zeros, and hence we cannot validly return a single extent
    that spans EOF to direct IO.

    Fix this by detecting a mapping that spans EOF and truncating it down
    to EOF. This results in the direct IO code doing the right thing
    for unaligned data blocks before EOF, and then returning to get
    another mapping for the region beyond EOF which XFS treats correctly
    by setting buffer_new() on it. This makes direct IO behave correctly
    w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.

    Again, thanks to Al Viro for finding what I couldn't.

    [ dchinner: Fix for __divdi3 build error:

    Reported-by: Paul Gortmaker
    Tested-by: Paul Gortmaker
    Signed-off-by: Mark Tinguely
    Reviewed-by: Eric Sandeen
    ]

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • All device_schedule_callback_owner() users are converted to use
    device_remove_file_self(). Remove now unused
    {sysfs|device}_schedule_callback_owner().

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • kernfs_iattrs is allocated lazily when operations which require it
    take place; unfortunately, the lazy allocation and returning weren't
    properly synchronized and when there are multiple concurrent
    operations, it might end up returning kernfs_iattrs which hasn't
    finished initialization yet or different copies to different callers.

    Fix it by synchronizing with a mutex. This can be smarter with memory
    barriers but let's go there if it actually turns out to be necessary.
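
    Lazy allocation guarded by a mutex can be sketched in userspace with
    POSIX threads. All names are hypothetical and the attribute fields are
    placeholders; this is an analogy to the approach described, not the
    kernfs code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Placeholder attribute block, allocated on first use. */
struct attrs {
    int mode;
    int uid;
};

struct node {
    struct attrs *attrs;
};

static pthread_mutex_t attrs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the node's attrs, allocating them on first use. */
static struct attrs *node_attrs(struct node *n)
{
    struct attrs *a;

    assert(n != NULL);
    pthread_mutex_lock(&attrs_lock);
    if (!n->attrs) {
        a = calloc(1, sizeof(*a));
        if (a) {
            a->mode = 0644;     /* finish initialisation first ... */
            n->attrs = a;       /* ... then publish the pointer */
        }
    }
    a = n->attrs;
    pthread_mutex_unlock(&attrs_lock);
    return a;
}
```

    Because the check, allocation, initialisation and publication all
    happen under one lock, concurrent callers see either no attrs or one
    fully initialised copy, and every caller gets the same pointer.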

    Signed-off-by: Tejun Heo
    Link: http://lkml.kernel.org/g/533ABA32.9080602@oracle.com
    Reported-by: Sasha Levin
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • Commit 9e30cc9595303b27b48 removed an internal mount. This
    has the side-effect that rootfs now has FSID 0. Many
    userspace utilities assume that st_dev in struct stat
    is never 0, so this change breaks a number of tools in
    early userspace.

    Since we don't know how many userspace programs are affected,
    make sure that FSID is at least 1.
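
    One way to guarantee that an ID allocator never hands out 0 is simply
    to start the search at 1. The sketch below uses a fixed-size table and
    hypothetical names; it does not reproduce how the kernel allocates
    anonymous device numbers.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IDS 64

static bool id_in_use[MAX_IDS];

/* Return the smallest free ID that is at least 1, or -1 if none. */
static int alloc_anon_id(void)
{
    for (int id = 1; id < MAX_IDS; id++) {
        if (!id_in_use[id]) {
            id_in_use[id] = true;
            return id;
        }
    }
    return -1;
}

static void free_anon_id(int id)
{
    assert(id >= 1 && id < MAX_IDS);
    id_in_use[id] = false;
}
```

    Slot 0 is never touched, so callers that treat 0 as "no device" keep
    working, at the cost of one unused ID.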

    References: http://article.gmane.org/gmane.linux.kernel/1666905
    References: http://permalink.gmane.org/gmane.linux.utilities.util-linux-ng/8557
    Cc: 3.14
    Signed-off-by: Thomas Bächler
    Acked-by: Tejun Heo
    Acked-by: H. Peter Anvin
    Tested-by: Alexandre Demers
    Signed-off-by: Greg Kroah-Hartman

    Thomas Bächler
     
  • In SMB2_set_compression(), the "res_key" variable is only initialized to NULL
    and later kfreed. It is therefore useless and should be removed.

    Found with the following semantic patch:

    @@
    identifier foo;
    identifier f;
    type T;
    @@
    * f(...) {
    ...
    * T *foo = NULL;
    ... when forall
    when != foo
    * kfree(foo);
    ...
    }

    Signed-off-by: Cyril Roelandt
    Signed-off-by: Steve French

    Cyril Roelandt
     
  • xfstest 020 detected a problem with cifs xattr handling. When a file
    had an empty xattr list, we returned success (with an empty xattr value)
    on query of particular xattrs rather than returning ENODATA.
    This patch fixes it so that query of an xattr returns ENODATA when the
    xattr list is empty for the file.

    Signed-off-by: Steve French
    Reviewed-by: Jeff Layton

    Steve French
     
  • Problem reported in Red Hat bz 1040329 for strict writes where we cache
    only when we hold oplock and write direct to the server when we don't.

    When we receive an oplock break, we first change the oplock value for
    the inode in cifsInodeInfo->oplock to indicate that we no longer hold
    the oplock before we enqueue a task to flush changes to the backing
    device. Once we have completed flushing the changes, we return the
    oplock to the server.

    There are 2 ways here where we can have data corruption
    1) While we flush changes to the backing device as part of the oplock
    break, we can have processes write to the file. These writes check for
    the oplock, find none and attempt to write directly to the server.
    These direct writes made while we are flushing from cache could be
    overwritten by data being flushed from the cache causing data
    corruption.
    2) While a thread runs in cifs_strict_writev, the machine could receive
    and process an oplock break after the thread has checked the oplock and
    found that it allows us to cache and before we have made changes to the
    cache. In that case, we end up with a dirty page in cache when we
    shouldn't have any. This will be flushed later and will overwrite all
    subsequent writes to the part of the file represented by this page.

    Before making any writes to the server, we need to confirm that we are
    not in the process of flushing data to the server and if we are, we
    should wait until the process is complete before we attempt the write.
    We should also wait for existing writes to complete before we process
    an oplock break request which changes oplock values.

    We add a version specific downgrade_oplock() operation to allow for
    differences in the oplock values set for the different smb versions.

    Cc: stable@vger.kernel.org
    Signed-off-by: Sachin Prabhu
    Reviewed-by: Jeff Layton
    Reviewed-by: Pavel Shilovsky
    Signed-off-by: Steve French

    Sachin Prabhu
     

15 Apr, 2014

1 commit

  • With bigalloc enabled we must use EXT4_CLUSTERS_PER_GROUP() instead of
    EXT4_BLOCKS_PER_GROUP() otherwise we will go beyond the allocated buffer.

    $ mount -t ext4 /dev/vde /vde
    [ 70.573993] EXT4-fs DEBUG (fs/ext4/mballoc.c, 2346): ext4_mb_alloc_groupinfo:
    [ 70.575174] allocated s_groupinfo array for 1 meta_bg's
    [ 70.576172] EXT4-fs DEBUG (fs/ext4/super.c, 2092): ext4_check_descriptors:
    [ 70.576972] Checking group descriptorsBUG: unable to handle kernel paging request at ffff88006ab56000
    [ 72.463686] IP: [] __bitmap_weight+0x2a/0x7f
    [ 72.464168] PGD 295e067 PUD 2961067 PMD 7fa8e067 PTE 800000006ab56060
    [ 72.464738] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 72.465139] Modules linked in:
    [ 72.465402] CPU: 1 PID: 3560 Comm: mount Tainted: G W 3.14.0-rc2-00069-ge57bce1 #60
    [ 72.466079] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 72.466505] task: ffff88007ce6c8a0 ti: ffff88006b7f0000 task.ti: ffff88006b7f0000
    [ 72.466505] RIP: 0010:[] [] __bitmap_weight+0x2a/0x7f
    [ 72.466505] RSP: 0018:ffff88006b7f1c00 EFLAGS: 00010206
    [ 72.466505] RAX: 0000000000000000 RBX: 000000000000050a RCX: 0000000000000040
    [ 72.466505] RDX: 0000000000000000 RSI: 0000000000080000 RDI: 0000000000000000
    [ 72.466505] RBP: ffff88006b7f1c28 R08: 0000000000000002 R09: 0000000000000000
    [ 72.466505] R10: 000000000000babe R11: 0000000000000400 R12: 0000000000080000
    [ 72.466505] R13: 0000000000000200 R14: 0000000000002000 R15: ffff88006ab55000
    [ 72.466505] FS: 00007f43ba1fa840(0000) GS:ffff88007f800000(0000) knlGS:0000000000000000
    [ 72.466505] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 72.466505] CR2: ffff88006ab56000 CR3: 000000006b7e6000 CR4: 00000000000006e0
    [ 72.466505] Stack:
    [ 72.466505] ffff88006ab65000 0000000000000000 0000000000000000 0000000000010000
    [ 72.466505] ffff88006ab6f400 ffff88006b7f1c58 ffffffff81396bb8 0000000000010000
    [ 72.466505] 0000000000000000 ffff88007b869a90 ffff88006a48a000 ffff88006b7f1c70
    [ 72.466505] Call Trace:
    [ 72.466505] [] memweight+0x5f/0x8a
    [ 72.466505] [] ext4_count_free+0x13/0x21
    [ 72.466505] [] ext4_count_free_clusters+0xdb/0x171
    [ 72.466505] [] ext4_fill_super+0x117c/0x28ef
    [ 72.466505] [] ? vsnprintf+0x1c7/0x3f7
    [ 72.466505] [] mount_bdev+0x145/0x19c
    [ 72.466505] [] ? ext4_calculate_overhead+0x2a1/0x2a1
    [ 72.466505] [] ext4_mount+0x15/0x17
    [ 72.466505] [] mount_fs+0x67/0x150
    [ 72.466505] [] vfs_kern_mount+0x64/0xde
    [ 72.466505] [] do_mount+0x6fe/0x7f5
    [ 72.466505] [] ? strndup_user+0x3a/0xd9
    [ 72.466505] [] SyS_mount+0x85/0xbe
    [ 72.466505] [] tracesys+0xdd/0xe2
    [ 72.466505] Code: c3 89 f0 b9 40 00 00 00 55 99 48 89 e5 41 57 f7 f9 41 56 49 89 ff 41 55 45 31 ed 41 54 41 89 f4 53 31 db 41 89 c6 45 39 ee 7e 10 8b 3c ef 49 ff c5 e8 bf ff ff ff 01 c3 eb eb 31 c0 45 85 f6
    [ 72.466505] RIP [] __bitmap_weight+0x2a/0x7f
    [ 72.466505] RSP
    [ 72.466505] CR2: ffff88006ab56000
    [ 72.466505] ---[ end trace 7d051a08ae138573 ]---
    Killed
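
    The buffer overrun follows from how many bits of the bitmap are valid.
    The sketch below uses hypothetical helpers and illustrative numbers
    (524288 blocks per group, a cluster ratio of 16, a 4096-byte bitmap
    buffer); none of these are claimed to be the exact values of the
    filesystem in the trace.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of bitmap needed for one bit per allocation unit. */
static size_t bitmap_bytes(unsigned long units_per_group)
{
    return units_per_group / 8;
}

/* With bigalloc, one bitmap bit covers a whole cluster of blocks. */
static unsigned long clusters_per_group(unsigned long blocks_per_group,
                                        unsigned cluster_ratio)
{
    assert(cluster_ratio > 0);
    return blocks_per_group / cluster_ratio;
}
```

    With these numbers the bitmap holds 32768 valid bits, which is 4096
    bytes. Sizing the scan by blocks instead of clusters asks for 65536
    bytes and runs far past the end of a 4096-byte buffer.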

    Signed-off-by: "Theodore Ts'o"

    Azat Khuzhin
     

14 Apr, 2014

7 commits

  • When we are zeroing space and it is covered by a delalloc range, we
    need to punch the delalloc range out before we truncate the page
    cache. Failing to do so leaves an inconsistency between the page
    cache and the extent tree, which we later trip over when doing
    direct IO over the same range.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Similar to the write_begin problem, xfs_vm_write_end will truncate
    back to the old EOF, potentially removing page cache from over the
    top of delalloc blocks with valid data in them. Fix this by
    truncating back to just the start of the failed write.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • If we fail a write beyond EOF and have to handle it in
    xfs_vm_write_begin(), we truncate the inode back to the current inode
    size. This doesn't take into account the fact that we may have
    already made successful writes to the same page (in the case of block
    size < page size) and hence we can truncate the page cache away from
    blocks with valid data in them. If these blocks are delayed
    allocation blocks, we now have a mismatch between the page cache and
    the extent tree, and this will trigger - at minimum - a delayed
    block count mismatch assert when the inode is evicted from the cache.
    We can also trip over it when block mapping for direct IO - this is
    the most common symptom seen from fsx and fsstress when run from
    xfstests.

    Fix it by only truncating away the exact range we are updating state
    for in this write_begin call.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • When a write fails, if we don't clear the delalloc flags from the
    buffers over the failed range, they can persist beyond EOF and cause
    problems. writeback will see the pages in the page cache, see they
    are dirty and continually retry the write, assuming that the page
    beyond EOF is just racing with a truncate. The page will eventually
    be released due to some other operation (e.g. direct IO), and it
    will not pass through invalidation because it is dirty. Hence it
    will be released with buffer_delay set on it, and trigger warnings
    in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
    because invalidation failed and we didn't write the correct amount.

    This causes failures on block size < page size filesystems in fsx
    and fsstress workloads run by xfstests.

    Fix it by completely trashing any state on the buffer that could be
    used to imply that it contains valid data when the delalloc range
    over the buffer is punched out during the failed write handling.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • On 32 bit, size_t is "unsigned int", not "unsigned long", causing the
    following warning when comparing with PAGE_SIZE, which is always "unsigned
    long":

    fs/cifs/file.c: In function ‘cifs_readdata_to_iov’:
    fs/cifs/file.c:2757: warning: comparison of distinct pointer types lacks a cast

    Introduced by commit 7f25bba819a3 ("cifs_iovec_read: keep iov_iter
    between the calls of cifs_readdata_to_iov()"), which changed the
    signedness of "remaining" and the code from min_t() to min().
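
    The difference between the two macros can be sketched in userspace C.
    This is a simplified illustration using GNU C extensions, not the
    kernel's actual definitions or the cifs code: demo_min(), demo_min_t(),
    DEMO_PAGE_SIZE and demo_copy_len() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's PAGE_SIZE, which has type unsigned long. */
#define DEMO_PAGE_SIZE 4096UL

/*
 * Simplified type-checked min(): comparing the addresses of the two
 * temporaries makes the compiler warn when the operand types differ.
 */
#define demo_min(x, y) ({                 \
        __typeof__(x) _x = (x);           \
        __typeof__(y) _y = (y);           \
        (void)(&_x == &_y);               \
        _x < _y ? _x : _y; })

/* min_t() casts both operands to one explicit type before comparing. */
#define demo_min_t(type, x, y) ({         \
        type _x = (type)(x);              \
        type _y = (type)(y);              \
        _x < _y ? _x : _y; })

/* Bytes to take from one page: at most a page, at most what is left. */
static size_t demo_copy_len(size_t remaining)
{
        /*
         * demo_min(remaining, DEMO_PAGE_SIZE) would trigger the "distinct
         * pointer types" warning wherever size_t is unsigned int.
         */
        return demo_min_t(size_t, remaining, DEMO_PAGE_SIZE);
}
```

    Naming the type explicitly keeps the comparison well defined on both
    32-bit and 64-bit builds, at the cost of losing the automatic type check.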

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • There are some places where logic guarantees that the extent we are
    searching for exists, but this may not be true due to on-disk data
    corruption. If such corruption happens we must prevent possible
    null pointer dereferences.
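
    The defensive pattern can be sketched as follows. This is a userspace
    illustration only: struct demo_extent, demo_find_extent() and
    demo_extent_len() are hypothetical and do not mirror the ext4 data
    structures or the signature of ext4_ext_find_extent(); returning -EIO
    is likewise an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified extent record; not the ext4 on-disk layout. */
struct demo_extent {
        unsigned int start;   /* first logical block covered */
        unsigned int len;     /* number of blocks covered */
};

/* Return the extent covering 'block', or NULL if none does. */
static const struct demo_extent *
demo_find_extent(const struct demo_extent *tbl, size_t n, unsigned int block)
{
        for (size_t i = 0; i < n; i++)
                if (block >= tbl[i].start && block - tbl[i].start < tbl[i].len)
                        return &tbl[i];
        return NULL;
}

/*
 * The caller "knows" the extent exists, but checks anyway: with a
 * corrupted table the lookup can fail, and dereferencing NULL would crash.
 */
static int demo_extent_len(const struct demo_extent *tbl, size_t n,
                           unsigned int block, unsigned int *len)
{
        const struct demo_extent *ex = demo_find_extent(tbl, n, block);

        if (!ex)
                return -EIO;    /* report the inconsistency to the caller */
        *len = ex->len;
        return 0;
}
```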

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Fix error handling by adding some. :-)

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

13 Apr, 2014

4 commits

  • This fixes the following sparse warning:

    CHECK fs/ext4/mballoc.c
    fs/ext4/mballoc.c:5019:9: warning: context imbalance in
    'ext4_trim_extent' - unexpected unlock

    Signed-off-by: "Jon Ernst"
    Signed-off-by: "Theodore Ts'o"

    jon ernst
     
  • Unfortunately, we weren't checking to make sure the inode was
    extent-based before attempting to operate on it. Hilarity ensues.

    Signed-off-by: "Theodore Ts'o"
    Cc: Namjae Jeon

    Theodore Ts'o
     
  • Pull yet more networking updates from David Miller:

    1) Various fixes to the new Redpine Signals wireless driver, from
    Fariya Fatima.

    2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
    Dmitry Petukhov.

    3) UFO and TSO packets differ in whether they include the protocol
    header in gso_size, account for that in skb_gso_transport_seglen().
    From Florian Westphal.

    4) If VLAN untagging fails, we double free the SKB in the bridging
    output path. From Toshiaki Makita.

    5) Several call sites of sk->sk_data_ready() were referencing an SKB
    just added to the socket receive queue in order to calculate the
    second argument via skb->len. This is dangerous because the moment
    the skb is added to the receive queue it can be consumed in another
    context and freed up.

    It turns out also that none of the sk->sk_data_ready()
    implementations even care about this second argument.

    So just kill it off and thus fix all these use-after-free bugs as a
    side effect.

    6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.

    7) pktgen needs to do locking properly for LLTX devices, from Daniel
    Borkmann.

    8) xen-netfront driver initializes TX array entries in RX loop :-) From
    Vincenzo Maffione.

    9) After refactoring, some tunnel drivers allow a tunnel to be
    configured on top of itself. Fix from Nicolas Dichtel.
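
    The use-after-free described in item 5 can be sketched in userspace C.
    All names here (struct demo_buf, struct demo_sock, demo_enqueue(),
    demo_deliver()) are hypothetical stand-ins, not the kernel's skb or
    socket API; the point is only that nothing touches the buffer once the
    queue owns it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an skb and a socket receive queue. */
struct demo_buf {
        size_t len;
        struct demo_buf *next;
};

struct demo_sock {
        struct demo_buf *head;                    /* receive queue */
        int wakeups;                              /* times data_ready ran */
        void (*data_ready)(struct demo_sock *sk); /* no length argument */
};

static void demo_data_ready(struct demo_sock *sk)
{
        sk->wakeups++;
}

/* After this call the queue owns 'buf'; a consumer may free it at any time. */
static void demo_enqueue(struct demo_sock *sk, struct demo_buf *buf)
{
        buf->next = sk->head;
        sk->head = buf;
}

static void demo_deliver(struct demo_sock *sk, struct demo_buf *buf)
{
        demo_enqueue(sk, buf);
        /*
         * The buggy pattern was the equivalent of
         *     sk->data_ready(sk, buf->len);
         * which reads 'buf' after ownership has passed to the queue.
         * Without the length argument there is nothing left to read.
         */
        sk->data_ready(sk);
}
```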

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    vti: don't allow to add the same tunnel twice
    gre: don't allow to add the same tunnel twice
    drivers: net: xen-netfront: fix array initialization bug
    pktgen: be friendly to LLTX devices
    r8152: check RTL8152_UNPLUG
    net: sun4i-emac: add promiscuous support
    net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    net: ipv6: Fix oif in TCP SYN+ACK route lookup.
    drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
    drivers: net: cpsw: discard all packets received when interface is down
    net: Fix use after free by removing length arg from sk_data_ready callbacks.
    Drivers: net: hyperv: Address UDP checksum issues
    Drivers: net: hyperv: Negotiate suitable ndis version for offload support
    Drivers: net: hyperv: Allocate memory for all possible per-pecket information
    bridge: Fix double free and memory leak around br_allowed_ingress
    bonding: Remove debug_fs files when module init fails
    i40evf: program RSS LUT correctly
    i40evf: remove open-coded skb_cow_head
    ixgb: remove open-coded skb_cow_head
    igbvf: remove open-coded skb_cow_head
    ...

    Linus Torvalds
     
  • The vfs merge caused a latent bug to show up:

    In file included from fs/ceph/super.h:4:0,
    from fs/ceph/ioctl.c:3:
    include/linux/ceph/ceph_debug.h:4:0: warning: "pr_fmt" redefined [enabled by default]
    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
    ^
    In file included from include/linux/kernel.h:13:0,
    from include/linux/uio.h:12,
    from include/linux/socket.h:7,
    from include/uapi/linux/in.h:22,
    from include/linux/in.h:23,
    from fs/ceph/ioctl.c:1:
    include/linux/printk.h:214:0: note: this is the location of the previous definition
    #define pr_fmt(fmt) fmt
    ^

    where the reason is that <linux/ceph/ceph_debug.h> is included much too
    late for the "pr_fmt()" define.

    The include of <linux/ceph/ceph_debug.h> needs to be the first include in
    the file, but fs/ceph/ioctl.c had for some reason missed that, and it
    wasn't noticeable until some unrelated header file changes brought in an
    indirect earlier include of <linux/printk.h>.
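
    Why ordering matters can be shown with a self-contained sketch. The
    fallback block below imitates what a printk.h-like header does; the
    "ceph: " literal stands in for KBUILD_MODNAME ": ", and demo_pr_info()
    and demo_log() are hypothetical helpers that format into a buffer
    instead of calling printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Must come before the header that supplies the fallback definition. */
#define pr_fmt(fmt) "ceph: " fmt

/* --- what a printk.h-like header would contain (simplified) --- */
#ifndef pr_fmt
#define pr_fmt(fmt) fmt         /* fallback: no prefix */
#endif
#define demo_pr_info(buf, size, fmt, ...) \
        snprintf(buf, size, pr_fmt(fmt), __VA_ARGS__)
/* --------------------------------------------------------------- */

/* Format a message; the prefix comes from whichever pr_fmt() won. */
static int demo_log(char *buf, size_t size, int err)
{
        return demo_pr_info(buf, size, "ioctl failed: %d\n", err);
}
```

    If the custom define came after the header section instead, the
    fallback would already be in place and the second #define would
    produce the same "redefined" warning quoted above.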

    Signed-off-by: Linus Torvalds

    Linus Torvalds