21 Jun, 2012

2 commits

  • do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
    put_user(clear_child_tid) which can update task->rss_stat and thus make
    mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm().

    Let's fix this bug in the safest way, and optimize/cleanup this later.

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A gc-inode is a pseudo inode used to buffer the blocks to be moved by
    garbage collection.

    Block caches of gc-inodes must be cleared every time a garbage collection
    function (nilfs_clean_segments) completes. Otherwise, stale blocks
    buffered in the caches may be wrongly reused in successive calls of the GC
    function.

    For user files, this is not a problem because their gc-inodes are
    distinguished by a checkpoint number as well as an inode number. They
    never buffer different blocks if either an inode number, a checkpoint
    number, or a block offset differs.

    However, gc-inodes of sufile, cpfile and DAT file can store different data
    for the same block offset. Thus, the nilfs_clean_segments function can
    move an incorrect block for these meta-data files if an old block is cached.
    I found that this really does cause meta-data corruption in nilfs.

    This fixes the issue by ensuring that the caches of gc-inodes are cleared,
    and resolves reported GC problems including checkpoint file corruption,
    b-tree corruption, and the following warning during GC.

    nilfs_palloc_freev: entry number 307234 already freed.
    ...

    Signed-off-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
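
A minimal sketch of the failure mode described in this message, not the nilfs code itself: a toy one-slot block cache keyed only by inode number and block offset, which is the situation the message describes for the meta-data files. All names here (`gc_cache`, `gc_read_block`, `gc_clear_cache`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-slot cache standing in for a gc-inode's block cache. */
struct gc_cache {
    bool valid;
    uint64_t ino;
    uint64_t offset;
    uint64_t data;
};

/*
 * Return the block for (ino, offset). On a miss the cache is filled from
 * on_disk; on a hit the cached copy is returned, even if on_disk changed.
 */
uint64_t gc_read_block(struct gc_cache *c, uint64_t ino, uint64_t offset,
                       uint64_t on_disk)
{
    if (c->valid && c->ino == ino && c->offset == offset)
        return c->data;

    c->valid = true;
    c->ino = ino;
    c->offset = offset;
    c->data = on_disk;
    return c->data;
}

/* What each GC pass has to do before it returns. */
void gc_clear_cache(struct gc_cache *c)
{
    c->valid = false;
}
```

In this toy model, a second pass that asks for the same (ino, offset) without an intervening `gc_clear_cache()` gets the first pass's data, which is the kind of stale reuse the message reports.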
     

17 Jun, 2012

1 commit


16 Jun, 2012

5 commits

  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - Fix a couple of mount regressions due to the recent cleanups.
    - Fix an Oops in the open recovery code
    - Fix an rpc_pipefs upcall hang that results from some of the net
    namespace work from 3.4.x (stable kernel candidate).
    - Fix a couple of write and o_direct regressions that were found at
    last week's Bakeathon testing event in Ann Arbor."

    * tag 'nfs-for-3.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: add an endian notation for sparse
    NFSv4.1: integer overflow in decode_cb_sequence_args()
    rpc_pipefs: allow rpc_purge_list to take a NULL waitq pointer
    NFSv4 do not send an empty SETATTR compound
    NFSv2: EOF incorrectly set on short read
    NFS: Use the NFS_DEFAULT_VERSION for v2 and v3 mounts
    NFS: fix directio refcount bug on commit
    NFSv4: Fix unnecessary delegation returns in nfs4_do_open
    NFSv4.1: Convert another trivial printk into a dprintk
    NFS4: Fix open bug when pnfs module blacklisted
    NFS: Remove incorrect BUG_ON in nfs_found_client
    NFS: Map minor mismatch error to protocol not support error.
    NFS: Fix a commit bug
    NFS4: Set parsed mount data version to 4
    NFSv4.1: Ensure we clear session state flags after a session creation
    NFSv4.1: Convert a trivial printk into a dprintk
    NFSv4: Fix up decode_attr_mdsthreshold
    NFSv4: Fix an Oops in the open recovery code
    NFSv4.1: Fix a request leak on the back channel

    Linus Torvalds
     
  • Pull two nfsd bugfixes from J. Bruce Fields.

    * 'for-3.5' of git://linux-nfs.org/~bfields/linux:
    nfsd4: BUG_ON(!is_spin_locked()) no good on UP kernels
    NFS: hard-code init_net for NFS callback transports

    Linus Torvalds
     
  • Avoid warning on 32-bit machines

    Signed-off-by: Chris Mason

    Chris Mason
     
  • gcc was giving an uninit variable warning here. Strictly
    speaking we don't need to init it, but this will make things
    much less error prone.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Pull btrfs update from Chris Mason:
    "The dates look like I had to rebase this morning because there was a
    compiler warning for a printk arg that I had missed earlier.

    These are all fixes, including one to prevent using stale pointers for
    device names, and lots of fixes around transaction abort cleanups
    (Josef, Liu Bo).

    Jan Schmidt also sent in a number of fixes for the new reference
    number tracking code.

    Liu Bo beat me to updating the MAINTAINERS file. Since he thought to
    also fix the git url, I kept his commit."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
    Btrfs: update MAINTAINERS info for BTRFS FILE SYSTEM
    Btrfs: destroy the items of the delayed inodes in error handling routine
    Btrfs: make sure that we've made everything in pinned tree clean
    Btrfs: avoid memory leak of extent state in error handling routine
    Btrfs: do not resize a seeding device
    Btrfs: fix missing inherited flag in rename
    Btrfs: fix incompat flags setting
    Btrfs: fix defrag regression
    Btrfs: call filemap_fdatawrite twice for compression
    Btrfs: keep inode pinned when compressing writes
    Btrfs: implement ->show_devname
    Btrfs: use rcu to protect device->name
    Btrfs: unlock everything properly in the error case for nocow
    Btrfs: fix btrfs_destroy_marked_extents
    Btrfs: abort the transaction if the commit fails
    Btrfs: wake up transaction waiters when aborting a transaction
    Btrfs: fix locking in btrfs_destroy_delayed_refs
    Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error
    Btrfs: fix race in tree mod log addition
    Btrfs: add btrfs_next_old_leaf
    ...

    Linus Torvalds
     

15 Jun, 2012

25 commits

  • The items of the delayed inodes were never freed. This patch
    fixes that.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Since we have two trees for recording pinned extents, we need to go through
    both of them to make sure that everything has been cleaned up.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • We've forgotten to clear extent states in the pinned tree, which results in
    a space counter mismatch and a memory leak:

    WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]()
    ...
    space_info 2 has 8380416 free, is not full
    space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304
    btrfs state leak: start 29364224 end 29376511 state 1 in tree ffff880075f20090 refs 1
    ...

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Seeding devices are not supposed to change any more.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • When we move a file into a directory with the compression flag set, we need
    to inherit BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
    But if we move a file into a directory without the compression flag, we
    need to clear both of them.

    This is how our setflags deals with the compression flag, so keep
    the same behaviour here.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
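
The two cases in this message can be sketched as a small flag update. This is an illustration under stated assumptions, not the patch: the function name `inherit_compress_flags` is hypothetical, and the bit positions are placeholders chosen for the example rather than values taken from the message.

```c
#include <stdint.h>

/* Placeholder bit positions for the two inode flags named in the message. */
#define INODE_NOCOMPRESS (1u << 2)
#define INODE_COMPRESS   (1u << 11)

/*
 * Return the flags a moved inode should carry, given the flags of the
 * directory it was moved into.
 */
uint32_t inherit_compress_flags(uint32_t inode_flags, uint32_t dir_flags)
{
    if (dir_flags & INODE_COMPRESS) {
        /* Target directory compresses: inherit COMPRESS, drop NOCOMPRESS. */
        inode_flags |= INODE_COMPRESS;
        inode_flags &= ~INODE_NOCOMPRESS;
    } else {
        /* Target directory has no compression flag: clear both. */
        inode_flags &= ~(INODE_COMPRESS | INODE_NOCOMPRESS);
    }
    return inode_flags;
}
```

Unrelated flag bits pass through unchanged in both branches.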
     
  • Chris Mason
     
  • It's a bug, but it happens to work, as BTRFS_COMPRESS_LZO == 2, which
    has only one bit set.

    Signed-off-by: Li Zefan

    Li Zefan
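
The message does not show the offending expression. One plausible reading, offered here purely as an assumption, is that an enumerated compression type was tested with a bitwise AND instead of being compared for equality. The sketch below shows why that happens to work while the tested value has a single bit set. Only the value 2 for LZO comes from the message; the other enumerators and both function names are invented for illustration.

```c
#include <stdbool.h>

/* Illustrative enumeration; only LZO == 2 is taken from the message. */
enum compress_type {
    COMPRESS_NONE = 0,
    COMPRESS_ZLIB = 1,
    COMPRESS_LZO = 2,
    COMPRESS_HYPOTHETICAL = 3 /* invented to show where '&' breaks */
};

/* Buggy style: treats an enumerated value as if it were a bit mask. */
bool is_lzo_by_mask(enum compress_type t)
{
    return (t & COMPRESS_LZO) != 0;
}

/* Correct style: compares the value itself. */
bool is_lzo_by_value(enum compress_type t)
{
    return t == COMPRESS_LZO;
}
```

For the values 0, 1 and 2 the two checks agree, which is why such a bug can go unnoticed. A value such as 3 shares the bit with 2, so the mask test would report LZO where the comparison would not.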
     
  • If a file has 3 small extents:

    | ext1 | ext2 | ext3 |

    Running "btrfs fi defrag" will only defrag the last two extents, if those
    extent mappings haven't been read into memory from disk.

    This bug was introduced by commit 17ce6ef8d731af5edac8c39e806db4c7e1f6956f
    ("Btrfs: add a check to decide if we should defrag the range")

    The cause is that the commit examined the previous and next extents using
    lookup_extent_mapping() only.

    While at it, remove the code that checks the previous extent, since
    it's sufficient to check the next extent.

    Signed-off-by: Li Zefan

    Li Zefan
     
  • I removed this in an earlier commit and I was wrong. Because compression
    can return from filemap_fdatawrite() without having actually set any of its
    pages as writeback() it can make filemap_fdatawait() do essentially nothing,
    and then we won't find any ordered extents because they may not have been
    created yet. So not only does this make fsync() completely useless, but it
    will also screw up if you truncate on a non-page aligned offset since we
    zero out the end and then wait on ordered extents and then call drop caches.
    We can drop the cache before the io completes, and then when we try to unpin
    the extent we just wrote we won't find it and everything goes sideways. So fix
    this by putting it back and put a giant comment there to keep me from trying
    to remove it in the future. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • A user reported lots of problems using compression on the new code and it
    turns out part of the problem was that igrab() was failing when we added a
    new ordered extent. This is because when writing out an inode under
    compression we immediately return without actually doing anything to the
    pages, and then in another thread at some point down the line actually do
    the ordered dance. The problem is between the point that we start writeback
    and we actually add the ordered extent we could be trying to reclaim the
    inode, which makes igrab() return NULL. So we need to do an igrab() when we
    create the async extent and then drop it when we are done with it. This
    makes sure we stay pinned in memory until the ordered extent can get a
    reference on it and we are good to go. With this patch we no longer panic
    in btrfs_finish_ordered_io(). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Because btrfs can remove the device that was mounted we need to have a
    ->show_devname so that in this case we can print out some other device in
    the file system to /proc/mounts. So if there are multiple devices in a btrfs
    file system we will just print the device with the lowest devid that we can
    find. This will make everything consistent and deal with device removal
    properly. The drawback is if you mount with a device that is higher than
    the lowest devid it won't show up as the mounted device in /proc/mounts,
    but this is a small price to pay. This was inspired by Miao Xie's patch.
    Thanks,

    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Al pointed out that we can just toss out the old name on a device and add a
    new one arbitrarily, so anybody who uses device->name in printk could
    possibly use free'd memory. Instead of adding locking around all of this he
    suggested doing it with RCU, so I've introduced a struct rcu_string that
    does just that and have gone through and protected all accesses to
    device->name that aren't under the uuid_mutex with rcu_read_lock(). This
    protects us and I will use it for dealing with removing the device that we
    used to mount the file system in a later patch. Thanks,

    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I was getting hung on umount when a transaction was aborted because a range
    of one of the free space inodes was still locked. This is because the nocow
    stuff doesn't unlock anything on error. This patch fixes the problem, and I
    verified that this is what was happening. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • So we're forcing the eb's to have their ref count set to 1 so invalidatepage
    works, but this breaks lots of things, for example root nodes, and is just
    plain wrong; we don't need to just evict all of this stuff. Also drop the
    invalidatepage altogether and add a page_cache_release(). With this patch
    we no longer hang when trying to access the root nodes after an aborted
    transaction and we no longer leak memory. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • If a transaction commit fails we don't abort it so we don't set an error on
    the file system. This patch fixes that by actually calling the abort stuff
    and then adding a check for a fs error in the transaction start stuff to
    make sure it is caught properly. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I was getting lots of hung tasks and a NULL pointer dereference because we
    are not cleaning up the transaction properly when it aborts. First we need
    to reset the running_transaction to NULL so we don't get a bad dereference
    for any start_transaction callers after this. Also we cannot rely on
    waitqueue_active() since it's just a list_empty(), so just call wake_up()
    directly since that will do the barrier for us and such. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • The transaction abort stuff was throwing warnings from the list debugging
    code because we do a list_del_init outside of the delayed_refs spin lock.
    The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
    but we need to take the ref head mutex to make sure it's not being processed
    currently, and so if it is we need to drop the spin lock and then take and
    drop the mutex and do the search again. If we can take the mutex then we
    can safely remove the head from the list and carry on. Now when the
    transaction aborts I don't get the list debugging warnings. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • While doing my enospc work I got a transaction abort that resulted in a
    panic when we tried to unlock_page() an already unlocked page. This is
    because we aren't calling extent_clear_unlock_delalloc with the locked page
    so it was unlocking all the pages in the range. This is wrong since
    __extent_writepage expects to have the page locked still unless we return
    *page_started as 1. This should keep us from panicking. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Most frequent symptom was a BUG triggering in expire_client, with the
    server locking up shortly thereafter.

    Introduced by 508dc6e110c6dbdc0bbe84298ccfe22de7538486 "nfsd41:
    free_session/free_client must be called under the client_lock".

    Cc: stable@kernel.org
    Cc: Benny Halevy
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • When the mount namespace is destroyed on child reaper exit, nsproxy has
    already been zeroed by that point, so dereferencing it is invalid.
    This patch hard-codes "init_net" for all network namespace references for NFS
    callback services. This will be fixed with proper NFS callback
    containerization.

    Signed-off-by: Stanislav Kinsbursky
    Signed-off-by: J. Bruce Fields

    Stanislav Kinsbursky
     
  • When adding to the tree modification log, we grab two locks at different
    stages. We must not drop the outer lock until we're done with the section
    protected by the inner lock. This moves the unlock call for the outer lock
    to the appropriate position.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     
  • To make sense of the tree mod log, the backref walker not only needs
    btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was
    calling btrfs_search_slot. This obviously didn't give the correct result.

    This commit adds btrfs_next_old_leaf, a drop-in replacement for
    btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
    like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
    with this time_seq parameter.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
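
The dispatch on `time_seq` described in this message can be sketched as follows. This is a stand-alone illustration, not the btrfs code: the functions only report which search routine would be used, and every name here is hypothetical. Keeping the original entry point as a wrapper that passes zero is an assumption about how a drop-in replacement could be wired up; the message does not say so.

```c
#include <stdint.h>

/* Hypothetical tags for the two search routines named in the message. */
enum search_used { USED_SEARCH_SLOT, USED_SEARCH_OLD_SLOT };

/* time_seq == 0 means "current tree"; non-zero selects the logged old state. */
enum search_used next_old_leaf(uint64_t time_seq)
{
    if (time_seq)
        return USED_SEARCH_OLD_SLOT;
    return USED_SEARCH_SLOT;
}

/* Assumed wiring: the original entry point forwards with time_seq = 0. */
enum search_used next_leaf(void)
{
    return next_old_leaf(0);
}
```

Callers that never deal with the tree mod log keep calling the plain variant and see no change in behaviour.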
     
  • In __tree_mod_log_oldest_root() we must return the found operation even if
    it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there
    are no operations to be rewound and returns immediately.

    The code in the caller is modified to improve readability.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     
  • get_old_root could race with root node updates because we weren't locking
    the node early enough. Use btrfs_read_lock_root_node to grab the root locked
    in the very beginning and release the lock as soon as possible (just like
    btrfs_search_slot does).

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     
  • When resolving indirect refs, we used to call btrfs_next_leaf in case we
    didn't find an exact match. While we should find exact matches most of the
    time, in case we don't, we must continue searching. Treating those matches
    differently depending on the level we're searching doesn't make sense.

    Even worse, we might end up searching for a key larger than the largest, in
    which case there is no next_leaf and subsequent jobs would fail. This commit
    drops the bogus lines.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     

12 Jun, 2012

6 commits


10 Jun, 2012

1 commit

  • Older versions of nfs utils don't always pass a "vers=" mount option for
    NFS. This could lead to attempts at using NFS v0 due to a zeroed out
    nfs_parsed_mount_data struct. I solve this by setting the default NFS
    version to NFS_DEFAULT_VERSION in the v2 and v3 cases (v4 has already been
    taken care of by a similar patch).

    Reported-by: Joerg Roedel
    Signed-off-by: Bryan Schumaker
    Signed-off-by: Trond Myklebust

    Bryan Schumaker
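
A minimal sketch of the idea in this message, assuming a much simplified option parser: the version field is given a default before the option string is scanned, so a missing "vers=" no longer leaves it at zero. The struct, the function, and the default value of 3 are all hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed default for illustration; the message does not give the value. */
#define DEFAULT_VERSION 3u

/* Hypothetical stand-in for the parsed mount data. */
struct parsed_mount_data {
    unsigned int version;
};

/* Parse a comma-separated option string, looking only at "vers=". */
void parse_mount_options(struct parsed_mount_data *data, const char *opts)
{
    const char *p;

    memset(data, 0, sizeof(*data));
    /* Set the default first so a missing "vers=" cannot leave version 0. */
    data->version = DEFAULT_VERSION;

    p = opts ? strstr(opts, "vers=") : NULL;
    if (p)
        data->version = (unsigned int)strtoul(p + 5, NULL, 10);
}
```

With this ordering, an option string such as "rw,hard" yields the default version, while "rw,vers=2" still selects version 2.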