13 Oct, 2013

4 commits

  • Olga reported that file descriptors opened with O_PATH do not work with
    fstatfs(), found during further development of ksh93's thread support.

    There is no reason to not allow O_PATH file descriptors here (fstatfs is
    very much a path operation), so use "fdget_raw()". See commit
    55815f70147d ("vfs: make O_PATH file descriptors usable for 'fstat()'")
    for a very similar issue reported for fstat() by the same team.

    Reported-and-tested-by: ольга крыжановская
    Acked-by: Al Viro
    Cc: stable@kernel.org # O_PATH introduced in 3.0+
    Signed-off-by: Linus Torvalds

    Linus Torvalds
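
    A minimal user-space sketch of the behaviour described above, assuming a
    Linux system where Python exposes os.O_PATH. The helper name
    statvfs_via_o_path is hypothetical. It returns None where O_PATH is
    unavailable or where the kernel still rejects the descriptor with EBADF.

```python
import errno
import os


def statvfs_via_o_path(path):
    """Query filesystem statistics through an O_PATH descriptor.

    Returns an os.statvfs_result, or None when O_PATH is not available
    or the kernel refuses O_PATH descriptors for this call (EBADF).
    """
    o_path = getattr(os, "O_PATH", None)
    if o_path is None:
        return None  # platform does not define O_PATH
    fd = os.open(path, o_path)
    try:
        return os.fstatvfs(fd)
    except OSError as exc:
        if exc.errno == errno.EBADF:
            return None  # kernel without the fix described above
        raise
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(statvfs_via_o_path("/"))
```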
     
  • Pull ext4 bugfixes from Ted Ts'o:
    "A bug fix and performance regression fix for ext4"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix memory leak in xattr
    ext4: fix performance regression in writeback of random writes

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "We've got more bug fixes in my for-linus branch:

    One of these fixes another corner of the compression oops from last
    time. Miao nailed down some problems with concurrent snapshot
    deletion and drive balancing.

    I kept out one of his patches for more testing, but these are all
    stable"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix oops caused by the space balance and dead roots
    Btrfs: insert orphan roots into fs radix tree
    Btrfs: limit delalloc pages outside of find_delalloc_range
    Btrfs: use right root when checking for hash collision

    Linus Torvalds
     
  • If we take the 2nd retry path in ext4_expand_extra_isize_ea, we
    potentially return from the function without having freed these
    allocations. If we don't do the return, we over-write the previous
    allocation pointers, so we leak either way.

    Spotted with Coverity.

    [ Fixed by tytso to set is and bs to NULL after freeing these
    pointers, in case in the retry loop we later end up triggering an
    error causing a jump to cleanup, at which point we could have a double
    free bug. -- Ted ]

    Signed-off-by: Dave Jones
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen
    Cc: stable@vger.kernel.org

    Dave Jones
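
    The free-then-reset pattern from the bracketed note can be modelled in a
    few lines. This is a toy sketch, not the ext4 code: Pool, expand and the
    handle names are hypothetical, and the finally block stands in for the
    cleanup label.

```python
class Pool:
    """Toy allocator that tracks live handles and detects double frees."""

    def __init__(self):
        self.live = set()
        self._next = 0

    def alloc(self):
        self._next += 1
        self.live.add(self._next)
        return self._next

    def free(self, handle):
        if handle is None:
            return  # freeing "NULL" is a no-op, as with kfree(NULL)
        if handle not in self.live:
            raise RuntimeError("double free")
        self.live.remove(handle)


def expand(pool, retries, fail_after_retry=False):
    """Allocate two buffers per attempt; free and reset them before retrying."""
    is_buf = bs_buf = None
    try:
        while True:
            is_buf = pool.alloc()
            bs_buf = pool.alloc()
            if retries == 0:
                return 0
            retries -= 1
            # Free before retrying so the next attempt does not overwrite
            # (and leak) the previous allocations.
            pool.free(is_buf)
            pool.free(bs_buf)
            # Reset so the cleanup path cannot free them a second time.
            is_buf = bs_buf = None
            if fail_after_retry:
                return -1  # error after the retry, goes through cleanup
    finally:
        pool.free(is_buf)
        pool.free(bs_buf)
```

    Dropping the reset line makes the fail_after_retry path raise
    "double free" in this model.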
     

11 Oct, 2013

4 commits

  • When doing space balance and subvolume destroy at the same time, we met
    the following oops:

    kernel BUG at fs/btrfs/relocation.c:2247!
    RIP: 0010: [] prepare_to_merge+0x154/0x1f0 [btrfs]
    Call Trace:
    [] relocate_block_group+0x466/0x4e6 [btrfs]
    [] btrfs_relocate_block_group+0x143/0x275 [btrfs]
    [] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs]
    [] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs]
    [] ? btrfs_get_token_64+0x7e/0xcd [btrfs]
    [] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs]
    [] btrfs_balance+0x9c7/0xb6f [btrfs]
    [] btrfs_ioctl_balance+0x234/0x2ac [btrfs]
    [] btrfs_ioctl+0xd87/0x1ef9 [btrfs]
    [] ? path_openat+0x234/0x4db
    [] ? __do_page_fault+0x31d/0x391
    [] ? vma_link+0x74/0x94
    [] vfs_ioctl+0x1d/0x39
    [] do_vfs_ioctl+0x32d/0x3e2
    [] SyS_ioctl+0x57/0x83
    [] ? do_page_fault+0xe/0x10
    [] system_call_fastpath+0x16/0x1b

    This happened because we returned an error when the reference count of the
    root was 0 during space relocation. That was wrong: even though the root
    was dead (refs == 0), the space it held still needed to be relocated, or we
    could not remove the block group. So in this case we should return the root
    whether it is dead or not.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Now we don't drop all the deleted snapshots/subvolumes before the space
    balance. It means we have to relocate the space which is held by the dead
    snapshots/subvolumes. So we must insert them into the fs radix tree, or we
    would forget to commit their changes when doing the transaction commit, and
    it would corrupt the metadata.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Liu fixed part of this problem and unfortunately I steered him in slightly the
    wrong direction and so didn't completely fix the problem. The problem is we
    limit the size of the delalloc range we are looking for to max bytes and then we
    try to lock that range. If we fail to lock the pages in that range we will
    shrink the max bytes to a single page and loop again. However, if our first page is
    inside of the delalloc range then we will end up limiting the end of the range
    to a point before our first page. This is illustrated below:

    [0 -------- delalloc range --------- 256mb]
    [page]

    So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
    and then we will notice that delalloc_start < *start and adjust it up, but not
    adjust delalloc_end up, so things go sideways. To fix this we need to not limit
    the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that
    way we don't end up with this confusion. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
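
    The ordering problem can be shown with plain arithmetic. This is a rough
    sketch under assumptions: the function names, the exclusive end offsets
    and the 200mb page position are all hypothetical, chosen only to mirror
    the 0..256mb example in the message.

```python
MB = 1 << 20


def old_lookup(page_start, range_start, range_end, max_bytes):
    """Old order: clamp the end in the search, then move the start up."""
    start = range_start
    end = min(range_end, range_start + max_bytes)  # limit applied too early
    if start < page_start:
        start = page_start  # start is adjusted, end is not
    return start, end


def new_lookup(page_start, range_start, range_end, max_bytes):
    """New order: move the start up first, then clamp relative to it."""
    start, end = range_start, range_end  # search is not limited
    if start < page_start:
        start = page_start
    end = min(end, start + max_bytes)  # limit applied in the caller
    return start, end


if __name__ == "__main__":
    # Delalloc range [0, 256mb), locked page at 200mb, limit of 128mb.
    print(old_lookup(200 * MB, 0, 256 * MB, 128 * MB))
    print(new_lookup(200 * MB, 0, 256 * MB, 128 * MB))
```

    In the old order the returned start lies past the returned end, which is
    the "things go sideways" case.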
     
  • btrfs_rename was using the root of the old dir instead of the root of the new
    dir when checking for a hash collision, so if you tried to move a file into a
    subvol it would freak out because it would see the file you are trying to move
    in its current root. This fixes the bug where this would fail

    btrfs subvol create test1
    btrfs subvol create test2
    mv test1 test2.

    Thanks to Chris Murphy for catching this,

    Cc: stable@vger.kernel.org
    Reported-by: Chris Murphy
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

06 Oct, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This is a small collection of fixes, including a regression fix from
    Liu Bo that solves rare crashes with compression on.

    I've merged my for-linus up to 3.12-rc3 because the top commit is only
    meant for 3.12. The rest of the fixes are also available in my master
    branch on top of my last 3.11 based pull"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: Fix crash due to not allocating integrity data for a bioset
    Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishing
    Btrfs: eliminate races in worker stopping code
    Btrfs: fix crash of compressed writes
    Btrfs: fix transid verify errors when recovering log tree

    Linus Torvalds
     

05 Oct, 2013

13 commits

  • When btrfs creates a bioset, we must also allocate the integrity data pool.
    Otherwise btrfs will crash when it tries to submit a bio to a checksumming
    disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] mempool_alloc+0x4a/0x150
    PGD 2305e4067 PUD 23063d067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: btrfs scsi_debug xfs ext4 jbd2 ext3 jbd mbcache
    sch_fq_codel eeprom lpc_ich mfd_core nfsd exportfs auth_rpcgss af_packet
    raid6_pq xor zlib_deflate libcrc32c [last unloaded: scsi_debug]
    CPU: 1 PID: 4486 Comm: mount Not tainted 3.12.0-rc1-mcsum #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8802451c9720 ti: ffff880230698000 task.ti: ffff880230698000
    RIP: 0010:[] [] mempool_alloc+0x4a/0x150
    RSP: 0018:ffff880230699688 EFLAGS: 00010286
    RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000005f8445
    RDX: 0000000000000001 RSI: 0000000000000010 RDI: 0000000000000000
    RBP: ffff8802306996f8 R08: 0000000000011200 R09: 0000000000000008
    R10: 0000000000000020 R11: ffff88009d6e8000 R12: 0000000000011210
    R13: 0000000000000030 R14: ffff8802306996b8 R15: ffff8802451c9720
    FS: 00007f25b8a16800(0000) GS:ffff88024fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000018 CR3: 0000000230576000 CR4: 00000000000007e0
    Stack:
    ffff8802451c9720 0000000000000002 ffffffff81a97100 0000000000281250
    ffffffff81a96480 ffff88024fc99150 ffff880228d18200 0000000000000000
    0000000000000000 0000000000000040 ffff880230e8c2e8 ffff8802459dc900
    Call Trace:
    [] bio_integrity_alloc+0x48/0x1b0
    [] bio_integrity_prep+0xac/0x360
    [] ? mempool_alloc+0x58/0x150
    [] ? alloc_extent_state+0x31/0x110 [btrfs]
    [] blk_queue_bio+0x1c9/0x460
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x79/0x160
    [] btrfs_map_bio+0x48e/0x5b0 [btrfs]
    [] btree_submit_bio_hook+0xda/0x110 [btrfs]
    [] submit_one_bio+0x6a/0xa0 [btrfs]
    [] read_extent_buffer_pages+0x250/0x310 [btrfs]
    [] ? __radix_tree_preload+0x66/0xf0
    [] ? radix_tree_insert+0x95/0x260
    [] btree_read_extent_buffer_pages.constprop.128+0xb6/0x120
    [btrfs]
    [] read_tree_block+0x3a/0x60 [btrfs]
    [] open_ctree+0x139d/0x2030 [btrfs]
    [] btrfs_mount+0x53a/0x7d0 [btrfs]
    [] ? pcpu_alloc+0x8eb/0x9f0
    [] ? __kmalloc_track_caller+0x35/0x1e0
    [] mount_fs+0x20/0xd0
    [] vfs_kern_mount+0x76/0x120
    [] do_mount+0x200/0xa40
    [] ? strndup_user+0x5b/0x80
    [] SyS_mount+0x90/0xe0
    [] system_call_fastpath+0x1a/0x1f
    Code: 4c 8d 75 a8 4c 89 6d e8 45 89 e0 4c 8d 6f 30 48 89 5d d8 41 83 e0 af 48
    89 fb 49 83 c6 18 4c 89 7d f8 65 4c 8b 3c 25 c0 b8 00 00 8b 73 18 44 89 c7
    44 89 45 98 ff 53 20 48 85 c0 48 89 c2 74
    RIP [] mempool_alloc+0x4a/0x150
    RSP
    CR2: 0000000000000018
    ---[ end trace 7a96042017ed21e2 ]---

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Darrick J. Wong
     
  • Chris Mason
     
  • Pull CIFS fixes from Steve French:
    "Small set of cifs fixes. Most important is Jeff's fix that works
    around disconnection problems which can be caused by simultaneous use
    of user space tools (starting a long running smbclient backup then
    doing a cifs kernel mount) or multiple cifs mounts through a NAT, and
    Jim's fix to deal with reexport of cifs share.

    I expect to send two more cifs fixes next week (being tested now) -
    fixes to address an SMB2 unmount hang when server dies and a fix for
    cifs symlink handling of Windows "NFS" symlinks"

    * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
    [CIFS] update cifs.ko version
    [CIFS] Remove ext2 flags that have been moved to fs.h
    [CIFS] Provide sane values for nlink
    cifs: stop trying to use virtual circuits
    CIFS: FS-Cache: Uncache unread pages in cifs_readpages() before freeing them

    Linus Torvalds
     
  • Pull xfs bugfixes from Ben Myers:
    "There are lockdep annotations for project quotas, a fix for dirent
    dtype support on v4 filesystems, a fix for a memory leak in recovery,
    and a fix for the build error that resulted from it. D'oh"

    * tag 'xfs-for-linus-v3.12-rc4' of git://oss.sgi.com/xfs/xfs:
    xfs: Use kmem_free() instead of free()
    xfs: fix memory leak in xlog_recover_add_to_trans
    xfs: dirent dtype presence is dependent on directory magic numbers
    xfs: lockdep needs to know about 3 dquot-deep nesting

    Linus Torvalds
     
  • free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev,
    can be processed before btrfs_scratch_superblock is called, which would
    result in a use-after-free on btrfs_device contents. Fix this by
    zeroing the superblock before the rcu callback is registered.

    Cc: Stefan Behrens
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik

    Ilya Dryomov
     
  • The current implementation of worker threads in Btrfs has races in
    worker stopping code, which cause all kinds of panics and lockups when
    running btrfs/011 xfstest in a loop. The problem is that
    btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
    check_busy_worker and __btrfs_start_workers.

    E.g., check_idle_worker race flow:

    btrfs_stop_workers(): check_idle_worker(aworker):
    - grabs the lock
    - splices the idle list into the
    working list
    - removes the first worker from the
    working list
    - releases the lock to wait for
    its kthread's completion
    - grabs the lock
    - if aworker is on the working list,
    moves aworker from the working list
    to the idle list
    - releases the lock
    - grabs the lock
    - puts the worker
    - removes the second worker from the
    working list
    ......
    btrfs_stop_workers returns, aworker is on the idle list
    FS is umounted, memory is freed
    ......
    aworker is woken up, fireworks ensue

    With this applied, I wasn't able to trigger the problem in 48 hours,
    whereas previously I could reliably reproduce at least one of these
    races within an hour.

    Reported-by: David Sterba
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik

    Ilya Dryomov
     
  • The crash[1] was found by xfstests/generic/208 with "-o compress".
    It does not reproduce every time, but it does panic.

    The bug is quite interesting, it's actually introduced by a recent commit
    (573aecafca1cf7a974231b759197a1aebcf39c2a,
    Btrfs: actually limit the size of delalloc range).

    Btrfs implements delayed allocation, so during writeback, we
    (1) get a page A and lock it
    (2) search the state tree for delalloc bytes and lock all pages within the range
    (3) process the delalloc range, including find disk space and create
    ordered extent and so on.
    (4) submit the page A.

    It runs well in normal cases, but if we're in a racy case, eg.
    buffered compressed writes and aio-dio writes,
    sometimes we may fail to lock all pages in the 'delalloc' range,
    in which case, we need to fall back to search the state tree again with
    a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).

    The mentioned commit has a side effect, that is, in the fallback case,
    we can find delalloc bytes before the index of the page we already have locked,
    so we're in the case of (delalloc_end <= *start) and return with (found > 0).

    This ends with not locking delalloc pages but making ->writepage still
    process them, and the crash happens.

    This fixes it by treating the case as if nothing was found and returning to
    the caller, which knows how to deal with it properly.

    [1]:
    ------------[ cut here ]------------
    kernel BUG at mm/page-writeback.c:2170!
    [...]
    CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
    [...]
    RIP: 0010:[] [] clear_page_dirty_for_io+0x1e/0x83
    [...]
    [ 4934.248731] Stack:
    [ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
    [ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
    [ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
    [ 4934.248731] Call Trace:
    [ 4934.248731] [] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
    [ 4934.248731] [] compress_file_range+0x1dc/0x4cb [btrfs]
    [ 4934.248731] [] ? detach_if_pending+0x22/0x4b
    [ 4934.248731] [] async_cow_start+0x35/0x53 [btrfs]
    [ 4934.248731] [] worker_loop+0x14b/0x48c [btrfs]
    [ 4934.248731] [] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
    [ 4934.248731] [] kthread+0x8d/0x95
    [ 4934.248731] [] ? kthread_freezable_should_stop+0x43/0x43
    [ 4934.248731] [] ret_from_fork+0x7c/0xb0
    [ 4934.248731] [] ? kthread_freezable_should_stop+0x43/0x43
    [ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
    [ 4934.248731] RIP [] clear_page_dirty_for_io+0x1e/0x83
    [ 4934.248731] RSP
    [ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
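
    The "treat it as finding nothing" rule might be sketched as below. The
    helper name, the inclusive end offset and the exact comparison are
    assumptions for illustration, not taken from the patch itself.

```python
def usable_delalloc_range(found, delalloc_start, delalloc_end, locked_page_start):
    """Return the (start, end) range to process, or None for "nothing found".

    A range that ends at or before the page we already hold locked is
    reported as empty, so the caller never processes pages it did not lock.
    """
    if not found or delalloc_end <= locked_page_start:
        return None
    return max(delalloc_start, locked_page_start), delalloc_end


if __name__ == "__main__":
    # Range entirely before the locked page at offset 8192: treated as empty.
    print(usable_delalloc_range(1, 0, 4095, 8192))
    # Range covering the locked page: start is moved up to the page.
    print(usable_delalloc_range(1, 4096, 16383, 8192))
```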
     
  • If we crash with a log, remount and recover that log, and then crash before we
    can commit another transaction we will get transid verify errors on the next
    mount. This is because we were not zero'ing out the log when we committed the
    transaction after recovery. This is ok as long as we commit another transaction
    at some point in the future, but if you abort or something else goes wrong you
    can end up in this weird state because the recovery stuff says that the tree log
    should have a generation+1 of the super generation, which won't be the case of
    the transaction that was started for recovery. Fix this by removing the check
    and _always_ zero out the log portion of the super when we commit a transaction.
    This fixes the transid verify issues I was seeing with my force errors tests.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This fixes a build failure caused by calling the free() function which
    does not exist in the Linux kernel.

    Signed-off-by: Thierry Reding
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    (cherry picked from commit aaaae98022efa4f3c31042f1fdf9e7a0c5f04663)

    Thierry Reding
     
  • Free the memory in error path of xlog_recover_add_to_trans().
    Normally this memory is freed in recovery pass2, but is leaked
    in the error path.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Eric Sandeen
    Signed-off-by: Ben Myers

    (cherry picked from commit 519ccb81ac1c8e3e4eed294acf93be00b43dcad6)

    tinguely@sgi.com
     
  • The determination of whether a directory entry contains a dtype
    field originally was dependent on the filesystem having CRCs
    enabled. This meant that the format for dtype being enabled could be
    determined by checking the directory block magic number rather than
    doing a feature bit check. This was useful in that it meant that we
    didn't need to pass a struct xfs_mount around to functions that
    were already supplied with a directory block header.

    Unfortunately, the introduction of dtype fields into the v4
    structure via a feature bit meant this "use the directory block
    magic number" method of discriminating the dirent entry sizes is
    broken. Hence we need to convert the places that use magic number
    checks to use feature bit checks so that they work correctly and not
    by chance.

    The current code works on v4 filesystems only because the dirent
    size roundup covers the extra byte needed by the dtype field in the
    places where this problem occurs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit 367993e7c6428cb7617ab7653d61dca54e2fdede)

    Dave Chinner
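
    A hedged sketch of why a magic-number check can appear to work. The field
    sizes (8-byte inode number, 1-byte name length, 2-byte tag, optional
    1-byte dtype, 8-byte alignment) are assumptions about the data entry
    layout, and all function names are hypothetical.

```python
def roundup(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def dirent_size(namelen, has_ftype):
    """Size of one directory data entry under the assumed layout."""
    size = 8 + 1 + namelen + 2  # inode number, name length, name, tag
    if has_ftype:
        size += 1  # extra dtype byte
    return roundup(size, 8)


def has_ftype_by_magic(is_v5_magic):
    """Old test: infer the dtype field from the block magic number alone."""
    return is_v5_magic


def has_ftype_by_feature(is_v5, v4_ftype_bit):
    """New test: a v4 filesystem can also carry dtype via a feature bit."""
    return is_v5 or v4_ftype_bit


if __name__ == "__main__":
    for namelen in (1, 2, 5):
        print(namelen, dirent_size(namelen, False), dirent_size(namelen, True))
```

    For one- and two-byte names both sizes round to the same value, so a
    wrong answer from the magic-number test goes unnoticed. A five-byte name
    gives different sizes in this model.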
     
  • Michael Semon reported that xfs/299 generated this lockdep warning:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.12.0-rc2+ #2 Not tainted
    ---------------------------------------------
    touch/21072 is trying to acquire lock:
    (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    but task is already holding lock:
    (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&xfs_dquot_other_class);
    lock(&xfs_dquot_other_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    7 locks held by touch/21072:
    #0: (sb_writers#10){++++.+}, at: [] mnt_want_write+0x1e/0x3e
    #1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [] do_last+0x245/0xe40
    #2: (sb_internal#2){++++.+}, at: [] xfs_trans_alloc+0x1f/0x35
    #3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [] xfs_ilock+0x100/0x1f1
    #4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [] xfs_ilock_nowait+0x105/0x22f
    #5: (&dqp->q_qlock){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64
    #6: (&xfs_dquot_other_class){+.+...}, at: [] xfs_trans_dqlockedjoin+0x57/0x64

    The lockdep annotation for dquot lock nesting only understands
    locking for user and "other" dquots, not user, group and project
    dquots. Fix the annotations to match the locking hierarchy we now
    have.

    Reported-by: Michael L. Semon
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    (cherry picked from commit f112a049712a5c07de25d511c3c6587a2b1a015e)

    Dave Chinner
     
  • Pull fuse bugfixes from Miklos Szeredi:
    "This contains two more fixes by Maxim for writeback/truncate races and
    fixes for RCU walk in fuse_dentry_revalidate()"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: no RCU mode in fuse_access()
    fuse: readdirplus: fix RCU walk
    fuse: don't check_submounts_and_drop() in RCU walk
    fuse: fix fallocate vs. ftruncate race
    fuse: wait for writeback in fuse_file_fallocate()

    Linus Torvalds
     

03 Oct, 2013

1 commit


02 Oct, 2013

2 commits


01 Oct, 2013

7 commits

  • fuse_access() is never called in RCU walk, only on the final component of
    access(2) and chdir(2)...

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Doing dput(parent) is not valid in RCU walk mode. In RCU mode it would
    probably be okay to update the parent flags, but it's actually not
    necessary most of the time...

    So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
    recently initialized by READDIRPLUS.

    This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
    READDIRPLUS and only dropping out of RCU mode if this flag is set.
    FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
    the parent.

    Reported-by: Al Viro
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
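
    The flag handshake described above can be sketched as a small state
    model. The Inode class and the function names are hypothetical. Returning
    -ECHILD stands for "drop out of RCU mode and retry", and 1 stands for a
    valid entry.

```python
import errno

INIT_RDPLUS = "FUSE_I_INIT_RDPLUS"
ADVISE_RDPLUS = "FUSE_I_ADVISE_RDPLUS"


class Inode:
    def __init__(self):
        self.state = set()


def add_from_readdirplus(inode):
    """Mark an entry as freshly initialized by READDIRPLUS."""
    inode.state.add(INIT_RDPLUS)


def revalidate(inode, parent, rcu_walk):
    """Only leave RCU mode when the parent flag actually needs updating."""
    if INIT_RDPLUS in inode.state:
        if rcu_walk:
            return -errno.ECHILD  # retry outside RCU walk mode
        inode.state.discard(INIT_RDPLUS)
        parent.state.add(ADVISE_RDPLUS)
    return 1
```

    Entries that were not added by READDIRPLUS never force a drop out of RCU
    mode in this model.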
     
  • If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
    with it instead of calling check_submounts_and_drop() which is not prepared
    for being called from RCU walk.

    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
     
  • Pull NFS client bugfixes from Trond Myklebust:
    - Stable fix for Oopses in the pNFS files layout driver
    - Fix a regression when doing a non-exclusive file create on NFSv4.x
    - NFSv4.1 security negotiation fixes when looking up the root
    filesystem
    - Fix a memory ordering issue in the pNFS files layout driver

    * tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: Give "flavor" an initial value to fix a compile warning
    NFSv4.1: try SECINFO_NO_NAME flavs until one works
    NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
    NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
    NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton.

    * emailed patches from Andrew Morton: (22 commits)
    pidns: fix free_pid() to handle the first fork failure
    ipc,msg: prevent race with rmid in msgsnd,msgrcv
    ipc/sem.c: update sem_otime for all operations
    mm/hwpoison: fix the lack of one reference count against poisoned page
    mm/hwpoison: fix false report on 2nd attempt at page recovery
    mm/hwpoison: fix test for a transparent huge page
    mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
    block: change config option name for cmdline partition parsing
    mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
    mm: avoid reinserting isolated balloon pages into LRU lists
    arch/parisc/mm/fault.c: fix uninitialized variable usage
    include/asm-generic/vtime.h: avoid zero-length file
    nilfs2: fix issue with race condition of competition between segments for dirty blocks
    Documentation/kernel-parameters.txt: replace kernelcore with Movable
    mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
    kernel/kmod.c: check for NULL in call_usermodehelper_exec()
    ipc/sem.c: synchronize the proc interface
    ipc/sem.c: optimize sem_lock()
    ipc/sem.c: fix race in sem_lock()
    mm/compaction.c: periodically schedule when freeing pages
    ...

    Linus Torvalds
     
  • Many NILFS2 users have reported strange file system corruption
    (for example):

    NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
    NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)

    But such error messages are the consequence of a file system issue that
    takes place much earlier. Fortunately, Jerome Poulin and Anton Eliasson
    recently reported another issue. Their reports describe a crash of the
    segctor thread:

    BUG: unable to handle kernel paging request at 0000000000004c83
    IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Call Trace:
    nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
    nilfs_segctor_construct+0x17b/0x290 [nilfs2]
    nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
    kthread+0xc0/0xd0
    ret_from_fork+0x7c/0xb0

    These two issues have the same cause. This cause can also raise a third
    issue: the segctor thread hangs while consuming 100% CPU.

    REPRODUCING PATH:

    One possible way of reproducing the issue was described by
    Jerome Poulin:

    1. init S to get to single user mode.
    2. sysrq+E to make sure only my shell is running
    3. start network-manager to get my wifi connection up
    4. login as root and launch "screen"
    5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
    6. lscp | xz -9e > lscp.txt.xz
    7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
    8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
    9. start a screen and launch strace -f -o find-cat.log -t find
    /mnt/nilfs -type f -exec cat {} > /dev/null \;
    10. start a screen and launch strace -f -o apt-get.log -t apt-get update
    11. launch the last command again as it did not crash the first time
    12. apt-get crashes
    13. ps aux > ps-aux-crashed.log
    14. sysrq+W
    15. sysrq+E wait for everything to terminate
    16. sysrq+SUSB

    A simplified way of reproducing the issue is to start a kernel compilation
    task and "apt-get update" in parallel.

    REPRODUCIBILITY:

    The issue does not reproduce reliably [60% - 80%]. It is very important to
    have the proper environment for reproducing it. The critical
    conditions for successful reproduction are:

    (1) There should be a big file modified through mmap().

    (2) From time to time during processing, this file should have more dirty
    blocks than fit in several segments (for example, two or three).

    (3) There should be intensive background file modification activity
    in another thread.

    INVESTIGATION:

    First of all, it is possible to see that the reason for the crash is an
    invalid page address:

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
    NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783

    Moreover, the value of b_page (0x1a82) is 6786. This value looks like a
    segment number. And the b_blocknr and b_size values look like block numbers.
    So, the buffer_head pointer points to an improper address.

    Detailed investigation of the issue revealed the following picture:

    [-----------------------------SEGMENT 6783-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783

    [-----------------------------SEGMENT 6784-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15

    [-----------------------------SEGMENT 6785-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12

    NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82

    BUG: unable to handle kernel paging request at 0000000000001a82
    IP: [] nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Usually, for every segment we collect dirty files in a list. Then, dirty
    blocks are gathered for every dirty file, prepared for write and
    submitted by means of a nilfs_segbuf_submit_bh() call. Finally, the
    complete write phase takes place after nilfs_end_bio_write() is called on
    the block layer. Buffers/pages are marked as not dirty in the final phase
    and processed files are removed from the list of dirty files.

    It is possible to see that we had three prepare_write and submit_bio
    phases before the segbuf_wait and complete_write phases. Moreover, segments
    compete with each other for dirty blocks because on every iteration
    of segment processing dirty buffer_heads are added to several lists of
    payload_buffers:

    [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8

    The next pointer is the same but the prev pointer has changed. It means
    that the buffer_head has a next pointer from one list but a prev pointer
    from another. Such modification can be made several times. And, finally, it
    can result in various issues: (1) segctor hanging, (2) segctor
    crashing, (3) file system metadata corruption.

    FIX:
    This patch adds:

    (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
    for every processed dirty block;

    (2) checking of BH_Async_Write flag in
    nilfs_lookup_dirty_data_buffers() and
    nilfs_lookup_dirty_node_buffers();

    (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
    nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().

    Reported-by: Jerome Poulin
    Reported-by: Anton Eliasson
    Cc: Paul Fertser
    Cc: ARAI Shun-ichi
    Cc: Piotr Szymaniak
    Cc: Juan Barry Manuel Canham
    Cc: Zahid Chowdhury
    Cc: Elmer Zhang
    Cc: Kenneth Langga
    Signed-off-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
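
    The three-part fix listed above (set, check, clear) can be modelled with a
    toy buffer object. This is only a sketch: the Buffer class and function
    names are hypothetical, and async_write stands in for BH_Async_Write.

```python
class Buffer:
    """Toy stand-in for a dirty buffer_head."""

    def __init__(self):
        self.dirty = True
        self.async_write = False


def lookup_dirty_buffers(buffers):
    """Collect dirty buffers, skipping those already claimed by a segment."""
    return [b for b in buffers if b.dirty and not b.async_write]


def prepare_write(segment_buffers):
    """Claim the buffers for this segment before they are submitted."""
    for b in segment_buffers:
        b.async_write = True


def complete_write(segment_buffers):
    """Release the claim and mark the buffers clean after the write ends."""
    for b in segment_buffers:
        b.async_write = False
        b.dirty = False
```

    With the check in place, a second segment built before the first one
    completes no longer picks up the same buffers, so a buffer cannot end up
    linked into two payload lists at once.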
     
  • A high setting of max_map_count, and a process core-dumping with a large
    enough vm_map_count could result in an NT_FILE note not being written,
    and the kernel crashing immediately later because it has assumed
    otherwise.

    Reproduction of the oops-causing bug described here:

    https://lkml.org/lkml/2013/8/30/50

    The issue originated in commit 2aa362c49c31 ("coredump: extend core dump
    note section to contain file names of mapped file") from Oct 4, 2012.

    This patch makes that section optional in that case. fill_files_note()
    should signify the error, and also let the info struct in
    elf_core_dump() be zero-initialized so that we can check for the
    optionally written note.

    [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Dan Aloni
    Cc: Al Viro
    Cc: Denys Vlasenko
    Reported-by: Martin MOKREJS
    Tested-by: Martin MOKREJS
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Aloni
     

30 Sep, 2013

7 commits


29 Sep, 2013

1 commit

  • Pull xfs bugfixes from Ben Myers:
    - fix for directory node collapse regression
    - fix for recovery over stale on disk structures
    - fix for eofblocks ioctl
    - fix asserts in xfs_inode_free
    - lock the ail before removing an item from it

    * tag 'xfs-for-linus-v3.12-rc3' of git://oss.sgi.com/xfs/xfs:
    xfs: fix node forward in xfs_node_toosmall
    xfs: log recovery lsn ordering needs uuid check
    xfs: fix XFS_IOC_FREE_EOFBLOCKS definition
    xfs: asserting lock not held during freeing not valid
    xfs: lock the AIL before removing the buffer item

    Linus Torvalds