30 Dec, 2015

3 commits

  • We have found a BUG on res->migration_pending when migrating lock
    resources. The situation is as follows.

    dlm_mark_lockres_migrating
      res->migration_pending = 1;
      __dlm_lockres_reserve_ast
      dlm_lockres_release_ast returns with res->migration_pending still
          set because other threads have reserved asts
      wait until dlm_migration_can_proceed returns 1
      >>>>>>> o2hb finds that the target went down and removes it
              from domain_map
      dlm_migration_can_proceed returns 1
      dlm_mark_lockres_migrating returns -ESHUTDOWN with
          res->migration_pending still set.

    When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
    with res->migration_pending. So clear migration_pending when target is
    down.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
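
The migration_pending fix described above can be sketched roughly as follows. This is a simplified model, not the ocfs2 DLM code: `struct lockres`, `mark_migrating()` and the `target_alive` parameter are hypothetical names, and the kernel's BUG_ON is modelled with assert().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock resource; only the one flag matters. */
struct lockres {
    bool migration_pending;
};

/*
 * Simplified model of the mark-migrating step. Entering with the flag
 * already set is treated as a programming error, as the BUG_ON does.
 */
int mark_migrating(struct lockres *res, bool target_alive)
{
    assert(!res->migration_pending);
    res->migration_pending = true;

    if (!target_alive) {
        /* The fix: do not leave the flag behind on the error path. */
        res->migration_pending = false;
        return -ESHUTDOWN;
    }

    /* Normal path: migration proceeds and the flag is consumed. */
    res->migration_pending = false;
    return 0;
}
```

With the flag cleared on the error path, a later retry of `mark_migrating()` on the same resource no longer trips the assertion.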
     
  • Commit 4f6563677ae8 ("Move locks API users to locks_lock_inode_wait()")
    moved the flock/posix lock identification code to locks_lock_inode_wait(),
    but failed to set fl_flags to FL_FLOCK, which caused the following kernel
    panic on 4.4.0-rc5.

    kernel BUG at fs/locks.c:1895!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ocfs2(O) ocfs2_dlmfs(O) ocfs2_stack_o2cb(O) ocfs2_dlm(O) ocfs2_nodemanager(O) ocfs2_stackglue(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
    CPU: 0 PID: 20268 Comm: flock_unit_test Tainted: G O 4.4.0-rc5-next-20151217 #1
    Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
    task: ffff88007b3672c0 ti: ffff880028b58000 task.ti: ffff880028b58000
    RIP: locks_lock_inode_wait+0x2e/0x160
    Call Trace:
    ocfs2_do_flock+0x91/0x160 [ocfs2]
    ocfs2_flock+0x76/0xd0 [ocfs2]
    SyS_flock+0x10f/0x1a0
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 48 81 ec 88 00 00 00 8b 46 40 83 e0 03 83 f8 01 0f 84 ad 00 00 00 83 f8 02 74 04 0b eb fe 4c 8d ad 60 ff ff ff 4c 8d 7b 58 e8 0e 8e 73 00 4d
    RIP locks_lock_inode_wait+0x2e/0x160
    RSP
    ---[ end trace dfca74ec9b5b274c ]---

    Fixes: 4f6563677ae8 ("Move locks API users to locks_lock_inode_wait()")
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • When resizing, the code first extends the last group descriptor (gd).
    If a backup super belongs in that gd, it calculates the new backup super
    and updates the corresponding values.

    But it currently doesn't consider the case where the backup super has
    already been set up. In that case it still sets the bit in the gd bitmap
    and then decrements bg_free_bits_count, which leads to a corrupted gd and
    triggers the BUG in ocfs2_block_group_set_bits:

    BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);

    So check whether the backup super is done and then do the updates.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
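
The "check first, then update" idea from the resize fix can be illustrated with a minimal sketch. The `struct group_desc` layout and `reserve_backup_bit()` are hypothetical and far smaller than the real ocfs2 group descriptor; `bit` is assumed to be below 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, tiny group descriptor: 64 bits of bitmap plus a free count. */
struct group_desc {
    uint64_t bitmap;
    uint16_t free_bits;
};

/*
 * Mark the backup-superblock bit as used. Returns true if the descriptor
 * was changed, false if the bit was already set.
 */
bool reserve_backup_bit(struct group_desc *gd, unsigned int bit)
{
    uint64_t mask = UINT64_C(1) << bit;

    if (gd->bitmap & mask)
        return false;           /* backup already done: nothing to update */

    assert(gd->free_bits > 0);  /* models the BUG_ON on the free count */
    gd->bitmap |= mask;
    gd->free_bits--;
    return true;
}
```

Calling the function twice for the same bit leaves the free count untouched the second time, which is the property the missing check failed to guarantee.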
     

23 Dec, 2015

1 commit


19 Dec, 2015

2 commits

  • Pull btrfs fixes from Chris Mason:
    "A couple of small fixes"

    * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: check prepare_uptodate_page() error code earlier
    Btrfs: check for empty bitmap list in setup_cluster_bitmaps
    btrfs: fix misleading warning when space cache failed to load
    Btrfs: fix transaction handle leak in balance
    Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list

    Linus Torvalds
     
  • Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit
    774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed
    the setting of ret after the get_proc_task call and incorrectly left it as
    -ESRCH. Instead, return 0 when successful.

    Example breakage:

    echo 0 > /proc/self/coredump_filter
    bash: echo: write error: No such process

    Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()")
    Signed-off-by: Colin Ian King
    Acked-by: Kees Cook
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
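
The return-value bug in the coredump_filter write path follows a common pattern, sketched below under simplifying assumptions. `struct task` and `filter_write()` are hypothetical; a NULL task stands in for a failed get_proc_task() lookup.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; NULL models "no such process". */
struct task { unsigned long filter; };

/*
 * Simplified write handler. Returns the number of bytes consumed on
 * success or a negative errno. The bug pattern is initialising ret to
 * -ESRCH and never overwriting it once the task lookup has succeeded.
 */
long filter_write(struct task *task, unsigned long val, size_t count)
{
    int ret = -ESRCH;

    if (!task)
        goto out;

    ret = 0;                /* the fix: lookup succeeded, clear the error */
    task->filter = val;
out:
    return ret < 0 ? ret : (long)count;
}
```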
     

17 Dec, 2015

1 commit

  • We do need to serialize layout stateid morphing operations, but we
    currently hold the ls_mutex across a layout recall which is pretty
    ugly. It's also unnecessary -- once we've bumped the seqid and
    copied it, we don't need to serialize the rest of the CB_LAYOUTRECALL
    vs. anything else. Just drop the mutex once the copy is done.

    This was causing a "workqueue leaked lock or atomic" warning and an
    occasional deadlock.

    There's more work to be done here but this fixes the immediate
    regression.

    Fixes: cc8a55320b5f "nfsd: serialize layout stateid morphing operations"
    Cc: stable@vger.kernel.org
    Reported-by: Kinglong Mee
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     

16 Dec, 2015

3 commits

  • …manana/linux into for-linus-4.4

    Chris Mason
     
  • prepare_pages() may end up calling prepare_uptodate_page() twice if our
    write only spans a single page. But if the first call returns an error,
    our page will be unlocked and it's not safe to call it again.

    This bug goes all the way back to 2011, and it's not something commonly
    hit.

    While we're here, add a more explicit check for the page being truncated
    away. The bare lock_page() alone is protected only by good thoughts and
    i_mutex, which we're sure to regret eventually.

    Reported-by: Dave Jones
    Signed-off-by: Chris Mason

    Chris Mason
     
  • Dave Jones found a warning from kasan in setup_cluster_bitmaps()

    ==================================================================
    BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
    addr ffff88039bef6828
    Read of size 8 by task nfsd/1009
    page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null)
    index:0x0
    flags: 0x8000000000000000()
    page dumped because: kasan: bad access detected
    CPU: 1 PID: 1009 Comm: nfsd Tainted: G W
    4.4.0-rc3-backup-debug+ #1
    ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
    0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
    ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
    Call Trace:
    dump_stack+0x4b/0x6d
    kasan_report_error+0x501/0x520
    ? debug_show_all_locks+0x1e0/0x1e0
    kasan_report+0x58/0x60
    ? rb_last+0x10/0x40
    ? setup_cluster_bitmap+0xc4/0x5a0
    __asan_load8+0x5d/0x70
    setup_cluster_bitmap+0xc4/0x5a0
    ? setup_cluster_no_bitmap+0x6a/0x400
    btrfs_find_space_cluster+0x4b6/0x640
    ? btrfs_alloc_from_cluster+0x4e0/0x4e0
    ? btrfs_return_cluster_to_free_space+0x9e/0xb0
    ? _raw_spin_unlock+0x27/0x40
    find_free_extent+0xba1/0x1520

    Andrey noticed this was because we were doing list_first_entry on a list
    that might be empty. Rework the tests a bit so we don't do that.

    Signed-off-by: Chris Mason
    Reported-by: Andrey Ryabinin
    Reported-by: Dave Jones

    Chris Mason
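
Why taking the first entry of a possibly empty list is dangerous can be shown with a small sketch in the style of the kernel's list.h. The list type and `container_of` macro are re-declared here so the example is self-contained; `struct bitmap_entry` and `first_bitmap()` are hypothetical names.

```c
#include <stddef.h>

/* Minimal circular doubly linked list in the style of the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct bitmap_entry {
    int id;
    struct list_head node;
};

static int list_is_empty(const struct list_head *head)
{
    return head->next == head;
}

/*
 * Return the first entry, or NULL for an empty list. Without the emptiness
 * check, container_of() would be applied to the list head itself and yield
 * a pointer to memory that is not a bitmap_entry at all.
 */
struct bitmap_entry *first_bitmap(struct list_head *head)
{
    if (list_is_empty(head))
        return NULL;
    return container_of(head->next, struct bitmap_entry, node);
}
```

When the head lives on the stack, as in the report above, the bogus pointer lands in a neighbouring stack slot, which matches the stack-out-of-bounds read that KASAN flagged.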
     

14 Dec, 2015

2 commits

  • Jan Stancek reported that I wrecked things for him by fixing things for
    Vladimir :/

    His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
    should not be possible, however my previous patch made this possible by
    unconditionally checking signal_pending().

    We cannot use current->state as was done previously, because it can
    change as early as the instruction after the store to that variable. We
    must instead pass the initial state along and use that.

    Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
    Reported-by: Jan Stancek
    Reported-by: Chris Mason
    Tested-by: Jan Stancek
    Tested-by: Vladimir Murzin
    Tested-by: Chris Mason
    Reviewed-by: Paul Turner
    Cc: Ingo Molnar
    Cc: tglx@linutronix.de
    Cc: Oleg Nesterov
    Cc: hpa@zytor.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
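
The idea of passing the wait mode along, rather than reading it back from mutable task state, can be sketched as below. `enum wait_mode` and `bit_wait_result()` are hypothetical stand-ins, not the scheduler's real types or helpers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduler's sleep states. */
enum wait_mode { WAIT_INTERRUPTIBLE, WAIT_UNINTERRUPTIBLE };

/*
 * Decide what a wait helper returns after waking up. The mode is passed in
 * by the caller rather than read back from mutable per-task state, so an
 * uninterruptible wait can never report -EINTR.
 */
int bit_wait_result(enum wait_mode mode, bool signal_pending)
{
    if (mode == WAIT_INTERRUPTIBLE && signal_pending)
        return -EINTR;
    return 0;
}
```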
     
  • Pull NFS client bugfix from Trond Myklebust:
    "SUNRPC: Fix a NFSv4.1 callback channel regression"

    * tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    SUNRPC: Fix callback channel

    Linus Torvalds
     

13 Dec, 2015

4 commits

  • Merge misc fixes from Andrew Morton:
    "17 fixes"

    * emailed patches from Andrew Morton:
    MIPS: fix DMA contiguous allocation
    sh64: fix __NR_fgetxattr
    ocfs2: fix SGID not inherited issue
    mm/oom_kill.c: avoid attempting to kill init sharing same memory
    drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
    tmpfs: fix shmem_evict_inode() warnings on i_blocks
    mm/hugetlb.c: fix resv map memory leak for placeholder entries
    mm: hugetlb: call huge_pte_alloc() only if ptep is null
    kernel: remove stop_machine() Kconfig dependency
    mm: kmemleak: mark kmemleak_init prototype as __init
    mm: fix kerneldoc on mem_cgroup_replace_page
    osd fs: __r4w_get_page rely on PageUptodate for uptodate
    MAINTAINERS: make Vladimir co-maintainer of the memory controller
    mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
    mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo
    memcg: fix memory.high target
    mm: hugetlb: fix hugepage memory leak caused by wrong reserve count

    Linus Torvalds
     
  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series. This contains:

    - A bunch of fixes for lightnvm, should be the last round for this
    series. From Matias and Wenwei.

    - A writeback detach inode fix from Ilya, also marked for stable.

    - A block (though it says SCSI) fix for an OOPS in SCSI runtime power
    management.

    - Module init error path fixes for null_blk from Minfei"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: Fix error path in module initialization
    lightnvm: do not compile in debugging by default
    lightnvm: prevent gennvm module unload on use
    lightnvm: fix media mgr registration
    lightnvm: replace req queue with nvmdev for lld
    lightnvm: comments on constants
    lightnvm: check mm before use
    lightnvm: refactor spin_unlock in gennvm_get_blk
    lightnvm: put blks when luns configure failed
    lightnvm: use flags in rrpc_get_blk
    block: detach bdev inode from its wb in __blkdev_put()
    SCSI: Fix NULL pointer dereference in runtime PM

    Linus Torvalds
     
  • Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an
    issue: the SGID of a subdirectory was not inherited from its parent
    directory. This is because SGID is set in "inode->i_mode" in
    ocfs2_get_init_inode(), but is later overwritten by "mode", which does
    not have SGID set.

    Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Acked-by: Srinivas Eeda
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
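
The SGID inheritance rule at the heart of this fix can be sketched in a few lines. `init_child_mode()` and its `is_dir` parameter are hypothetical; the sketch only shows that the inherited bit must be part of the mode that is finally stored.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Compute the mode of a new inode. If the parent directory has S_ISGID set,
 * a new subdirectory inherits the bit. The important part is that the
 * inherited bit is folded into the mode that is finally stored, instead of
 * being overwritten by the caller's original mode afterwards.
 */
mode_t init_child_mode(mode_t parent_mode, mode_t requested_mode, int is_dir)
{
    mode_t mode = requested_mode;

    if ((parent_mode & S_ISGID) && is_dir)
        mode |= S_ISGID;
    return mode;
}
```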
     
  • Commit 42cb14b110a5 ("mm: migrate dirty page without
    clear_page_dirty_for_io etc") simplified the migration of a PageDirty
    pagecache page: one stat needs moving from zone to zone and that's about
    all.

    It's convenient and safest for it to shift the PageDirty bit from old
    page to new, just before updating the zone stats: before copying data
    and marking the new PageUptodate. This is all done while both pages are
    isolated and locked, just as before; and just as before, there's a
    moment when the new page is visible in the radix_tree, but not yet
    PageUptodate. What's new is that it may now be briefly visible as
    PageDirty before it is PageUptodate.

    When I scoured the tree to see if this could cause a problem anywhere,
    the only places I found were in two similar functions __r4w_get_page():
    which look up a page with find_get_page() (not using page lock), then
    claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.

    I'm not sure whether that was right before, but now it might be wrong
    (on rare occasions): only claim the page is uptodate if PageUptodate.
    Or perhaps the page in question could never be migratable anyway?

    Signed-off-by: Hugh Dickins
    Tested-by: Boaz Harrosh
    Cc: Benny Halevy
    Cc: Trond Myklebust
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Dec, 2015

1 commit


10 Dec, 2015

4 commits

  • When an inconsistent space cache is detected during loading we log a
    warning that users frequently mistake as instruction to invalidate the
    cache manually, even though this is not required. Fix the message to
    indicate that the cache will be rebuilt automatically.

    Signed-off-by: Holger Hoffstätte
    Acked-by: Filipe Manana

    Holger Hoffstätte
     
  • If we fail to allocate a new data chunk, we were jumping to the error path
    without releasing the transaction handle we got before. Fix this by always
    releasing it before doing the jump.

    Fixes: 2c9fe8355258 ("btrfs: Fix lost-data-profile caused by balance bg")
    Signed-off-by: Filipe Manana

    Filipe Manana
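
The shape of the handle-leak fix, releasing the handle before branching to the error path, can be sketched as follows. The transaction layer here is a hypothetical counter, and `alloc_ok` models whether the chunk allocation succeeds; none of these names come from btrfs.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical transaction layer that only counts open handles. */
static int open_handles;

static void start_transaction(void) { open_handles++; }
static void end_transaction(void)   { open_handles--; }

int open_handle_count(void) { return open_handles; }

/*
 * Allocate a chunk inside a transaction. The handle is released before
 * taking the error path, so no exit from this function leaves it open.
 */
int alloc_chunk_in_transaction(bool alloc_ok)
{
    int ret;

    start_transaction();
    ret = alloc_ok ? 0 : -ENOSPC;
    end_transaction();      /* released on success and on failure alike */
    if (ret < 0)
        goto error;
    return 0;
error:
    return ret;
}
```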
     
  • As of my previous change titled "Btrfs: fix scrub preventing unused block
    groups from being deleted", the following warning at
    extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
    filesystem with "-o discard":

    void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
    {
        (...)
            if (trimming) {
                WARN_ON(!list_empty(&block_group->bg_list));
                spin_lock(&trans->transaction->deleted_bgs_lock);
                list_move(&block_group->bg_list,
                          &trans->transaction->deleted_bgs);
                spin_unlock(&trans->transaction->deleted_bgs_lock);
                btrfs_get_block_group(block_group);
            }
        (...)

    This happens because scrub can now add back the block group to the list of
    unused block groups (fs_info->unused_bgs). This is dangerous because we
    are moving the block group from the unused block groups list to the list
    of deleted block groups without holding the lock that protects the source
    list (fs_info->unused_bgs_lock).

    The following diagram illustrates how this happens:

                CPU 1                                 CPU 2

    cleaner_kthread()
      btrfs_delete_unused_bgs()

        sees bg X in list
         fs_info->unused_bgs

        deletes bg X from list
         fs_info->unused_bgs

                                            scrub_enumerate_chunks()

                                              searches device tree using
                                              its commit root

                                              finds device extent for
                                              block group X

                                              gets block group X from the tree
                                              fs_info->block_group_cache_tree
                                              (via btrfs_lookup_block_group())

                                              sets bg X to RO (again)

                                              scrub_chunk(bg X)

                                              sets bg X back to RW mode

                                              adds bg X to the list
                                              fs_info->unused_bgs again,
                                              since it's still unused and
                                              currently not in that list

        sets bg X to RO mode

        btrfs_remove_chunk(bg X)

        --> discard is enabled and bg X
            is in the fs_info->unused_bgs
            list again so the warning is
            triggered
        --> we move it from that list into
            the transaction's deleted_bgs
            list, but we can have another
            task currently manipulating
            the first list (fs_info->unused_bgs)

    Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
    the list of unused block groups and the list of deleted block groups. This
    makes it safe and there's not much worry for more lock contention, as this
    lock is seldom used and only the cleaner kthread adds elements to the list
    of deleted block groups. The warning goes away too, as this was previously
    an impossible case (and would have been better a BUG_ON/ASSERT) but it's
    not impossible anymore.
    Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

    Signed-off-by: Filipe Manana

    Filipe Manana
     
  • Pull vfs fixes from Al Viro:
    "A couple of fixes, both -stable fodder (9p one all way back to 2.6.32,
    dio - to all branches where "Fix negative return from dio read beyond
    eof" will end up in; it's a fixup to commit marked for -stable)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix the regression from "direct-io: Fix negative return from dio read beyond eof"
    9p: ->evict_inode() should kick out ->i_data, not ->i_mapping

    Linus Torvalds
     

09 Dec, 2015

2 commits

  • Sure, it's better to bail out of past-the-eof read and return 0 than return
    a bogus negative value on such. Only we'd better make sure we are bailing out
    with 0 and not -ENOMEM...

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
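
The "bail out with 0, not -ENOMEM" point can be illustrated with a simplified read routine. `direct_read()` and its parameters are hypothetical; the sketch assumes, as the message implies, that the return variable starts out holding -ENOMEM for an earlier allocation step.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Simplified direct read: returns bytes read, 0 at or past end of file, or
 * a negative errno. ret starts out as -ENOMEM for the allocation step, so
 * the early EOF exit has to reset it explicitly.
 */
long direct_read(long offset, size_t len, long file_size, int alloc_ok)
{
    long ret = -ENOMEM;

    if (!alloc_ok)
        goto out;

    if (offset >= file_size) {
        ret = 0;            /* past EOF is not an error: report 0 bytes */
        goto out;
    }

    ret = file_size - offset;
    if ((size_t)ret > len)
        ret = (long)len;
out:
    return ret;
}
```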
     
  • For block devices the pagecache is associated with the inode
    on bdevfs, not with the aliasing ones on the mountable filesystems.
    The latter have their own ->i_data empty and ->i_mapping pointing
    to the (unique per major/minor) bdevfs inode. That guarantees
    cache coherence between all block device inodes with the same
    device number.

    Eviction of an alias inode has no business trying to evict the
    pages belonging to bdevfs one; moreover, ->i_mapping is only
    safe to access when the thing is opened. At the time of
    ->evict_inode() the victim is definitely *not* opened. We are
    about to kill the address space embedded into struct inode
    (inode->i_data) and that's what we need to empty of any pages.

    9p instance tries to empty inode->i_mapping instead, which is
    both unsafe and bogus - if we have several device nodes with
    the same device number in different places, closing one of them
    should not try to empty the (shared) page cache.

    Fortunately, other instances in the tree are OK; they are
    evicting from &inode->i_data instead, as 9p one should.

    Cc: stable@vger.kernel.org # v2.6.32+, ones prior to 2.6.36 need only half of that
    Reported-by: "Suzuki K. Poulose"
    Tested-by: "Suzuki K. Poulose"
    Signed-off-by: Al Viro

    Al Viro
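
The distinction between the embedded address space and the possibly shared one can be sketched with two small structs. These are heavily reduced, hypothetical versions of `struct inode` and `struct address_space` that keep only a page count.

```c
#include <stddef.h>

/* Reduced stand-in for an address space: just a page count. */
struct address_space { size_t nrpages; };

struct inode {
    struct address_space i_data;      /* embedded, owned by this inode */
    struct address_space *i_mapping;  /* may point at another inode's i_data */
};

/*
 * Drop only the pages owned by the inode being evicted. Going through
 * victim->i_mapping instead could empty a page cache shared with other
 * inodes that alias the same device.
 */
void evict_pages(struct inode *victim)
{
    victim->i_data.nrpages = 0;
}
```

Evicting an alias inode whose `i_mapping` points at another inode's `i_data` then leaves the shared pages alone.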
     

08 Dec, 2015

2 commits

  • The NFSv4.1 callback channel is currently broken: the receive message
    keeps shrinking because the backchannel receive buffer size never gets
    reset.
    The easiest solution to this problem is to adjust the copied request
    rather than change the receive buffer.

    Fixes: 38b7631fbe42 ("nfs4: limit callback decoding to received bytes")
    Cc: Benjamin Coddington
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Trond Myklebust
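
The approach of trimming the copied request instead of the long-lived receive buffer can be sketched as below. `struct recv_buf`, `struct request` and `make_request()` are hypothetical and reduce both objects to a single length field.

```c
#include <stddef.h>

/* Hypothetical buffers: a long-lived receive buffer and a per-request copy. */
struct recv_buf { size_t len; };
struct request  { size_t len; };

/*
 * Build the per-request view of a received message. Only the copy is
 * trimmed to the number of bytes received; the shared receive buffer keeps
 * its full size for the next message.
 */
struct request make_request(const struct recv_buf *buf, size_t received)
{
    struct request req = { buf->len };

    if (received < req.len)
        req.len = received;
    return req;
}
```

Because the receive buffer is only read, a short message cannot shrink the space available to a later, longer one.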
     
  • Pull ext4 fixes from Ted Ts'o:
    "Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings,
    some endian conversion problems with ext4 encryption, potential memory
    leaks after truncate in data=journal mode, and an ocfs2 regression
    caused by a jbd2 performance improvement"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: fix null committed data return in undo_access
    ext4: add "static" to ext4_seq_##name##_fops struct
    ext4: fix an endianness bug in ext4_encrypted_follow_link()
    ext4: fix an endianness bug in ext4_encrypted_zeroout()
    jbd2: Fix unreclaimed pages after truncate in data=journal mode
    ext4: Fix handling of extended tv_sec

    Linus Torvalds
     

07 Dec, 2015

4 commits

  • Pull vfs fixes from Al Viro:
    "A couple of fixes (-stable fodder) + dead code removal after the
    overlayfs fix.

    I agree that it's better to separate from the fix part to make
    backporting easier, but IMO it's not worth delaying said dead code
    removal until the next window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Don't reset ->total_link_count on nested calls of vfs_path_lookup()
    ovl: get rid of the dead code left from broken (and disabled) optimizations
    ovl: fix permission checking for setattr

    Linus Torvalds
     
  • we already zero it on outermost set_nameidata(), so initialization in
    path_init() is pointless and wrong. The same DoS exists on pre-4.2
    kernels, but there a slightly different fix will be needed.

    Cc: stable@vger.kernel.org # v4.2
    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • [Al Viro] The bug is in being too enthusiastic about optimizing ->setattr()
    away - instead of "copy verbatim with metadata" + "chmod/chown/utimes"
    (with the former being always safe and the latter failing in case of
    insufficient permissions) it tries to combine these two. Note that copyup
    itself will have to do ->setattr() anyway; _that_ is where the elevated
    capabilities are right. Having these two ->setattr() (one to set verbatim
    copy of metadata, another to do what overlayfs ->setattr() had been asked
    to do in the first place) combined is where it breaks.

    Signed-off-by: Miklos Szeredi
    Cc:
    Signed-off-by: Al Viro

    Miklos Szeredi
     

05 Dec, 2015

2 commits

  • Since 52ebea749aae ("writeback: make backing_dev_info host
    cgroup-specific bdi_writebacks") inode, at some point in its lifetime,
    gets attached to a wb (struct bdi_writeback). Detaching happens on
    evict, in inode_detach_wb() called from __destroy_inode(), and involves
    updating wb.

    However, detaching an internal bdev inode from its wb in
    __destroy_inode() is too late. Its bdi and by extension root wb are
    embedded into struct request_queue, which has different lifetime rules
    and can be freed long before the final bdput() is called (can be from
    __fput() of a corresponding /dev inode, through dput() - evict() -
    bd_forget()). bdevs hold onto the underlying disk/queue pair only while
    opened; as soon as bdev is closed all bets are off. In fact,
    disk/queue can be gone before __blkdev_put() even returns:

    static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
    {
    ...
            if (bdev->bd_contains == bdev) {
                    if (disk->fops->release)
                            disk->fops->release(disk, mode);

        [ Driver puts its references to disk/queue ]

            }
            if (!bdev->bd_openers) {
                    struct module *owner = disk->fops->owner;

                    disk_put_part(bdev->bd_part);
                    bdev->bd_part = NULL;
                    bdev->bd_disk = NULL;
                    if (bdev != bdev->bd_contains)
                            victim = bdev->bd_contains;
                    bdev->bd_contains = NULL;

                    put_disk(disk);

        [ We put ours, the queue is gone
          The last bdput() would result in a write to invalid memory ]

                    module_put(owner);
    ...
    }

    Since bdev inodes are special anyway, detach them in __blkdev_put()
    after clearing inode's dirty bits, turning the problematic
    inode_detach_wb() in __destroy_inode() into a noop.

    add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make
    gendisk hold a reference to its queue"), so the old ->release comment
    is removed in favor of the new inode_detach_wb() comment.

    Cc: stable@vger.kernel.org # 4.2+, needs backporting
    Signed-off-by: Ilya Dryomov
    Acked-by: Tejun Heo
    Tested-by: Raghavendra K T
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • Commit de92c8caf16c ("jbd2: speedup
    jbd2_journal_get_[write|undo]_access()") introduced
    jbd2_write_access_granted() to improve write|undo_access speed, but
    failed to check the status of b_committed_data, which caused a kernel
    panic on ocfs2.

    [ 6538.405938] ------------[ cut here ]------------
    [ 6538.406686] kernel BUG at fs/ocfs2/suballoc.c:2400!
    [ 6538.406686] invalid opcode: 0000 [#1] SMP
    [ 6538.406686] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront xen_fbfront parport_pc parport pcspkr i2c_piix4 acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix cirrus ttm drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect i2c_core syscopyarea dm_mirror dm_region_hash dm_log dm_mod
    [ 6538.406686] CPU: 1 PID: 16265 Comm: mmap_truncate Not tainted 4.3.0 #1
    [ 6538.406686] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
    [ 6538.406686] task: ffff88007c2bab00 ti: ffff880075b78000 task.ti: ffff880075b78000
    [ 6538.406686] RIP: 0010: ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
    [ 6538.406686] RSP: 0018:ffff880075b7b7f8 EFLAGS: 00010246
    [ 6538.406686] RAX: ffff8800760c5b40 RBX: ffff88006c06a000 RCX: ffffffffa06e6df0
    [ 6538.406686] RDX: 0000000000000000 RSI: ffff88007a6f6ea0 RDI: ffff88007a760430
    [ 6538.406686] RBP: ffff880075b7b878 R08: 0000000000000002 R09: 0000000000000001
    [ 6538.406686] R10: ffffffffa06769be R11: 0000000000000000 R12: 0000000000000001
    [ 6538.406686] R13: ffffffffa06a1750 R14: 0000000000000001 R15: ffff88007a6f6ea0
    [ 6538.406686] FS: 00007f17fde30720(0000) GS:ffff88007f040000(0000) knlGS:0000000000000000
    [ 6538.406686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 6538.406686] CR2: 0000000000601730 CR3: 000000007aea0000 CR4: 00000000000406e0
    [ 6538.406686] Stack:
    [ 6538.406686] ffff88007c2bb5b0 ffff880075b7b8e0 ffff88007a7604b0 ffff88006c640800
    [ 6538.406686] ffff88007a7604b0 ffff880075d77390 0000000075b7b878 ffffffffa06a309d
    [ 6538.406686] ffff880075d752d8 ffff880075b7b990 ffff880075b7b898 0000000000000000
    [ 6538.406686] Call Trace:
    [ 6538.406686] ? ocfs2_read_group_descriptor+0x6d/0xa0 [ocfs2]
    [ 6538.406686] _ocfs2_free_suballoc_bits+0xe4/0x320 [ocfs2]
    [ 6538.406686] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] _ocfs2_free_clusters+0xee/0x210 [ocfs2]
    [ 6538.406686] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] ? ocfs2_put_slot+0xf0/0xf0 [ocfs2]
    [ 6538.406686] ? ocfs2_extend_trans+0x50/0x1a0 [ocfs2]
    [ 6538.406686] ocfs2_free_clusters+0x15/0x20 [ocfs2]
    [ 6538.406686] ocfs2_replay_truncate_records+0xfc/0x290 [ocfs2]
    [ 6538.406686] ? ocfs2_start_trans+0xec/0x1d0 [ocfs2]
    [ 6538.406686] __ocfs2_flush_truncate_log+0x140/0x2d0 [ocfs2]
    [ 6538.406686] ? ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x170 [ocfs2]
    [ 6538.406686] ocfs2_remove_btree_range+0x374/0x630 [ocfs2]
    [ 6538.406686] ? jbd2_journal_stop+0x25b/0x470 [jbd2]
    [ 6538.406686] ocfs2_commit_truncate+0x305/0x670 [ocfs2]
    [ 6538.406686] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
    [ 6538.406686] ocfs2_truncate_file+0x297/0x380 [ocfs2]
    [ 6538.406686] ? jbd2_journal_begin_ordered_truncate+0x64/0xc0 [jbd2]
    [ 6538.406686] ocfs2_setattr+0x572/0x860 [ocfs2]
    [ 6538.406686] ? current_fs_time+0x3f/0x50
    [ 6538.406686] notify_change+0x1d7/0x340
    [ 6538.406686] ? generic_getxattr+0x79/0x80
    [ 6538.406686] do_truncate+0x66/0x90
    [ 6538.406686] ? __audit_syscall_entry+0xb0/0x110
    [ 6538.406686] do_sys_ftruncate.clone.0+0xf3/0x120
    [ 6538.406686] SyS_ftruncate+0xe/0x10
    [ 6538.406686] entry_SYSCALL_64_fastpath+0x12/0x71
    [ 6538.406686] Code: 28 48 81 ee b0 04 00 00 48 8b 92 50 fb ff ff 48 8b 80 b0 03 00 00 48 39 90 88 00 00 00 0f 84 30 fe ff ff 0f 0b eb fe 0f 0b eb fe 0b 0f 1f 00 eb fb 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
    [ 6538.406686] RIP ocfs2_block_group_clear_bits+0x23b/0x250 [ocfs2]
    [ 6538.406686] RSP
    [ 6538.691128] ---[ end trace 31cd7011d6770d7e ]---
    [ 6538.694492] Kernel panic - not syncing: Fatal exception
    [ 6538.695484] Kernel Offset: disabled

    Fixes: de92c8caf16c ("jbd2: speedup jbd2_journal_get_[write|undo]_access()")
    Cc:
    Signed-off-by: Junxiao Bi
    Signed-off-by: Theodore Ts'o

    Junxiao Bi
     

04 Dec, 2015

2 commits

  • Pull networking fixes from David Miller:
    "A lot of Thanksgiving turkey leftovers accumulated, here goes:

    1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

    2) IDs for some new iwlwifi chips, from Oren Givon.

    3) Fix rtlwifi lockups on boot, from Larry Finger.

    4) Fix memory leak in fm10k, from Stephen Hemminger.

    5) We have a route leak in the ipv6 tunnel infrastructure, fix from
    Paolo Abeni.

    6) Fix buffer pointer handling in arm64 bpf JIT, from Zi Shen Lim.

    7) Wrong lockdep annotations in tcp md5 support, fix from Eric
    Dumazet.

    8) Work around some middle boxes which prevent proper handling of TCP
    Fast Open, from Yuchung Cheng.

    9) TCP repair can do huge kmalloc() requests, build paged SKBs
    instead. From Eric Dumazet.

    10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
    Borkmann.

    11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
    Nikolay Aleksandrov.

    12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
    Weikusat.

    13) Fix double free in VRF code, from Nikolay Aleksandrov.

    14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

    15) Fix ifup/ifdown crash in xgene driver, from Iyappan Subramanian.

    16) Fix clearing of persistent array maps in bpf, from Daniel
    Borkmann.

    17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
    early enough. From Eric Dumazet.

    18) Fix out of bounds accesses in bpf array implementation when
    updating elements, from Daniel Borkmann.

    19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
    Dumazet.

    20) When dumping proxy neigh entries, we have to accommodate NULL
    device pointers properly, from Konstantin Khlebnikov.

    21) SCTP doesn't release all ipv6 socket resources properly, fix from
    Eric Dumazet.

    22) Prevent underflows of sch->q.qlen for multiqueue packet
    schedulers, also from Eric Dumazet.

    23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
    Huang and Michael Chan.

    24) Don't actively scan radar channels, from Antonio Quartulli"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
    net: phy: reset only targeted phy
    bnxt_en: Setup uc_list mac filters after resetting the chip.
    bnxt_en: enforce proper storing of MAC address
    bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
    net: lpc_eth: remove irq > NR_IRQS check from probe()
    net_sched: fix qdisc_tree_decrease_qlen() races
    openvswitch: fix hangup on vxlan/gre/geneve device deletion
    ipv4: igmp: Allow removing groups from a removed interface
    ipv6: sctp: implement sctp_v6_destroy_sock()
    arm64: bpf: add 'store immediate' instruction
    ipv6: kill sk_dst_lock
    ipv6: sctp: add rcu protection around np->opt
    net/neighbour: fix crash at dumping device-agnostic proxy entries
    sctp: use GFP_USER for user-controlled kmalloc
    sctp: convert sack_needed and sack_generation to bits
    ipv6: add complete rcu protection around np->opt
    bpf: fix allocation warnings in bpf maps and integer overflow
    mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
    net: mvneta: enable setting custom TX IP checksum limit
    net: mvneta: fix error path for building skb
    ...

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "A collection of fixes from this series. The most important here is a
    regression fix for an issue that some folks would hit in blk-merge.c,
    and the NVMe queue depth limit for the screwed up Apple "nvme"
    controller.

    In more detail, this pull request contains:

    - a set of fixes for null_blk, including a fix for a few corner cases
    where we could hang the device. From Arianna and Paolo.

    - lightnvm:
    - A build improvement from Keith.
    - Update the qemu pci id detection from Matias.
    - Error handling fixes for leaks and other little fixes from
    Sudip and Wenwei.

    - fix from Eric where BLKRRPART would not return EBUSY for whole
    device mounts, only when partitions were mounted.

    - fix from Jan Kara, where O_DIRECT reads starting beyond EOF would
    return a negative value.

    - remove check for rq_mergeable() when checking limits for cloned
    requests. The check doesn't make any sense. It's assuming that
    since NOMERGE is set on the request that we don't have to
    recalculate limits since the request didn't change, but that's not
    true if the request has been redirected. From Hannes.

    - correctly get the bio front segment value set for single segment
    bio's, fixing a BUG() in blk-merge. From Ming"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvme: temporary fix for Apple controller reset
    null_blk: change type of completion_nsec to unsigned long
    null_blk: guarantee device restart in all irq modes
    null_blk: set a separate timer for each command
    blk-merge: fix computing bio->bi_seg_front_size in case of single segment
    direct-io: Fix negative return from dio read beyond eof
    block: Always check queue limits for cloned requests
    lightnvm: missing nvm_lock acquire
    lightnvm: unconverted ppa returned in get_bb_tbl
    lightnvm: refactor and change vendor id for qemu
    lightnvm: do device max sectors boundary check first
    lightnvm: fix ioctl memory leaks
    lightnvm: free memory when gennvm register fails
    lightnvm: Simplify config when disabled
    Return EBUSY from BLKRRPART for mounted whole-dev fs

    Linus Torvalds
     

02 Dec, 2015

1 commit

  • This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that
    the following patch can change their implementation.
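
    A minimal userspace sketch of why such helpers help: callers name only
    the bit and the sock, so a later patch can change where the flag word
    lives without touching them. The struct layouts, the MODEL_* constants
    and the sk_test_bit() helper are assumptions made for illustration, and
    the plain bitwise operations stand in for the kernel's atomic bit
    operations.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct socket_model {
    unsigned long flags;            /* flag word the helpers operate on */
};

struct sock_model {
    struct socket_model *sk_socket; /* back-pointer from sock to socket */
};

/* Illustrative bit numbers; the real constants and values may differ. */
enum {
    MODEL_ASYNC_NOSPACE  = 0,
    MODEL_ASYNC_WAITDATA = 1
};

/* Set bit `nr` in the flag word reached through the sock. */
void sk_set_bit(int nr, struct sock_model *sk)
{
    assert(sk && sk->sk_socket);
    sk->sk_socket->flags |= 1UL << nr;
}

/* Clear bit `nr` in the flag word reached through the sock. */
void sk_clear_bit(int nr, struct sock_model *sk)
{
    assert(sk && sk->sk_socket);
    sk->sk_socket->flags &= ~(1UL << nr);
}

/* Hypothetical query helper, added only so the sketch can be checked. */
int sk_test_bit(int nr, const struct sock_model *sk)
{
    assert(sk && sk->sk_socket);
    return (sk->sk_socket->flags >> nr) & 1UL;
}
```

    Because every caller goes through the helpers, moving the flag word to
    another structure would only require editing the helper bodies.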

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Dec, 2015

1 commit

  • Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
    we issue a direct IO read at offset 1024, blockdev_direct_IO() reads the
    tail of the last block and the logic for handling short DIO reads in
    dio_complete() results in a return value of -24 (1000 - 1024), which
    obviously confuses userspace.

    Fix the problem by bailing out early once we sample i_size and can
    reliably check that the direct IO read starts beyond i_size.
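
    The arithmetic above can be reproduced with a small userspace model.
    This is a sketch under stated assumptions, not the kernel code: the
    function names are hypothetical, and the clamping rule is inferred from
    the description of the short-read handling in dio_complete().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the short-read handling described above: after the I/O has run,
 * a read that extends past i_size is clamped to end at i_size.
 * Hypothetical helper, not a kernel function.
 */
int64_t dio_read_result_unfixed(int64_t i_size, int64_t offset,
                                int64_t transferred)
{
    assert(i_size >= 0 && offset >= 0 && transferred >= 0);
    if (offset + transferred > i_size)
        transferred = i_size - offset;  /* negative when offset > i_size */
    return transferred;
}

/*
 * Model of the fix: once i_size is known, bail out before doing any I/O
 * when the read starts at or beyond it. Hypothetical helper.
 */
int64_t dio_read_result_fixed(int64_t i_size, int64_t offset,
                              int64_t transferred)
{
    assert(i_size >= 0 && offset >= 0 && transferred >= 0);
    if (offset >= i_size)
        return 0;                       /* nothing to read: report EOF */
    return dio_read_result_unfixed(i_size, offset, transferred);
}
```

    With i_size = 1000, offset = 1024 and 3072 bytes transferred (the tail
    of the 4KB block), the unfixed model returns 1000 - 1024 = -24, while
    the fixed model returns 0.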

    Reported-by: Avi Kivity
    Fixes: 9fe55eea7e4b444bafc42fa0000cc2d1d2847275
    CC: stable@vger.kernel.org
    CC: Steven Whitehouse
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

28 Nov, 2015

2 commits

  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix a NFSv4 callback identifier leak that was also causing client
    crashes
    - Fix NFSv4 callback decoding issues when incoming requests are
    truncated
    - Don't declare the attribute cache valid when we call
    nfs_update_inode with an empty attribute structure.
    - Resend LAYOUTGET when there is a race that changes the seqid

    Bugfixes:
    - Fix a number of issues with the NFSv4.2 CLONE ioctl()
    - Properly set NFS v4.2 NFSDBG_FACILITY
    - NFSv4 referrals are broken; Cleanup FATTR4_WORD0_FS_LOCATIONS after
    decoding success
    - Use sliding delay when LAYOUTGET gets NFS4ERR_DELAY
    - Ensure that attrcache is revalidated after a SETATTR"

    * tag 'nfs-for-4.4-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs4: resend LAYOUTGET when there is a race that changes the seqid
    nfs: if we have no valid attrs, then don't declare the attribute cache valid
    nfs: ensure that attrcache is revalidated after a SETATTR
    nfs4: limit callback decoding to received bytes
    nfs4: start callback_ident at idr 1
    nfs: use sliding delay when LAYOUTGET gets NFS4ERR_DELAY
    NFS4: Cleanup FATTR4_WORD0_FS_LOCATIONS after decoding success
    NFS: Properly set NFS v4.2 NFSDBG_FACILITY
    nfs: reduce the amount of ifdefs for v4.2 in nfs4file.c
    nfs: use btrfs ioctl definitions for clone
    nfs: allow intra-file CLONE
    nfs: offer native ioctls even if CONFIG_COMPAT is set
    nfs: pass on count for CLONE operations

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "This has Mark Fasheh's patches to fix quota accounting during subvol
    deletion, which we've been working on for a while now. The patch is
    pretty small but it's a key fix.

    Otherwise it's a random assortment"

    * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: fix balance range usage filters in 4.4-rc
    btrfs: qgroup: account shared subtree during snapshot delete
    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
    btrfs: qgroup: fix quota disable during rescan
    Btrfs: fix race between cleaner kthread and space cache writeout
    Btrfs: fix scrub preventing unused block groups from being deleted
    Btrfs: fix race between scrub and block group deletion
    btrfs: fix rcu warning during device replace
    btrfs: Continue replace when set_block_ro failed
    btrfs: fix clashing number of the enhanced balance usage filter
    Btrfs: fix the number of transaction units needed to remove a block group
    Btrfs: use global reserve when deleting unused block group after ENOSPC
    Btrfs: tests: checking for NULL instead of IS_ERR()
    btrfs: fix signed overflows in btrfs_sync_file

    Linus Torvalds
     

27 Nov, 2015

3 commits