10 Oct, 2018

1 commit

  • commit cbe355f57c8074bc4f452e5b6e35509044c6fa23 upstream.

    In dlm_init_lockres() we access and modify res->tracking and
    dlm->tracking_list without holding dlm->track_lock. This can cause list
    corruptions and can end up in kernel panic.

    Fix this by locking res->tracking and dlm->tracking_list with
    dlm->track_lock instead of dlm->spinlock.

    Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Changwei Ge
    Acked-by: Joseph Qi
    Acked-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Ashish Samant
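
The rule behind the dlm_init_lockres() fix above (touch a list only while holding the lock that guards it) can be sketched in userspace C. This is a minimal illustration, not the kernel patch: `struct tracker`, `struct tracked_res` and `tracker_add()` are hypothetical stand-ins for dlm->tracking_list, res->tracking and the list insertion, and a pthread mutex stands in for the dlm->track_lock spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock resource linked via res->tracking. */
struct tracked_res {
    struct tracked_res *next;
    int id;
};

/* Hypothetical stand-in for the dlm context. */
struct tracker {
    pthread_mutex_t track_lock;   /* guards head and count, nothing else */
    struct tracked_res *head;
    size_t count;
};

/* Link res into the tracking list while holding the lock that guards it. */
void tracker_add(struct tracker *t, struct tracked_res *res)
{
    assert(t != NULL && res != NULL);

    pthread_mutex_lock(&t->track_lock);
    res->next = t->head;
    t->head = res;
    t->count++;
    pthread_mutex_unlock(&t->track_lock);
}
```

Holding a different lock (the analogue of dlm->spinlock) around the same statements would not exclude other code paths that take only `track_lock`, which is the mismatch the commit message describes.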
     

29 Sep, 2018

1 commit

  • commit 234b69e3e089d850a98e7b3145bd00e9b52b1111 upstream.

    While reading a block, an I/O error may be returned because of an underlying
    storage issue. In that case, BH_NeedsValidate was left set in the buffer head.
    The next time the same block was read, if it was already linked into the
    journal, that triggered the following panic.

    [203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
    [203748.702533] invalid opcode: 0000 [#1] SMP
    [203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
    [203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
    [203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
    [203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
    [203748.703088] RIP: 0010:[] [] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
    [203748.703130] RSP: 0018:ffff88006ff4b818 EFLAGS: 00010206
    [203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
    [203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
    [203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
    [203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
    [203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
    [203748.705871] FS: 00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
    [203748.706370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
    [203748.707124] Stack:
    [203748.707371] ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
    [203748.707885] 0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
    [203748.708399] 00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
    [203748.708915] Call Trace:
    [203748.709175] [] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
    [203748.709680] [] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
    [203748.710185] [] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
    [203748.710691] [] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
    [203748.711204] [] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
    [203748.711716] [] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
    [203748.712227] [] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
    [203748.712737] [] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
    [203748.713003] [] ocfs2_create+0x65/0x170 [ocfs2]
    [203748.713263] [] vfs_create+0xdb/0x150
    [203748.713518] [] do_last+0x815/0x1210
    [203748.713772] [] ? path_init+0xb9/0x450
    [203748.714123] [] path_openat+0x80/0x600
    [203748.714378] [] ? handle_pte_fault+0xd15/0x1620
    [203748.714634] [] do_filp_open+0x3a/0xb0
    [203748.714888] [] ? __alloc_fd+0xa7/0x130
    [203748.715143] [] do_sys_open+0x12c/0x220
    [203748.715403] [] ? syscall_trace_enter_phase1+0x11b/0x180
    [203748.715668] [] ? system_call_after_swapgs+0xe9/0x190
    [203748.715928] [] SyS_open+0x1e/0x20
    [203748.716184] [] system_call_fastpath+0x18/0xd7
    [203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
    [203748.717505] RIP [] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
    [203748.717775] RSP

    Joseph once reported a similar panic.
    Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html

    Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
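
The failure mode described above, a "needs validation" flag that survives a failed read, can be modelled with a small userspace sketch. Everything here is hypothetical (`struct toy_buffer`, the `BUF_*` flags, `toy_read_block()`); it only illustrates one plausible shape of the cleanup on the error path and is not the code from the upstream commit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BUF_UPTODATE        (1u << 0)  /* contents are valid */
#define BUF_NEEDS_VALIDATE  (1u << 1)  /* contents still have to be checked */

/* Hypothetical stand-in for a buffer head. */
struct toy_buffer {
    unsigned int flags;
};

/*
 * io_status simulates the storage layer: 0 on success, a negative errno
 * on failure. The flag is set before the read is "submitted" and must be
 * cleared on every exit path, including the error path.
 */
int toy_read_block(struct toy_buffer *bh, int io_status)
{
    assert(bh != NULL);

    bh->flags |= BUF_NEEDS_VALIDATE;

    if (io_status < 0) {
        /* Error path: do not leave the stale flag behind. */
        bh->flags &= ~BUF_NEEDS_VALIDATE;
        return io_status;
    }

    /* A real implementation would validate the block contents here. */
    bh->flags &= ~BUF_NEEDS_VALIDATE;
    bh->flags |= BUF_UPTODATE;
    return 0;
}
```

With the flag cleared on failure, a later read of the same buffer starts from a clean state instead of tripping over a leftover marker.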
     

22 Jul, 2018

2 commits

  • commit 3e4c56d41eef5595035872a2ec5a483f42e8917f upstream.

    ip_alloc_sem should be taken in ocfs2_get_block() when reading a file in
    DIRECT mode to prevent concurrent access to the extent tree with
    ocfs2_dio_end_io_write(), which may trigger a BUG_ON in the following
    situation:

    read file 'A'                              end_io of writing file 'A'
    vfs_read
     __vfs_read
      ocfs2_file_read_iter
       generic_file_read_iter
        ocfs2_direct_IO
         __blockdev_direct_IO
          do_blockdev_direct_IO
           do_direct_IO
            get_more_blocks
             ocfs2_get_block
              ocfs2_extent_map_get_blocks
               ocfs2_get_clusters
                ocfs2_get_clusters_nocache()
                 ocfs2_search_extent_list
                  return the index of record which
                  contains the v_cluster, that is
                  v_cluster > rec[i]->e_cpos.
                                               ocfs2_dio_end_io
                                                ocfs2_dio_end_io_write
                                                 down_write(&oi->ip_alloc_sem);
                                                 ocfs2_mark_extent_written
                                                  ocfs2_change_extent_flag
                                                   ocfs2_split_extent
                                                    ...
                                                 --> modify the rec[i]->e_cpos, resulting
                                                     in v_cluster < rec[i]->e_cpos.
                 BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))

    [alex.chen@huawei.com: v3]
    Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
    Fixes: c15471f79506 ("ocfs2: fix sparse file & data ordering issue in direct io")
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Reviewed-by: Gang He
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Salvatore Bonaccorso
    Signed-off-by: Greg Kroah-Hartman

    alex chen
     
  • commit 853bc26a7ea39e354b9f8889ae7ad1492ffa28d2 upstream.

    The subsystem.su_mutex is required while accessing the item->ci_parent,
    otherwise, NULL pointer dereference to the item->ci_parent will be
    triggered in the following situation:

    add node                     delete node
    sys_write
     vfs_write
      configfs_write_file
       o2nm_node_store
        o2nm_node_local_write
                                 do_rmdir
                                  vfs_rmdir
                                   configfs_rmdir
                                    mutex_lock(&subsys->su_mutex);
                                    unlink_obj
                                     item->ci_group = NULL;
                                     item->ci_parent = NULL;
         to_o2nm_cluster_from_node
          node->nd_item.ci_parent->ci_parent
          BUG due to NULL pointer dereference of nd_item.ci_parent

    Moreover, the o2nm_cluster also should be protected by the
    subsystem.su_mutex.

    [alex.chen@huawei.com: v2]
    Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com
    Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Salvatore Bonaccorso
    Signed-off-by: Greg Kroah-Hartman

    alex chen
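
The su_mutex entry above combines two ideas: read the parent pointer only while holding the mutex that the unlink path also takes, and tolerate a parent that has already been cleared. A minimal userspace sketch follows. `struct toy_item`, `struct toy_subsys`, `toy_cluster_from_node()` and `toy_node_store()` are hypothetical names that loosely mirror item->ci_parent, subsystem.su_mutex and the o2nm store path; the -EINVAL return value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a configfs item with a parent link. */
struct toy_item {
    struct toy_item *parent;
};

/* Hypothetical stand-in for the subsystem that owns su_mutex. */
struct toy_subsys {
    pthread_mutex_t su_mutex;
};

/*
 * Return the grandparent (the "cluster") of a node, or NULL if the node
 * has already been unlinked. The caller must hold su_mutex.
 */
struct toy_item *toy_cluster_from_node(struct toy_item *node)
{
    return node->parent ? node->parent->parent : NULL;
}

/* Store-path sketch: look the cluster up under the mutex, fail if gone. */
int toy_node_store(struct toy_subsys *s, struct toy_item *node)
{
    struct toy_item *cluster;
    int ret = 0;

    assert(s != NULL && node != NULL);

    pthread_mutex_lock(&s->su_mutex);
    cluster = toy_cluster_from_node(node);
    if (cluster == NULL)
        ret = -EINVAL;   /* node was removed concurrently */
    /* else: the cluster may be used safely here, still under the mutex */
    pthread_mutex_unlock(&s->su_mutex);

    return ret;
}
```

Because the removal path clears the parent pointer under the same mutex, the store path sees either a fully linked node or a NULL parent, never a pointer that disappears between the check and the use.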
     

21 Jun, 2018

1 commit

  • [ Upstream commit e4383029201470523c3ffe339bd7d57e9b4a7d65 ]

    While reflinking an inode, we create a new inode in orphan directory,
    then take EX lock on it, reflink the original inode to orphan inode and
    release EX lock. Once the lock is released another node could request
    it in EX mode from ocfs2_recover_orphans() which causes downconvert of
    the lock, on this node, to NL mode.

    Later we attempt to initialize security acl for the orphan inode and
    move it to the reflink destination. However, while doing this we don't
    take EX lock on the inode. This could potentially cause problems
    because we could be starting transaction, accessing journal and
    modifying metadata of the inode while holding NL lock and with another
    node holding EX lock on the inode.

    Fix this by taking orphan inode cluster lock in EX mode before
    initializing security and moving orphan inode to reflink destination.
    Use the __tracker variant while taking inode lock to avoid recursive
    locking in the ocfs2_init_security_and_acl() call chain.

    Link: http://lkml.kernel.org/r/1523475107-7639-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Joseph Qi
    Reviewed-by: Junxiao Bi
    Acked-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ashish Samant
     

30 May, 2018

1 commit

  • [ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]

    We should not handle migrate lockres if we are already in
    'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres to remain after leaving
    the dlm domain. Eventually other nodes will get stuck in an infinite loop
    when requesting a lock from us.

    The problem is caused by concurrent umount between nodes. Before
    receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked N1 as the
    migrate target. So N2 will continue sending lockres to N1 even though
    N1 has left the domain.

    N1                             N2 (owner)
                                   touch file

    access the file,
    and get pr lock

                                   begin leave domain and
                                   pick up N1 as new owner

    begin leave domain and
    migrate all lockres done

                                   begin migrate lockres to N1

    end leave domain, but
    the lockres left
    unexpectedly, because
    migrate task has passed

    [piaojun@huawei.com: v3]
    Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
    Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jun Piao
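
The guard described in the entry above (refuse incoming migration once the domain is shutting down) can be sketched as a simple state check. `enum toy_dlm_state`, `struct toy_domain` and `toy_handle_migrate()` are hypothetical; the real handler, its state names other than DLM_CTXT_IN_SHUTDOWN, and its return code are not shown in the commit message, so -EINVAL is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical domain states, loosely modelled on DLM_CTXT_*. */
enum toy_dlm_state {
    TOY_DLM_JOINED,
    TOY_DLM_IN_SHUTDOWN,
    TOY_DLM_LEFT
};

/* Hypothetical stand-in for the dlm context. */
struct toy_domain {
    enum toy_dlm_state state;
    unsigned int nr_lockres;   /* lock resources this node masters */
};

/*
 * Handle an incoming "migrate lockres" request. A node that is no longer
 * fully joined must refuse it, otherwise the lock resource would be left
 * behind after the node leaves the domain.
 */
int toy_handle_migrate(struct toy_domain *d)
{
    assert(d != NULL);

    if (d->state != TOY_DLM_JOINED)
        return -EINVAL;        /* sender has to pick another target */

    d->nr_lockres++;
    return 0;
}
```

The sender can then treat the error as a signal to choose a different migration target rather than assuming the resource has a new owner.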
     

26 Apr, 2018

3 commits

  • [ Upstream commit d984187e3a1ad7d12447a7ab2c43ce3717a2b5b3 ]

    We should not reuse the dirty bh in jbd2 directly, because of the following
    situation:

    1. When removing an extent rec, we dirty the bhs of the extent rec and
    the truncate log at the same time, and hand them over to jbd2.

    2. The bhs are submitted to the jbd2 area successfully.

    3. The write-back thread of the device helps flush the bhs to disk but
    encounters a write error due to an abnormal storage link.

    4. After a while the storage link becomes normal. The truncate log flush
    worker, triggered by the next space reclaim, finds the dirty bh of the
    truncate log, clears its 'BH_Write_EIO' and then sets it uptodate in
    __ocfs2_journal_access():

    ocfs2_truncate_log_worker
     ocfs2_flush_truncate_log
      __ocfs2_flush_truncate_log
       ocfs2_replay_truncate_records
        ocfs2_journal_access_di
         __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.

    5. Then jbd2 flushes the bh of the truncate log to disk, but the bh of
    the extent rec is still in an error state, and unfortunately nobody
    takes care of it.

    6. In the end the space of the extent rec was not reduced, but the truncate
    log flush worker has given it back to globalalloc. That causes a
    duplicate cluster problem, which can be identified by fsck.ocfs2.

    Sadly we can hardly revert this, so we set the fs read-only to avoid ruining
    the atomicity and consistency of space reclaim.

    Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
    Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
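
The policy in the entry above (do not quietly reuse a buffer whose earlier write failed, put the filesystem into read-only mode instead) can be sketched as follows. `struct toy_bh`, `struct toy_fs`, the `TOY_BH_*` flags and `toy_journal_access()` are hypothetical, and returning -EROFS is an assumption made for the illustration; the actual kernel change is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_BH_UPTODATE   (1u << 0)  /* buffer contents are valid */
#define TOY_BH_WRITE_EIO  (1u << 1)  /* an earlier write-back failed */

/* Hypothetical stand-in for a buffer head. */
struct toy_bh {
    unsigned int flags;
};

/* Hypothetical stand-in for the mounted filesystem. */
struct toy_fs {
    int read_only;
};

/*
 * Called before a buffer joins a new transaction. A buffer that carries a
 * write error and is not uptodate belongs to a failed earlier transaction;
 * clearing the error and reusing it would hide that other buffers from the
 * same transaction never reached disk.
 */
int toy_journal_access(struct toy_fs *fs, struct toy_bh *bh)
{
    assert(fs != NULL && bh != NULL);

    if ((bh->flags & TOY_BH_WRITE_EIO) && !(bh->flags & TOY_BH_UPTODATE)) {
        fs->read_only = 1;     /* stop further damage */
        return -EROFS;
    }

    return 0;
}
```

Note that the sketch leaves the error flag in place: the point is that the caller learns about the inconsistency instead of having it erased.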
     
  • [ Upstream commit 16c8d569f5704a84164f30ff01b29879f3438065 ]

    The race between *set_acl and *get_acl will cause getting incomplete
    xattr data as below:

    processA                                    processB

    ocfs2_set_acl
     ocfs2_xattr_set
      __ocfs2_xattr_set_handle

                                                ocfs2_get_acl_nolock
                                                 ocfs2_xattr_get_nolock:

    processB may get incomplete xattr data if processA hasn't set_acl done.

    So we should use 'ip_xattr_sem' to protect getting extended attribute in
    ocfs2_get_acl_nolock(), as other processes could be changing it
    concurrently.

    Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
     
  • [ Upstream commit 025bcbde3634b2c9b316f227fed13ad6ad6817fb ]

    If metadata is corrupted, such as an 'invalid inode block', 'mount()' will
    fail and the filesystem is then set read-only as below:

    ocfs2_mount
     ocfs2_initialize_super
      ocfs2_init_global_system_inodes
       ocfs2_iget
        ocfs2_read_locked_inode
         ocfs2_validate_inode_block
          ocfs2_error
           ocfs2_handle_error
            ocfs2_set_ro_flag(osb, 0); // set readonly

    In this situation we need to return -EROFS to 'mount.ocfs2', so that the user
    can fix it with fsck and then mount again. In addition, 'mount.ocfs2'
    should be updated correspondingly, as it only returns 1 for all errnos.
    I will post a patch for 'mount.ocfs2' too.

    Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Reviewed-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
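
The error mapping described in the entry above can be sketched as a small helper: if superblock setup failed and the failure already forced the filesystem read-only, report -EROFS so that the mount helper can tell the user to run fsck. `struct toy_super` and `toy_fill_super_status()` are hypothetical names, and where exactly the kernel performs this check is not stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory superblock state. */
struct toy_super {
    int hard_readonly;   /* set by the error handler on metadata corruption */
};

/*
 * Translate the status of superblock initialisation into the value that
 * mount(2) should report. init_status is 0 or a negative errno.
 */
int toy_fill_super_status(const struct toy_super *sb, int init_status)
{
    assert(sb != NULL);

    if (init_status < 0 && sb->hard_readonly)
        return -EROFS;   /* corruption was detected: tell the caller why */

    return init_status;
}
```

Other failures keep their original error code, so only the "went read-only during mount" case is singled out.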
     

22 Feb, 2018

1 commit

  • commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

    If we can't get the inode lock immediately in
    ocfs2_inode_lock_with_page() when reading a page, we should not return
    directly, since this leads to a softlockup problem when the kernel is
    built without CONFIG_PREEMPT. The fix is to take a blocking lock and
    immediately unlock it before returning. This avoids wasting CPU on
    repeated retries, improves fairness in getting the lock among multiple
    nodes, and increases efficiency when the same file is modified frequently
    from multiple nodes.

    The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
    looks like:

    Kernel panic - not syncing: softlockup: hung tasks
    CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:

    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0

    RIP: 0010:unlock_page+0x17/0x30
    RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
    RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
    RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
    RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
    R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
    R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
    RIP: 0033:0x7fa75ded638e
    RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
    RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
    RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
    RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
    R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
    R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

    As for the performance improvement, the test time is reduced and CPU
    utilization decreases; the detailed data follows. I ran the
    multi_mmap test case from the ocfs2-test package in a three-node cluster.

    Before applying this patch:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
    1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
    5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
    95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
    2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
    2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
    2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 14:44:52 CST 2017
    multi_mmap..................................................Passed.
    Runtime 783 seconds.

    After applying this patch:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
    155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
    95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
    2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
    5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
    2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
    299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
    335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
    535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
    1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 15:04:12 CST 2017
    multi_mmap..................................................Passed.
    Runtime 487 seconds.

    Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
    Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
    Signed-off-by: Gang He
    Reviewed-by: Eric Ren
    Acked-by: alex chen
    Acked-by: piaojun
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang He
     

24 Nov, 2017

2 commits

  • commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.

    We should wait for dio requests to finish before taking the inode lock in
    ocfs2_setattr(), otherwise the following deadlock will happen:

    process 1                   process 2                     process 3
    truncate file 'A'           end_io of writing file 'A'    receiving the bast messages
    ocfs2_setattr
     ocfs2_inode_lock_tracker
      ocfs2_inode_lock_full
     inode_dio_wait
      __inode_dio_wait
      -->waiting for all dio
      requests finish
                                                              dlm_proxy_ast_handler
                                                               dlm_do_local_bast
                                                                ocfs2_blocking_ast
                                                                 ocfs2_generic_handle_bast
                                                                  set OCFS2_LOCK_BLOCKED flag
                                dio_end_io
                                 dio_bio_end_aio
                                  dio_complete
                                   ocfs2_dio_end_io
                                    ocfs2_dio_end_io_write
                                     ocfs2_inode_lock
                                      __ocfs2_cluster_lock
                                       ocfs2_wait_for_mask
                                       -->waiting for OCFS2_LOCK_BLOCKED
                                       flag to be cleared, that is waiting
                                       for 'process 1' unlocking the inode lock
                                   inode_dio_end
                                   -->here dec the i_dio_count, but will never
                                   be called, so a deadlock happened.

    Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    alex chen
     
  • commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.

    When a node dies, the other live nodes have to choose a new master for each
    existing lock resource mastered by the dead node.

    In the ocfs2/dlm implementation, this is done by the function
    dlm_move_lockres_to_recovery_list, which marks those lock resources as
    DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
    changes the lock resource's master later.

    So without invoking dlm_move_lockres_to_recovery_list, no master will be
    chosen after dlm recovery completes, since no lock resource can be
    found through the ::resource list.

    What's worse, if DLM_LOCK_RES_RECOVERING is not marked for lock
    resources mastered by a dead node, synchronization among nodes will
    break.

    So invoke dlm_move_lockres_to_recovery_list again.

    Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
    Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com
    Signed-off-by: Changwei Ge
    Reported-by: Vitaly Mayatskikh
    Tested-by: Vitaly Mayatskikh
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Changwei Ge
     

03 Nov, 2017

1 commit

  • The first cluster group descriptor is not stored at the start of the
    group but at an offset from the start. We need to take this into
    account while doing fstrim on the first cluster group. Otherwise we
    will wrongly start fstrim a few blocks after the desired start block and
    the range can cross over into the next cluster group and zero out the
    group descriptor there. This can cause filesystem corruption that cannot
    be fixed by fsck.

    Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashish Samant
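
The offset adjustment described in the fstrim entry above comes down to one piece of arithmetic: the discard start is the cluster offset converted to blocks, and the group's block number is added only for groups other than the first. The helper below is a hedged sketch of that reading. `toy_discard_start()` and its parameters are hypothetical names, and the numbers in any call are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the first block to discard for a trim request.
 *
 * group_blkno        block number identifying the cluster group
 * first_group_blkno  block number identifying the first cluster group,
 *                    whose descriptor sits at an offset from the group start
 * start_cluster      first cluster to trim, relative to the group
 * blocks_per_cluster filesystem geometry
 */
uint64_t toy_discard_start(uint64_t group_blkno, uint64_t first_group_blkno,
                           uint32_t start_cluster, uint32_t blocks_per_cluster)
{
    uint64_t discard;

    assert(blocks_per_cluster > 0);

    discard = (uint64_t)start_cluster * blocks_per_cluster;

    /*
     * For the first group the descriptor block number is not the start of
     * the group, so adding it would shift the range a few blocks too far.
     */
    if (group_blkno != first_group_blkno)
        discard += group_blkno;

    return discard;
}
```

For example, with 8 blocks per cluster and a first group identified by block 32, trimming from cluster 4 of the first group starts at block 32 (4 * 8), not at block 64.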
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

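    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \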
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
     

08 Sep, 2017

2 commits

  • Pull quota scaling updates from Jan Kara:
    "This contains changes to make the quota subsystem more scalable.

    Reportedly it improves number of files created per second on ext4
    filesystem on fast storage by about a factor of 2x"

    * 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
    quota: Add lock annotations to struct members
    quota: Reduce contention on dq_data_lock
    fs: Provide __inode_get_bytes()
    quota: Inline dquot_[re]claim_reserved_space() into callsite
    quota: Inline inode_{incr,decr}_space() into callsites
    quota: Inline functions into their callsites
    ext4: Disable dirty list tracking of dquots when journalling quotas
    quota: Allow disabling tracking of dirty dquots in a list
    quota: Remove dq_wait_unused from dquot
    quota: Move locking into clear_dquot_dirty()
    quota: Do not dirty bad dquots
    quota: Fix possible corruption of dqi_flags
    quota: Propagate ->quota_read errors from v2_read_file_info()
    quota: Fix error codes in v2_read_file_info()
    quota: Push dqio_sem down to ->read_file_info()
    quota: Push dqio_sem down to ->write_file_info()
    quota: Push dqio_sem down to ->get_next_id()
    quota: Push dqio_sem down to ->release_dqblk()
    quota: Remove locking for writing to the old quota format
    quota: Do not acquire dqio_sem for dquot overwrites in v2 format
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton: (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     
  • The function is never called outside of fs/ocfs2/acl.c.

    Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
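
    The gendisk plus partition index scheme described above can be
    sketched in user-space C. This is only an illustration: the struct and
    function names (disk_sketch, bio_sketch, sketch_remap) are hypothetical
    stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

#define SKETCH_MAX_PARTS 4

/* Hypothetical stand-in for a gendisk: index 0 means the whole disk. */
struct disk_sketch {
    uint64_t part_start[SKETCH_MAX_PARTS]; /* start sector of each partition */
};

/* Hypothetical stand-in for a bio after the change. */
struct bio_sketch {
    struct disk_sketch *disk; /* replaces the block_device pointer */
    uint8_t partno;           /* partition index used for remapping */
    uint64_t sector;          /* sector relative to partition partno */
};

/*
 * Translate a partition-relative sector into a whole-disk sector and mark
 * the bio as addressing the whole disk, so it is not remapped twice.
 * Returns 0 on success, -1 for an out-of-range partition index.
 */
static int sketch_remap(struct bio_sketch *bio)
{
    if (bio->partno >= SKETCH_MAX_PARTS)
        return -1;
    if (bio->partno == 0)
        return 0; /* already whole-disk relative */
    bio->sector += bio->disk->part_start[bio->partno];
    bio->partno = 0;
    return 0;
}
```

    Resetting the partition index after remapping makes the operation
    idempotent, which matters when a bio passes through the submission
    path more than once.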

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Aug, 2017

5 commits

  • dq_data_lock is currently used to protect all modifications of quota
    accounting information, consistency of quota accounting on the inode,
    and dquot pointers from inode. As a result contention on the lock can be
    pretty heavy.

    Reduce the contention on the lock by protecting quota accounting
    information by a new dquot->dq_dqb_lock and consistency of quota
    accounting with inode usage by inode->i_lock.

    This change reduces the time needed for 50 different processes to create
    500000 files in separate directories on ext4 on a ramdisk by 6% when user
    quota is turned on. When those 50 processes belong to 50 different users,
    the improvement is about 9%.
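
    The idea of replacing one global lock with a lock per quota structure
    can be sketched in user-space C with pthreads. This is a minimal model
    under stated assumptions: dquot_sketch and its helpers are hypothetical
    names, and a pthread mutex stands in for the kernel spinlock.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical model of a dquot carrying its own accounting lock. */
struct dquot_sketch {
    pthread_mutex_t dqb_lock; /* plays the role of dquot->dq_dqb_lock */
    int64_t curspace;         /* bytes currently charged */
    int64_t limit;            /* 0 means no limit */
};

static void dquot_sketch_init(struct dquot_sketch *dq, int64_t limit)
{
    pthread_mutex_init(&dq->dqb_lock, NULL);
    dq->curspace = 0;
    dq->limit = limit;
}

/*
 * Charge bytes to one dquot. Only this dquot's lock is taken, so charges
 * against different dquots do not contend with each other.
 * Returns 0 on success, -1 if the charge would exceed the limit.
 */
static int dquot_sketch_charge(struct dquot_sketch *dq, int64_t bytes)
{
    int ret = 0;

    pthread_mutex_lock(&dq->dqb_lock);
    if (dq->limit && dq->curspace + bytes > dq->limit)
        ret = -1;
    else
        dq->curspace += bytes;
    pthread_mutex_unlock(&dq->dqb_lock);
    return ret;
}
```

    This is consistent with the larger gain reported when the processes
    belong to different users: with separate dquots they take separate
    locks.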

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Push down acquisition of dqio_sem into ->read_file_info() callback. This
    is for consistency with other operations and it also allows us to get
    rid of an ugliness in OCFS2.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Push down acquisition of dqio_sem into ->write_file_info() callback.
    Mostly for consistency with other operations.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • vfs_load_quota_inode() needs dqio_sem only for reading. In fact, dqio_sem
    is not needed there at all, since the function can be called only during
    quota on, when the quota file cannot be modified, but let's leave the
    protection there since it is logical and the path is in no way
    performance critical.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
    yet.

    Signed-off-by: Jan Kara

    Jan Kara
     

03 Aug, 2017

1 commit

  • When a new directory 'DIR1' is created in a directory 'DIR0' with the SGID
    bit set, 'DIR1' is expected to have the SGID bit set (and owning group
    equal to the owning group of 'DIR0'). However, when 'DIR0' also has some
    default ACLs that 'DIR1' inherits, setting these ACLs will result in the
    SGID bit on 'DIR1' being cleared if the user is not a member of the
    owning group.

    Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
    into ocfs2_iop_set_acl(). That way the function will not be called when
    inheriting ACLs which is what we want as it prevents SGID bit clearing
    and the mode has been properly set by posix_acl_create() anyway. Also
    posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
    mode itself.

    Fixes: 073931017b4 ("posix_acl: Clear SGID bit when setting file permissions")
    Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

01 Aug, 2017

1 commit

  • This patch converts most of the in-kernel filesystems that do writeback
    out of the pagecache to report errors using the errseq_t-based
    infrastructure that was recently added. This allows them to report
    errors once for each open file description.

    Most filesystems have a fairly straightforward fsync operation. They
    call filemap_write_and_wait_range to write back all of the data and
    wait on it, and then (sometimes) sync out the metadata.

    For those filesystems this is a straightforward conversion from calling
    filemap_write_and_wait_range in their fsync operation to calling
    file_write_and_wait_range.
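
    The "report errors once for each open file description" behaviour can
    be illustrated with a deliberately simplified model in user-space C.
    This is not the kernel's errseq_t encoding: mapping_sketch, file_sketch
    and the helpers below are hypothetical, and the model ignores
    concurrency.

```c
/* Hypothetical model: the mapping records the latest error and a counter. */
struct mapping_sketch {
    unsigned int seq; /* bumped every time an error is recorded */
    int err;          /* most recent writeback error (negative errno) */
};

/* Each open file description remembers the counter value it last saw. */
struct file_sketch {
    unsigned int seen;
};

static void sketch_open(struct file_sketch *f, const struct mapping_sketch *m)
{
    f->seen = m->seq; /* sample the current state at open time */
}

static void sketch_set_err(struct mapping_sketch *m, int err)
{
    m->err = err;
    m->seq++;
}

/*
 * Called from an fsync-like path: report the error if this file has not
 * seen it yet, then advance the file's cursor so it is reported only once.
 */
static int sketch_check_and_advance(struct file_sketch *f,
                                    const struct mapping_sketch *m)
{
    if (f->seen == m->seq)
        return 0;
    f->seen = m->seq;
    return m->err;
}
```

    Because the cursor lives in each open file description rather than in
    the mapping, one caller checking for the error does not hide it from
    another.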

    Acked-by: Jan Kara
    Acked-by: Dave Kleikamp
    Signed-off-by: Jeff Layton

    Jeff Layton
     

17 Jul, 2017

1 commit

  • The conversion is done firstly by applying the following with
    coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.
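
    Why the (bool) cast matters when comparing a masked flag word against
    a bool can be shown in user-space C. SKETCH_FLAG is a hypothetical flag
    deliberately chosen to be different from 1; in Linux MS_RDONLY is
    defined as 1, so there the cast is a defensive measure rather than a
    behaviour change.

```c
#include <stdbool.h>

/* Hypothetical flag bit, chosen != 1 to expose the pitfall. */
#define SKETCH_FLAG 0x4UL

/* Analogue of sb_rdonly(): returns a bool, i.e. exactly 0 or 1. */
static bool sketch_is_set(unsigned long flags)
{
    return flags & SKETCH_FLAG;
}

/* Compares the raw masked value (0 or 4) with a bool (0 or 1). */
static bool naive_same(unsigned long a, unsigned long b)
{
    return (a & SKETCH_FLAG) == sketch_is_set(b);
}

/* Normalises the masked value to 0 or 1 before comparing. */
static bool cast_same(unsigned long a, unsigned long b)
{
    return (bool)(a & SKETCH_FLAG) == sketch_is_set(b);
}
```

    With both arguments set to SKETCH_FLAG, naive_same() compares 4 with 1
    and returns false, while cast_same() compares 1 with 1 and returns
    true.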

    Signed-off-by: David Howells

    David Howells
     

07 Jul, 2017

4 commits

  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups work with const attribute_group, so mark
    the non-const structs as const.

    File size before:
       text    data     bss     dec     hex filename
       4402    1088      38    5528    1598 fs/ocfs2/stackglue.o

    File size after adding 'const':
       text    data     bss     dec     hex filename
       4442    1024      38    5504    1580 fs/ocfs2/stackglue.o

    Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     
  • 'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end
    of sc_fop_release().

    Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     
  • Filesystems generally use SUPER_MAGIC values from magic.h instead of a
    local definition.

    Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Fix a static code checker warning:

    fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR'
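
    The reason a checker flags passing zero to ERR_PTR() can be seen from
    a user-space re-creation of the error-pointer convention. The helpers
    below are hypothetical stand-ins written for illustration; they mirror
    the usual convention that the top 4095 addresses encode negative errno
    values.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* A pointer is an error if it falls in the last SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

    Under this convention sketch_err_ptr(0) is simply NULL, which
    sketch_is_err() does not treat as an error, so a caller that only
    checks for error pointers would go on to dereference NULL.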

    Fixes: d56a8f32e4c6 ("ocfs2: check/fix inode block for online file check")
    Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

04 Jul, 2017

2 commits

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particularly around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     
  • Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started
    consolidating our uuid/guid helpers and improving the types used for
    them. Note that various other subsystems have pulled in this tree, so
    I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the
    somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)

    - consolidated generic uuid/guid helper functions lifted from XFS and
    libnvdimm (Amir and me)

    - conversions to the new types and helpers (Amir, Andy and me)"

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
    ACPI: hns_dsaf_acpi_dsm_guid can be static
    mmc: sdhci-pci: make guid intel_dsm_guid static
    uuid: Take const on input of uuid_is_null() and guid_is_null()
    thermal: int340x_thermal: fix compile after the UUID API switch
    thermal: int340x_thermal: Switch to use new generic UUID API
    acpi: always include uuid.h
    ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
    ACPI / extlog: Switch to use new generic UUID API
    ACPI / bus: Switch to use new generic UUID API
    ACPI / APEI: Switch to use new generic UUID API
    acpi, nfit: Switch to use new generic UUID API
    MAINTAINERS: add uuid entry
    tmpfs: generate random sb->s_uuid
    scsi_debug: switch to uuid_t
    nvme: switch to uuid_t
    sysctl: switch to use uuid_t
    partitions/ldm: switch to use uuid_t
    overlayfs: use uuid_t instead of uuid_be
    fs: switch ->s_uuid to uuid_t
    ima/policy: switch to use uuid_t
    ...

    Linus Torvalds
     

24 Jun, 2017

1 commit

  • Another deadlock path caused by recursive locking has been reported. This
    kind of issue was introduced by commit 743b5f1434f5 ("ocfs2: take
    inode lock in ocfs2_iop_set/get_acl()"). Two deadlock paths have been
    fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
    inode lock at vfs entry points"). We intend to fix this kind of case
    incrementally, because it is hard to find all possible paths at once.

    This one can be reproduced like this. On node1, cp a large file from
    home directory to ocfs2 mountpoint. While on node2, run
    setfacl/getfacl. Both nodes will hang up there. The backtraces:

    On node1:
    __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
    ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
    ocfs2_write_begin+0x43/0x1a0 [ocfs2]
    generic_perform_write+0xa9/0x180
    __generic_file_write_iter+0x1aa/0x1d0
    ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
    __vfs_write+0xc3/0x130
    vfs_write+0xb1/0x1a0
    SyS_write+0x46/0xa0

    On node2:
    __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
    ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
    ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
    ocfs2_set_acl+0x22d/0x260 [ocfs2]
    ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
    set_posix_acl+0x75/0xb0
    posix_acl_xattr_set+0x49/0xa0
    __vfs_setxattr+0x69/0x80
    __vfs_setxattr_noperm+0x72/0x1a0
    vfs_setxattr+0xa7/0xb0
    setxattr+0x12d/0x190
    path_setxattr+0x9f/0xb0
    SyS_setxattr+0x14/0x20

    Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
    exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
    to avoid recursive cluster lock").
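
    The tracking idea behind the fix can be sketched in user-space C: the
    lock call reports whether the caller already held the lock, and the
    matching unlock releases it only when that call actually took it. This
    is a simplified, single-threaded model with hypothetical names
    (res_sketch, sketch_lock_tracker, sketch_unlock_tracker); it does not
    reproduce the signatures of the ocfs2 helpers.

```c
/* Hypothetical model of a lock resource that remembers its holder. */
struct res_sketch {
    int locked;       /* 1 while some caller holds the lock */
    int holder;       /* id of the holder, valid only when locked */
    int acquisitions; /* how many times the lock was really taken */
};

/*
 * Returns 1 if the caller already holds the lock (nothing is taken),
 * 0 if the lock was newly taken, and -1 if another caller holds it
 * (a real cluster lock would block here instead).
 */
static int sketch_lock_tracker(struct res_sketch *r, int caller)
{
    if (r->locked && r->holder == caller)
        return 1;
    if (r->locked)
        return -1;
    r->locked = 1;
    r->holder = caller;
    r->acquisitions++;
    return 0;
}

/* Release only if the matching lock call actually took the lock. */
static void sketch_unlock_tracker(struct res_sketch *r, int had_lock)
{
    if (had_lock == 0)
        r->locked = 0;
}
```

    A nested call on the same resource then becomes a no-op pair instead
    of a second lock request that can never be granted.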

    Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
    Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
    Signed-off-by: Eric Ren
    Reported-by: Thomas Voegtle
    Tested-by: Thomas Voegtle
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Ren
     

13 Jun, 2017

1 commit


12 Jun, 2017

1 commit

  • We've already got a few conflicts and upcoming work depends on some of the
    changes that have gone into mainline as regression fixes for this series.

    Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
    trees to continue working on 4.13 changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig