26 Jan, 2019

1 commit

  • [ Upstream commit 532e1e54c8140188e192348c790317921cb2dc1c ]

    mount.ocfs2 ignores the inconsistency in which the journal is clean but
    the local alloc is unrecovered. After mount, the local alloc is not
    empty, so reserving clusters does not allocate a new local alloc window
    and the reservation map stays empty
    (ocfs2_reservation_map.m_bitmap_len = 0), which triggers the panic
    shown below.

    This issue was reported at

    https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html

    and the advice was to fix it during mount. But this is a very unusual
    inconsistent state: the journal dirty flag should normally be cleared
    only at the last stage of umount, after everything else has gone right.
    We may need further debugging to check that. In any case, to avoid
    possible further corruption, the mount should be aborted and fsck
    should be run.

    (mount.ocfs2,1765,1):ocfs2_load_local_alloc:353 ERROR: Local alloc hasn't been recovered!
    found = 6518, set = 6518, taken = 8192, off = 15912372
    ocfs2: Mounting device (202,64) on (node 0, slot 3) with ordered data mode.
    o2dlm: Joining domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 8 ) 8 nodes
    ocfs2: Mounting device (202,80) on (node 0, slot 3) with ordered data mode.
    o2hb: Region 89CEAC63CC4F4D03AC185B44E0EE0F3F (xvdf) is now a quorum device
    o2net: Accepted connection from node yvwsoa17p (num 7) at 172.22.77.88:7777
    o2dlm: Node 7 joins domain 64FE421C8C984E6D96ED12C55FEE2435 ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
    o2dlm: Node 7 joins domain 89CEAC63CC4F4D03AC185B44E0EE0F3F ( 0 1 2 3 4 5 6 7 8 ) 9 nodes
    ------------[ cut here ]------------
    kernel BUG at fs/ocfs2/reservations.c:507!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ocfs2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ovmapi ppdev parport_pc parport xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea acpi_cpufreq pcspkr i2c_piix4 i2c_core sg ext4 jbd2 mbcache2 sr_mod cdrom xen_blkfront pata_acpi ata_generic ata_piix floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 4349 Comm: startWebLogic.s Not tainted 4.1.12-124.19.2.el6uek.x86_64 #2
    Hardware name: Xen HVM domU, BIOS 4.4.4OVM 09/06/2018
    task: ffff8803fb04e200 ti: ffff8800ea4d8000 task.ti: ffff8800ea4d8000
    RIP: 0010:[] [] __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
    Call Trace:
    ocfs2_resmap_resv_bits+0x10d/0x400 [ocfs2]
    ocfs2_claim_local_alloc_bits+0xd0/0x640 [ocfs2]
    __ocfs2_claim_clusters+0x178/0x360 [ocfs2]
    ocfs2_claim_clusters+0x1f/0x30 [ocfs2]
    ocfs2_convert_inline_data_to_extents+0x634/0xa60 [ocfs2]
    ocfs2_write_begin_nolock+0x1c6/0x1da0 [ocfs2]
    ocfs2_write_begin+0x13e/0x230 [ocfs2]
    generic_perform_write+0xbf/0x1c0
    __generic_file_write_iter+0x19c/0x1d0
    ocfs2_file_write_iter+0x589/0x1360 [ocfs2]
    __vfs_write+0xb8/0x110
    vfs_write+0xa9/0x1b0
    SyS_write+0x46/0xb0
    system_call_fastpath+0x18/0xd7
    Code: ff ff 8b 75 b8 39 75 b0 8b 45 c8 89 45 98 0f 84 e5 fe ff ff 45 8b 74 24 18 41 8b 54 24 1c e9 56 fc ff ff 85 c0 0f 85 48 ff ff ff 0b 48 8b 05 cf c3 de ff 48 ba 00 00 00 00 00 00 00 10 48 85
    RIP __ocfs2_resv_find_window+0x498/0x760 [ocfs2]
    RSP
    ---[ end trace 566f07529f2edf3c ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

    Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Yiwen Jiang
    Acked-by: Joseph Qi
    Cc: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Junxiao Bi
     

17 Dec, 2018

2 commits

  • [ Upstream commit 164f7e586739d07eb56af6f6d66acebb11f315c8 ]

    ocfs2_get_dentry() calls iput(inode) to drop the reference count of
    inode, and if the reference count hits 0, inode is freed. However, in
    this function, it then reads inode->i_generation, which may result in a
    use after free bug. Move the put operation later.

    Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
    Fixes: 781f200cb7a ("ocfs2: Remove masklog ML_EXPORT.")
    Signed-off-by: Pan Bian
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Pan Bian
     
  • [ Upstream commit e21e57445a64598b29a6f629688f9b9a39e7242a ]

    ocfs2_defrag_extent may fall into deadlock.

    ocfs2_ioctl_move_extents
      ocfs2_move_extents
        ocfs2_defrag_extent
          ocfs2_lock_allocators_move_extents
            ocfs2_reserve_clusters
              inode_lock GLOBAL_BITMAP_SYSTEM_INODE
          __ocfs2_flush_truncate_log
            inode_lock GLOBAL_BITMAP_SYSTEM_INODE

    As the backtrace above shows, ocfs2_reserve_clusters() will call
    inode_lock against the global bitmap if the local allocator does not
    have sufficient clusters. Once the global bitmap can meet the demand,
    ocfs2_reserve_clusters() returns success with the global bitmap locked.

    After ocfs2_reserve_clusters(), if the truncate log is full,
    __ocfs2_flush_truncate_log() will definitely deadlock because it needs
    to inode_lock the global bitmap, which has already been locked.

    To fix this bug, we can remove from
    ocfs2_lock_allocators_move_extents() the code that locks the global
    allocator, and put the removed code after
    __ocfs2_flush_truncate_log().

    ocfs2_lock_allocators_move_extents() is called from two places: one is
    here, and the other does not need the data allocator context, so this
    patch does not affect the other caller.

    Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com
    Signed-off-by: Larry Chen
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Larry Chen
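
The lock-ordering problem described in this commit can be sketched in user space. This is only a toy model under stated assumptions: every `toy_*` name is hypothetical, the lock is a plain flag rather than a kernel inode lock, and it reports -EDEADLK at the point where the real code would block forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the global bitmap inode lock: non-recursive, and it
 * returns -EDEADLK where a real lock would hang. */
struct toy_lock { bool held; };

static int toy_lock_acquire(struct toy_lock *l)
{
    if (l->held)
        return -EDEADLK;
    l->held = true;
    return 0;
}

static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Hypothetical flush step: needs the bitmap lock for its own duration. */
static int toy_flush_truncate_log(struct toy_lock *bitmap)
{
    int ret = toy_lock_acquire(bitmap);
    if (ret)
        return ret;
    /* ... return freed clusters to the bitmap ... */
    toy_lock_release(bitmap);
    return 0;
}

/* Old order: reserve clusters (takes the lock), then flush (wants it again). */
static int toy_defrag_old_order(struct toy_lock *bitmap)
{
    int ret = toy_lock_acquire(bitmap);
    if (ret)
        return ret;
    ret = toy_flush_truncate_log(bitmap); /* self-deadlock in the real code */
    toy_lock_release(bitmap);
    return ret;
}

/* New order: flush first, then take the lock for the reservation. */
static int toy_defrag_new_order(struct toy_lock *bitmap)
{
    int ret = toy_flush_truncate_log(bitmap);
    if (ret)
        return ret;
    ret = toy_lock_acquire(bitmap);
    if (ret)
        return ret;
    /* ... move extents ... */
    toy_lock_release(bitmap);
    return 0;
}
```

Calling `toy_defrag_old_order()` on a free lock yields -EDEADLK, while `toy_defrag_new_order()` succeeds and leaves the lock released.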
     

21 Nov, 2018

2 commits

  • commit 5040f8df56fb90c7919f1c9b0b6e54c843437456 upstream.

    The write context should also be freed even when direct IO failed.
    Otherwise a memory leak is introduced and entries remain in
    oi->ip_unwritten_list causing the following BUG later in unlink path:

    ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
    ERROR: Clear inode of 215043, inode has unwritten extents
    ...
    Call Trace:
    ? __set_current_blocked+0x42/0x68
    ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
    ? bit_waitqueue+0x40/0x33
    evict+0xdb/0x1af
    iput+0x1a2/0x1f7
    do_unlinkat+0x194/0x28f
    SyS_unlinkat+0x1b/0x2f
    do_syscall_64+0x79/0x1ae
    entry_SYSCALL_64_after_hwframe+0x151/0x0

    This patch also logs direct IO failures, with rate limiting.

    Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
    Signed-off-by: Wengang Wang
    Reviewed-by: Junxiao Bi
    Reviewed-by: Changwei Ge
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wengang Wang
     
  • commit 29aa30167a0a2e6045a0d6d2e89d8168132333d5 upstream.

    Somehow, file system metadata was corrupted, which causes
    ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().

    According to the original design intention, if the above happens we
    should skip the problematic block and continue to retrieve dir entries.
    But there is an obvious misuse of brelse around the related code.

    After ocfs2_check_dir_entry() fails, the current code just moves to the
    next position and uses the problematic buffer head again and again,
    during which the problematic buffer head is released multiple times.
    I suppose this is a serious, long-lived issue in ocfs2. It may also
    destabilize other file systems in use on the same host.

    So we should also consider backporting this patch to linux-stable.

    Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
    Signed-off-by: Changwei Ge
    Suggested-by: Changkuo Shi
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Changwei Ge
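
The intended behavior (skip the bad block, release its buffer exactly once) can be sketched as follows. This is a user-space illustration, not the ocfs2 code: `toy_buf`, `toy_get`, `toy_brelse` and `toy_scan_dir` are hypothetical names, and the reference count is a plain integer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy buffer with a reference count; toy_brelse() only mirrors the idea
 * of dropping one reference, it is not the kernel's brelse(). */
struct toy_buf { int refs; int bad; };

static void toy_get(struct toy_buf *b) { b->refs++; }
static void toy_brelse(struct toy_buf *b) { b->refs--; }

/* Walk nblocks blocks; a block flagged bad fails its entry check.
 * Returns the number of blocks that passed the check. */
static int toy_scan_dir(struct toy_buf *blocks, size_t nblocks)
{
    int good = 0;

    for (size_t i = 0; i < nblocks; i++) {
        struct toy_buf *bh = &blocks[i];

        toy_get(bh);            /* one reference per block visit */
        if (bh->bad) {
            toy_brelse(bh);     /* drop it once ... */
            continue;           /* ... and move on to the next block */
        }
        good++;
        toy_brelse(bh);
    }
    return good;
}
```

After a scan, every buffer's count should be back at zero, including the one that failed the check.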
     

04 Nov, 2018

1 commit

  • [ Upstream commit 69eb7765b9c6902444c89c54e7043242faf981e5 ]

    ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
    is dirty. When a page has not been written back, it is still in dirty
    state. If ocfs2_duplicate_clusters_by_page() is called against the dirty
    page, the crash happens.

    To fix this bug, we can just unlock the page and wait until it is no
    longer dirty.

    The following is the backtrace:

    kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
    [exception RIP: ocfs2_duplicate_clusters_by_page+822]
    __ocfs2_move_extent+0x80/0x450 [ocfs2]
    ? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
    ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
    __ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
    ocfs2_move_extents+0x180/0x3b0 [ocfs2]
    ? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
    ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
    ocfs2_ioctl+0x253/0x640 [ocfs2]
    do_vfs_ioctl+0x90/0x5f0
    SyS_ioctl+0x74/0x80
    do_syscall_64+0x74/0x140
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    Once we find the page is dirty, we do not wait until it's clean; rather,
    we use write_one_page() to write it back.

    Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
    [lchen@suse.com: update comments]
    Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Larry Chen
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Sasha Levin

    Larry Chen
     

10 Oct, 2018

1 commit

  • commit cbe355f57c8074bc4f452e5b6e35509044c6fa23 upstream.

    In dlm_init_lockres() we access and modify res->tracking and
    dlm->tracking_list without holding dlm->track_lock. This can cause list
    corruptions and can end up in kernel panic.

    Fix this by locking res->tracking and dlm->tracking_list with
    dlm->track_lock instead of dlm->spinlock.

    Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Changwei Ge
    Acked-by: Joseph Qi
    Acked-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Ashish Samant
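
A rough way to picture "the list is guarded by the wrong lock" is a lockdep-style check in user space. The sketch below assumes nothing about the real dlm structures: the `toy_*` names are hypothetical, the locks are boolean flags, and the code only counts list updates made without the tracking lock held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_res { struct toy_res *next; };

struct toy_dlm {
    bool spinlock_held;      /* stands in for dlm->spinlock */
    bool track_lock_held;    /* stands in for dlm->track_lock */
    struct toy_res *tracking_list;
    int unprotected_updates; /* list updates made without track_lock */
};

/* Push res onto the tracking list and record whether the right lock is held. */
static void toy_track(struct toy_dlm *dlm, struct toy_res *res)
{
    if (!dlm->track_lock_held)
        dlm->unprotected_updates++;
    res->next = dlm->tracking_list;
    dlm->tracking_list = res;
}

/* Before: the list is updated under a lock that does not guard it. */
static void toy_init_lockres_old(struct toy_dlm *dlm, struct toy_res *res)
{
    dlm->spinlock_held = true;
    toy_track(dlm, res);
    dlm->spinlock_held = false;
}

/* After: the list is updated under the lock that guards it. */
static void toy_init_lockres_new(struct toy_dlm *dlm, struct toy_res *res)
{
    dlm->track_lock_held = true;
    toy_track(dlm, res);
    dlm->track_lock_held = false;
}
```

Only the old variant increments `unprotected_updates`.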
     

29 Sep, 2018

1 commit

  • commit 234b69e3e089d850a98e7b3145bd00e9b52b1111 upstream.

    While reading a block, an io error may be returned due to an underlying
    storage issue. In this case, BH_NeedsValidate is left set in the buffer
    head. When the same block is read the next time, if it has already been
    linked into the journal, that will trigger the following panic.

    [203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
    [203748.702533] invalid opcode: 0000 [#1] SMP
    [203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
    [203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
    [203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
    [203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
    [203748.703088] RIP: 0010:[] [] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
    [203748.703130] RSP: 0018:ffff88006ff4b818 EFLAGS: 00010206
    [203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
    [203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
    [203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
    [203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
    [203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
    [203748.705871] FS: 00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
    [203748.706370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
    [203748.707124] Stack:
    [203748.707371] ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
    [203748.707885] 0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
    [203748.708399] 00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
    [203748.708915] Call Trace:
    [203748.709175] [] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
    [203748.709680] [] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
    [203748.710185] [] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
    [203748.710691] [] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
    [203748.711204] [] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
    [203748.711716] [] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
    [203748.712227] [] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
    [203748.712737] [] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
    [203748.713003] [] ocfs2_create+0x65/0x170 [ocfs2]
    [203748.713263] [] vfs_create+0xdb/0x150
    [203748.713518] [] do_last+0x815/0x1210
    [203748.713772] [] ? path_init+0xb9/0x450
    [203748.714123] [] path_openat+0x80/0x600
    [203748.714378] [] ? handle_pte_fault+0xd15/0x1620
    [203748.714634] [] do_filp_open+0x3a/0xb0
    [203748.714888] [] ? __alloc_fd+0xa7/0x130
    [203748.715143] [] do_sys_open+0x12c/0x220
    [203748.715403] [] ? syscall_trace_enter_phase1+0x11b/0x180
    [203748.715668] [] ? system_call_after_swapgs+0xe9/0x190
    [203748.715928] [] SyS_open+0x1e/0x20
    [203748.716184] [] system_call_fastpath+0x18/0xd7
    [203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
    [203748.717505] RIP [] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
    [203748.717775] RSP

    Joseph once reported a similar panic.
    Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html

    Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
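
The commit message describes the stale flag but not the remedy, so the following is only one plausible shape of a fix, assumed for illustration: clear the "needs validate" marker on the error path. `toy_bh` and `toy_read_block` are hypothetical names and the io result is simulated by a parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy buffer head with just the two flags this sketch needs. */
struct toy_bh { bool needs_validate; bool uptodate; };

/* Hypothetical read: io_ok says whether the simulated device succeeds. */
static int toy_read_block(struct toy_bh *bh, bool io_ok)
{
    bh->needs_validate = true;      /* set before submitting the read */

    if (!io_ok) {
        /* Assumed remedy: do not leave the flag behind on failure. */
        bh->needs_validate = false;
        return -EIO;
    }

    bh->uptodate = true;
    /* ... validate the block contents here ... */
    bh->needs_validate = false;
    return 0;
}
```

With this ordering, a later read of the same buffer never sees a validation flag left over from a failed attempt.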
     

22 Jul, 2018

2 commits

  • commit 3e4c56d41eef5595035872a2ec5a483f42e8917f upstream.

    ip_alloc_sem should be taken in ocfs2_get_block() when reading a file in
    DIRECT mode to prevent concurrent access to the extent tree with
    ocfs2_dio_end_io_write(), which may trigger a BUG_ON in the following
    situation:

    read file 'A'                              end_io of writing file 'A'
    vfs_read
     __vfs_read
      ocfs2_file_read_iter
       generic_file_read_iter
        ocfs2_direct_IO
         __blockdev_direct_IO
          do_blockdev_direct_IO
           do_direct_IO
            get_more_blocks
             ocfs2_get_block
              ocfs2_extent_map_get_blocks
               ocfs2_get_clusters
                ocfs2_get_clusters_nocache()
                 ocfs2_search_extent_list
                  return the index of record which
                  contains the v_cluster, that is
                  v_cluster > rec[i]->e_cpos.
                                               ocfs2_dio_end_io
                                                ocfs2_dio_end_io_write
                                                 down_write(&oi->ip_alloc_sem);
                                                 ocfs2_mark_extent_written
                                                  ocfs2_change_extent_flag
                                                   ocfs2_split_extent
                                                    ...
                                                 --> modify the rec[i]->e_cpos, resulting
                                                     in v_cluster < rec[i]->e_cpos.
                 BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))

    [alex.chen@huawei.com: v3]
    Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
    Fixes: c15471f79506 ("ocfs2: fix sparse file & data ordering issue in direct io")
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Reviewed-by: Gang He
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Salvatore Bonaccorso
    Signed-off-by: Greg Kroah-Hartman

    alex chen
     
  • commit 853bc26a7ea39e354b9f8889ae7ad1492ffa28d2 upstream.

    The subsystem.su_mutex is required while accessing the item->ci_parent,
    otherwise, NULL pointer dereference to the item->ci_parent will be
    triggered in the following situation:

    add node                     delete node
    sys_write
     vfs_write
      configfs_write_file
       o2nm_node_store
        o2nm_node_local_write
                                 do_rmdir
                                  vfs_rmdir
                                   configfs_rmdir
                                    mutex_lock(&subsys->su_mutex);
                                    unlink_obj
                                     item->ci_group = NULL;
                                     item->ci_parent = NULL;
         to_o2nm_cluster_from_node
          node->nd_item.ci_parent->ci_parent
          BUG because of NULL pointer dereference of nd_item.ci_parent

    Moreover, the o2nm_cluster also should be protected by the
    subsystem.su_mutex.

    [alex.chen@huawei.com: v2]
    Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com
    Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Salvatore Bonaccorso
    Signed-off-by: Greg Kroah-Hartman

    alex chen
     

21 Jun, 2018

1 commit

  • [ Upstream commit e4383029201470523c3ffe339bd7d57e9b4a7d65 ]

    While reflinking an inode, we create a new inode in orphan directory,
    then take EX lock on it, reflink the original inode to orphan inode and
    release EX lock. Once the lock is released another node could request
    it in EX mode from ocfs2_recover_orphans() which causes downconvert of
    the lock, on this node, to NL mode.

    Later we attempt to initialize security acl for the orphan inode and
    move it to the reflink destination. However, while doing this we don't
    take EX lock on the inode. This could potentially cause problems
    because we could be starting transaction, accessing journal and
    modifying metadata of the inode while holding NL lock and with another
    node holding EX lock on the inode.

    Fix this by taking orphan inode cluster lock in EX mode before
    initializing security and moving orphan inode to reflink destination.
    Use the __tracker variant while taking inode lock to avoid recursive
    locking in the ocfs2_init_security_and_acl() call chain.

    Link: http://lkml.kernel.org/r/1523475107-7639-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Joseph Qi
    Reviewed-by: Junxiao Bi
    Acked-by: Jun Piao
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ashish Samant
     

30 May, 2018

1 commit

  • [ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]

    We should not handle migrate lockres if we are already in
    'DLM_CTXT_IN_SHUTDOWN', as that will leave lockres behind after we
    leave the dlm domain. Other nodes will then get stuck in an infinite
    loop when requesting locks from us.

    The problem is caused by concurrent umount between nodes. Before
    receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked N1 as the
    migrate target. So N2 will continue sending lockres to N1 even though
    N1 has left the domain.

            N1                             N2 (owner)
                                           touch file

    access the file,
    and get pr lock

                                           begin leave domain and
                                           pick up N1 as new owner

    begin leave domain and
    migrate all lockres done

                                           begin migrate lockres to N1

    end leave domain, but
    the lockres left
    unexpectedly, because
    migrate task has passed

    [piaojun@huawei.com: v3]
    Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
    Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jun Piao
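
The rule "do not accept migrated lockres once shutdown has begun" can be sketched as a small state check. This is a user-space toy, not the dlm handler: the `toy_*` names, the state enum and the -EINVAL return value are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

enum toy_dlm_state { TOY_JOINED, TOY_IN_SHUTDOWN };

struct toy_domain {
    enum toy_dlm_state state;
    int lockres_count;  /* lock resources this node currently holds */
};

/* Begin leaving the domain: from here on the node hands its lockres away. */
static void toy_begin_shutdown(struct toy_domain *d)
{
    d->state = TOY_IN_SHUTDOWN;
    d->lockres_count = 0;   /* pretend all local lockres were migrated out */
}

/* Handle an incoming migration request from another node. */
static int toy_handle_migrate(struct toy_domain *d)
{
    if (d->state == TOY_IN_SHUTDOWN)
        return -EINVAL;     /* hypothetical error code: refuse the lockres */
    d->lockres_count++;
    return 0;
}
```

A request that arrives after `toy_begin_shutdown()` is refused, so the node does not end up holding a lockres after it has left.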
     

26 Apr, 2018

3 commits

  • [ Upstream commit d984187e3a1ad7d12447a7ab2c43ce3717a2b5b3 ]

    We should not reuse the dirty bh in jbd2 directly due to the following
    situation:

    1. When removing extent rec, we will dirty the bhs of extent rec and
    truncate log at the same time, and hand them over to jbd2.

    2. The bhs are submitted to jbd2 area successfully.

    3. The write-back thread of the device helps flush the bhs to disk but
    encounters a write error due to an abnormal storage link.

    4. After a while the storage link becomes normal. The truncate log
    flush worker, triggered by the next space reclaim, finds the dirty bh
    of the truncate log, clears its 'BH_Write_EIO' and then sets it
    uptodate in __ocfs2_journal_access():

    ocfs2_truncate_log_worker
      ocfs2_flush_truncate_log
        __ocfs2_flush_truncate_log
          ocfs2_replay_truncate_records
            ocfs2_journal_access_di
              __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.

    5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
    extent rec is still in error state, and unfortunately nobody will
    take care of it.

    6. In the end the space of the extent rec is not reduced, but the
    truncate log flush worker has given it back to globalalloc. That
    causes a duplicate cluster problem, which can be identified by
    fsck.ocfs2.

    Sadly, we can hardly revert this, so we set the fs read-only to avoid
    ruining the atomicity and consistency of space reclaim.

    Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
    Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
    Signed-off-by: Jun Piao
    Reviewed-by: Yiwen Jiang
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
     
  • [ Upstream commit 16c8d569f5704a84164f30ff01b29879f3438065 ]

    The race between *set_acl and *get_acl can return incomplete xattr
    data, as shown below:

    processA                                    processB

    ocfs2_set_acl
      ocfs2_xattr_set
        __ocfs2_xattr_set_handle

                                                ocfs2_get_acl_nolock
                                                  ocfs2_xattr_get_nolock:

    processB may get incomplete xattr data if processA has not finished
    set_acl.

    So we should use 'ip_xattr_sem' to protect getting the extended
    attribute in ocfs2_get_acl_nolock(), as other processes could be
    changing it concurrently.

    Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
     
  • [ Upstream commit 025bcbde3634b2c9b316f227fed13ad6ad6817fb ]

    If metadata is corrupted, for example with an 'invalid inode block',
    'mount()' will fail and the filesystem is then set read-only, as below:

    ocfs2_mount
      ocfs2_initialize_super
        ocfs2_init_global_system_inodes
          ocfs2_iget
            ocfs2_read_locked_inode
              ocfs2_validate_inode_block
                ocfs2_error
                  ocfs2_handle_error
                    ocfs2_set_ro_flag(osb, 0); // set readonly

    In this situation we need to return -EROFS to 'mount.ocfs2', so that
    the user can fix it with fsck and then mount again. In addition,
    'mount.ocfs2' should be updated correspondingly, as it only returns 1
    for every errno. I will post a patch for 'mount.ocfs2' too.

    Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Reviewed-by: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    piaojun
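
The error mapping described in this commit can be sketched as follows. This is a minimal user-space illustration: `toy_super`, `toy_validate_inode` and `toy_mount` are hypothetical, and the choice of -EIO for the underlying failure is an assumption; only the final -EROFS comes from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_super { bool read_only; };

/* Hypothetical validation step: on corruption it flags the filesystem
 * read-only and reports a generic error (assumed to be -EIO here). */
static int toy_validate_inode(struct toy_super *sb, bool corrupted)
{
    if (corrupted) {
        sb->read_only = true;
        return -EIO;
    }
    return 0;
}

/* Mount path: if the failure left the filesystem read-only, report
 * -EROFS so the caller can tell that fsck is needed. */
static int toy_mount(struct toy_super *sb, bool corrupted)
{
    int status = toy_validate_inode(sb, corrupted);

    if (status && sb->read_only)
        status = -EROFS;
    return status;
}
```

A mount helper that checks for -EROFS instead of collapsing every failure to 1 could then tell the user to run fsck.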
     

22 Feb, 2018

1 commit

  • commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

    If we can't get the inode lock immediately in
    ocfs2_inode_lock_with_page() when reading a page, we should not return
    directly, since this leads to a softlockup when the kernel is built
    without CONFIG_PREEMPT. The fix is to take a blocking lock and
    immediately unlock it before returning. This avoids wasting CPU on
    repeated retries, improves fairness in getting the lock among multiple
    nodes, and increases efficiency when the same file is modified
    frequently from multiple nodes.

    The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
    looks like:

    Kernel panic - not syncing: softlockup: hung tasks
    CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:

    dump_stack+0x5c/0x82
    panic+0xd5/0x21e
    watchdog_timer_fn+0x208/0x210
    __hrtimer_run_queues+0xcc/0x200
    hrtimer_interrupt+0xa6/0x1f0
    smp_apic_timer_interrupt+0x34/0x50
    apic_timer_interrupt+0x96/0xa0

    RIP: 0010:unlock_page+0x17/0x30
    RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
    RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
    RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
    RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
    R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
    R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
    ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
    ocfs2_readpage+0x41/0x2d0 [ocfs2]
    filemap_fault+0x12b/0x5c0
    ocfs2_fault+0x29/0xb0 [ocfs2]
    __do_fault+0x1a/0xa0
    __handle_mm_fault+0xbe8/0x1090
    handle_mm_fault+0xaa/0x1f0
    __do_page_fault+0x235/0x4b0
    trace_do_page_fault+0x3c/0x110
    async_page_fault+0x28/0x30
    RIP: 0033:0x7fa75ded638e
    RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
    RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
    RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
    RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
    R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
    R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

    As for the performance improvement, the testing time is reduced and
    CPU utilization decreases; the detailed data is as follows. I ran the
    multi_mmap test case from the ocfs2-test package in a three-node cluster.

    Before applying this patch:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
    1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
    5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
    95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
    2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
    2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
    2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 14:44:52 CST 2017
    multi_mmap..................................................Passed.
    Runtime 783 seconds.

    After applying this patch:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
    155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
    95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
    2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
    5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
    2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
    299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
    335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
    535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
    1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

    ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
    Tests with "-b 4096 -C 32768"
    Thu Dec 28 15:04:12 CST 2017
    multi_mmap..................................................Passed.
    Runtime 487 seconds.

    Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
    Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
    Signed-off-by: Gang He
    Reviewed-by: Eric Ren
    Acked-by: alex chen
    Acked-by: piaojun
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang He
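
Why "block once, then retry" beats "retry immediately" can be shown with a toy contended lock. The model below is an assumption-laden sketch, not the ocfs2 locking code: all `toy_*` names are hypothetical, contention is simulated by a tick counter, and a blocking acquire is modeled as simply waiting until the holder is done.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy contended lock: it becomes free once busy_ticks reaches zero. */
struct toy_lock { int busy_ticks; };

/* Non-blocking attempt; each failure burns one tick of CPU time. */
static bool toy_trylock(struct toy_lock *l)
{
    if (l->busy_ticks > 0) {
        l->busy_ticks--;
        return false;
    }
    return true;
}

/* Blocking acquire followed by an immediate unlock: the caller sleeps
 * until the current holder is done, so no ticks are left afterwards. */
static void toy_lock_then_unlock(struct toy_lock *l)
{
    l->busy_ticks = 0;
}

/* Returns how many times the page read had to be attempted. */
static int toy_read_page(struct toy_lock *l, bool wait_on_contention)
{
    int attempts = 0;

    for (;;) {
        attempts++;
        if (toy_trylock(l))
            return attempts;
        if (wait_on_contention)
            toy_lock_then_unlock(l);  /* wait once, then retry */
    }
}
```

With 1000 ticks of contention, the spinning variant needs 1001 attempts while the waiting variant needs 2.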
     

24 Nov, 2017

2 commits

  • commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.

    We should wait for dio requests to finish before taking the inode lock
    in ocfs2_setattr(), otherwise the following deadlock will happen:

    process 1                  process 2                    process 3
    truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
    ocfs2_setattr
     ocfs2_inode_lock_tracker
      ocfs2_inode_lock_full
     inode_dio_wait
      __inode_dio_wait
      -->waiting for all dio
      requests finish
                                                            dlm_proxy_ast_handler
                                                             dlm_do_local_bast
                                                              ocfs2_blocking_ast
                                                               ocfs2_generic_handle_bast
                                                                set OCFS2_LOCK_BLOCKED flag
                               dio_end_io
                                dio_bio_end_aio
                                 dio_complete
                                  ocfs2_dio_end_io
                                   ocfs2_dio_end_io_write
                                    ocfs2_inode_lock
                                     __ocfs2_cluster_lock
                                      ocfs2_wait_for_mask
                                      -->waiting for OCFS2_LOCK_BLOCKED
                                      flag to be cleared, that is waiting
                                      for 'process 1' unlocking the inode lock
                                  inode_dio_end
                                  -->here dec the i_dio_count, but will never
                                  be called, so a deadlock happened.

    Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
    Signed-off-by: Alex Chen
    Reviewed-by: Jun Piao
    Reviewed-by: Joseph Qi
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    alex chen
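
The lock-ordering problem in the commit above can be pictured as a wait-for graph. The following is a minimal sketch, not kernel code: the node labels and the `find_cycle` helper are made up for illustration, and the "after" graph assumes only what the commit message states, namely that the truncating task waits for dio before it takes the inode lock.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    wait_for maps each node to the single node it is waiting on (or None).
    """
    for start in wait_for:
        path, node = [], start
        while node is not None and node not in path:
            path.append(node)
            node = wait_for.get(node)
        if node is not None:
            # We walked back onto a node already on the path: that is a cycle.
            return path[path.index(node):]
    return None

# Before the fix: truncate takes the inode lock, then waits for dio.
before = {
    "truncate (holds inode lock)": "dio completion",
    "dio completion": "inode lock request",
    "inode lock request": "truncate (holds inode lock)",
}

# After the fix: truncate waits for dio first and holds no lock while waiting.
after = {
    "truncate (no lock yet)": "dio completion",
    "dio completion": None,
}

print(find_cycle(before))  # a three-node cycle
print(find_cycle(after))   # None
```

Reordering the two waits removes the edge from the dio completion path back to the truncating task, which is what breaks the cycle in this model.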
     
  • commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.

    When a node dies, other live nodes have to choose a new master for each
    existing lock resource mastered by the dead node.

    In the ocfs2/dlm implementation, this is done by the function
    dlm_move_lockres_to_recovery_list, which marks those lock resources as
    DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
    changes the lock resource's master later.

    So without invoking dlm_move_lockres_to_recovery_list, no master will be
    chosen after dlm recovery completes, since no lock resource can be
    found through the ::resource list.

    What's worse is that if DLM_LOCK_RES_RECOVERING is not set on lock
    resources mastered by a dead node, synchronization among nodes breaks.

    So invoke dlm_move_lockres_to_recovery_list again.

    Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
    Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com
    Signed-off-by: Changwei Ge
    Reported-by: Vitaly Mayatskikh
    Tested-by: Vitaly Mayatskikh
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Changwei Ge
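
The role of the recovery list described in the commit above can be sketched with a toy model. This is an illustration under stated assumptions, not the o2dlm data structures: `LockRes`, `node_died` and `finish_recovery` are hypothetical names, and only the `DLM_LOCK_RES_RECOVERING` flag name comes from the commit message.

```python
from dataclasses import dataclass, field

RECOVERING = "DLM_LOCK_RES_RECOVERING"  # flag name taken from the commit message

@dataclass
class LockRes:
    name: str
    owner: int
    flags: set = field(default_factory=set)

def node_died(resources, dead_node, recovery_list):
    """Queue every resource mastered by the dead node for remastering."""
    for res in resources:
        if res.owner == dead_node:
            res.flags.add(RECOVERING)
            recovery_list.append(res)

def finish_recovery(recovery_list, new_master):
    """Assign a new master, but only to resources found on the list."""
    while recovery_list:
        res = recovery_list.pop()
        res.owner = new_master
        res.flags.discard(RECOVERING)

resources = [LockRes("res-a", owner=1), LockRes("res-b", owner=2)]
recovery_list = []
node_died(resources, dead_node=2, recovery_list=recovery_list)
finish_recovery(recovery_list, new_master=1)
print([(r.name, r.owner) for r in resources])
```

If `node_died` is skipped in this model, `finish_recovery` has nothing to walk and `res-b` keeps pointing at the dead node, which mirrors the failure the commit describes.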
     

03 Nov, 2017

1 commit

  • The first cluster group descriptor is not stored at the start of the
    group but at an offset from the start. We need to take this into
    account while doing fstrim on the first cluster group. Otherwise we
    will wrongly start fstrim a few blocks after the desired start block and
    the range can cross over into the next cluster group and zero out the
    group descriptor there. This can cause filesystem corruption that cannot
    be fixed by fsck.

    Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
    Signed-off-by: Ashish Samant
    Reviewed-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ashish Samant
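
The offset problem in the commit above can be shown with a small arithmetic model. This is a sketch only: the geometry constants, the descriptor position and both helper functions are hypothetical and do not reproduce the ocfs2 on-disk layout or the kernel's trim code. It assumes that a group's base cluster is derived from its descriptor block number, which is correct for every group except the first.

```python
BLOCKS_PER_CLUSTER = 8        # hypothetical geometry for this sketch
CLUSTERS_PER_GROUP = 1024
FIRST_GROUP_DESC_BLKNO = 16   # hypothetical: descriptor sits 2 clusters into group 0

def group_desc_blkno(group_index):
    """Block number of a group's descriptor in this toy layout."""
    if group_index == 0:
        return FIRST_GROUP_DESC_BLKNO
    return group_index * CLUSTERS_PER_GROUP * BLOCKS_PER_CLUSTER

def discard_range(group_index, bit_start, bit_count, fixed):
    """Return (first_cluster, end_cluster) to discard for a free run in a group."""
    desc_blkno = group_desc_blkno(group_index)
    base = desc_blkno // BLOCKS_PER_CLUSTER
    if fixed and desc_blkno == FIRST_GROUP_DESC_BLKNO:
        base = 0  # group 0 starts at cluster 0, not at its descriptor
    first = base + bit_start
    return first, first + bit_count

# A free run at the tail of the first group: bits 1000..1023.
print(discard_range(0, 1000, 24, fixed=False))  # overshoots past cluster 1024
print(discard_range(0, 1000, 24, fixed=True))   # stays inside group 0
```

In this model the unfixed range ends beyond `CLUSTERS_PER_GROUP`, so it reaches the cluster where the next group's descriptor lives.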
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Sep, 2017

1 commit

  • Pull mount flag updates from Al Viro:
    "Another chunk of fmount preparations from dhowells; only trivial
    conflicts for that part. It separates MS_... bits (very grotty
    mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
    only a small subset of MS_... stuff).

    This does *not* convert the filesystems to new constants; only the
    infrastructure is done here. The next step in that series is where the
    conflicts would be; that's the conversion of filesystems. It's purely
    mechanical and it's better done after the merge, so if you could run
    something like

    list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

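    sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
    -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
    -e 's/\<MS_NODEV\>/SB_NODEV/g' \
    -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
    -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
    -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
    -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
    -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
    -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
    -e 's/\<MS_SILENT\>/SB_SILENT/g' \
    -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
    -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
    -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
    -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \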
    $list

    and commit it with something along the lines of 'convert filesystems
    away from use of MS_... constants' as commit message, it would save
    quite a bit of headache next cycle"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Differentiate mount flags (MS_*) from internal superblock flags
    VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
    vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

    Linus Torvalds
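
A mechanical rename like the one requested in the pull message above has to match whole identifiers only. The following is a minimal Python sketch of that idea, not the script that was actually run: `FLAGS` is a subset of the listed names, and `convert` is a hypothetical helper.

```python
import re

# Subset of the names listed in the pull message, for illustration.
FLAGS = ["RDONLY", "NOSUID", "NODEV", "NOEXEC", "NOATIME", "NODIRATIME"]

# \b on both sides keeps the match to whole identifiers.
PATTERN = re.compile(r"\bMS_(" + "|".join(FLAGS) + r")\b")

def convert(source):
    """Rename whole-word MS_<flag> identifiers to SB_<flag>."""
    return PATTERN.sub(r"SB_\1", source)

line = "if (sb->s_flags & MS_NOATIME) flags |= MS_NODIRATIME | XMS_RDONLY;"
print(convert(line))
```

The word boundaries matter because some names are prefixes of others (MS_NODEV and MS_NODIRATIME share a stem) and because unrelated identifiers may merely contain one of the names.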
     

08 Sep, 2017

2 commits

  • Pull quota scaling updates from Jan Kara:
    "This contains changes to make the quota subsystem more scalable.

    Reportedly it improves number of files created per second on ext4
    filesystem on fast storage by about a factor of 2x"

    * 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
    quota: Add lock annotations to struct members
    quota: Reduce contention on dq_data_lock
    fs: Provide __inode_get_bytes()
    quota: Inline dquot_[re]claim_reserved_space() into callsite
    quota: Inline inode_{incr,decr}_space() into callsites
    quota: Inline functions into their callsites
    ext4: Disable dirty list tracking of dquots when journalling quotas
    quota: Allow disabling tracking of dirty dquots in a list
    quota: Remove dq_wait_unused from dquot
    quota: Move locking into clear_dquot_dirty()
    quota: Do not dirty bad dquots
    quota: Fix possible corruption of dqi_flags
    quota: Propagate ->quota_read errors from v2_read_file_info()
    quota: Fix error codes in v2_read_file_info()
    quota: Push dqio_sem down to ->read_file_info()
    quota: Push dqio_sem down to ->write_file_info()
    quota: Push dqio_sem down to ->get_next_id()
    quota: Push dqio_sem down to ->release_dqblk()
    quota: Remove locking for writing to the old quota format
    quota: Do not acquire dqio_sem for dquot overwrites in v2 format
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

07 Sep, 2017

4 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - DAX updates

    - OCFS2

    - most of MM

    * emailed patches from Andrew Morton: (119 commits)
    mm,fork: introduce MADV_WIPEONFORK
    x86,mpx: make mpx depend on x86-64 to free up VMA flag
    mm: add /proc/pid/smaps_rollup
    mm: hugetlb: clear target sub-page last when clearing huge page
    mm: oom: let oom_reap_task and exit_mmap run concurrently
    swap: choose swap device according to numa node
    mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
    mm, oom: do not rely on TIF_MEMDIE for memory reserves access
    z3fold: use per-cpu unbuddied lists
    mm, swap: don't use VMA based swap readahead if HDD is used as swap
    mm, swap: add sysfs interface for VMA based swap readahead
    mm, swap: VMA based swap readahead
    mm, swap: fix swap readahead marking
    mm, swap: add swap readahead hit statistics
    mm/vmalloc.c: don't reinvent the wheel but use existing llist API
    mm/vmstat.c: fix wrong comment
    selftests/memfd: add memfd_create hugetlbfs selftest
    mm/shmem: add hugetlbfs support to memfd_create()
    mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
    mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
    ...

    Linus Torvalds
     
  • Clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jun Piao
     
  • The function is never called outside of fs/ocfs2/acl.c.

    Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Pull writeback error handling updates from Jeff Layton:
    "This pile continues the work from last cycle on better tracking
    writeback errors. In v4.13 we added some basic errseq_t infrastructure
    and converted a few filesystems to use it.

    This set continues refining that infrastructure, adds documentation,
    and converts most of the other filesystems to use it. The main
    exception at this point is the NFS client"

    * tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    ecryptfs: convert to file_write_and_wait in ->fsync
    mm: remove optimizations based on i_size in mapping writeback waits
    fs: convert a pile of fsync routines to errseq_t based reporting
    gfs2: convert to errseq_t based writeback error reporting for fsync
    fs: convert sync_file_range to use errseq_t based error-tracking
    mm: add file_fdatawait_range and file_write_and_wait
    fuse: convert to errseq_t based error tracking for fsync
    mm: consolidate dax / non-dax checks for writeback
    Documentation: add some docs for errseq_t
    errseq: rename __errseq_set to errseq_set

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Aug, 2017

5 commits

  • dq_data_lock is currently used to protect all modifications of quota
    accounting information, consistency of quota accounting on the inode,
    and dquot pointers from inode. As a result contention on the lock can be
    pretty heavy.

    Reduce the contention on the lock by protecting quota accounting
    information by a new dquot->dq_dqb_lock and consistency of quota
    accounting with inode usage by inode->i_lock.

    This change reduces time to create 500000 files on ext4 on ramdisk by 50
    different processes in separate directories by 6% when user quota is
    turned on. When those 50 processes belong to 50 different users, the
    improvement is about 9%.

    Signed-off-by: Jan Kara

    Jan Kara
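
The idea behind the commit above, replacing one global lock with a lock per object, can be sketched in userspace. This is an illustration only: `Dquot`, `charge` and the field names are hypothetical stand-ins, and Python threads are used purely to show the locking pattern, not to reproduce the reported speedups.

```python
import threading

class Dquot:
    """Toy stand-in for a dquot; the field names are illustrative."""
    def __init__(self):
        self.lock = threading.Lock()  # plays the role of dquot->dq_dqb_lock
        self.curspace = 0

def charge(dquot, nbytes, times):
    for _ in range(times):
        with dquot.lock:              # only users of this dquot contend here
            dquot.curspace += nbytes

dquots = [Dquot() for _ in range(4)]
threads = [threading.Thread(target=charge, args=(dq, 4096, 1000))
           for dq in dquots for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([dq.curspace for dq in dquots])
```

With a single global lock all eight threads would serialize on it; with per-object locks only the two threads that share a dquot do.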
     
  • Push down acquisition of dqio_sem into ->read_file_info() callback. This
    is for consistency with other operations and it also allows us to get
    rid of an ugliness in OCFS2.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Push down acquisition of dqio_sem into ->write_file_info() callback.
    Mostly for consistency with other operations.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • vfs_load_quota_inode() needs dqio_sem only for reading. In fact dqio_sem
    is not needed there at all since the function can be called only during
    quota on when quota file cannot be modified but let's leave the
    protection there since it is logical and the path is in no way
    performance critical.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
    yet.

    Signed-off-by: Jan Kara

    Jan Kara
     

03 Aug, 2017

1 commit

  • When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID bit
    on 'DIR1' getting cleared if the user is not a member of the owning group.

    Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
    into ocfs2_iop_set_acl(). That way the function will not be called when
    inheriting ACLs which is what we want as it prevents SGID bit clearing
    and the mode has been properly set by posix_acl_create() anyway. Also
    posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
    mode itself.

    Fixes: 073931017b4 ("posix_acl: Clear SGID bit when setting file permissions")
    Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
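
The difference between the two call paths in the commit above can be modeled in a few lines. This is a simplified sketch, not the kernel logic: the three functions are hypothetical, and the SGID rule is reduced to "clear the bit when the caller is not in the owning group and has no overriding capability".

```python
import stat

def update_mode_for_acl(mode, caller_in_group, has_fsetid=False):
    """Simplified stand-in for the SGID handling in posix_acl_update_mode()."""
    if not caller_in_group and not has_fsetid:
        mode &= ~stat.S_ISGID
    return mode

def set_acl_explicit(mode, caller_in_group):
    """Explicit ACL change from userspace: the mode update still applies."""
    return update_mode_for_acl(mode, caller_in_group)

def set_acl_inherited(mode, caller_in_group, fixed=True):
    """New directory inheriting default ACLs from its parent."""
    if fixed:
        return mode  # mode was already set up at creation; leave SGID alone
    return update_mode_for_acl(mode, caller_in_group)

new_dir = stat.S_IFDIR | stat.S_ISGID | 0o775
print(bool(set_acl_inherited(new_dir, caller_in_group=False, fixed=False) & stat.S_ISGID))
print(bool(set_acl_inherited(new_dir, caller_in_group=False, fixed=True) & stat.S_ISGID))
```

The first line prints False (the bit is lost on inheritance) and the second prints True, which is the behavior the commit restores.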
     

01 Aug, 2017

1 commit

  • This patch converts most of the in-kernel filesystems that do writeback
    out of the pagecache to report errors using the errseq_t-based
    infrastructure that was recently added. This allows them to report
    errors once for each open file description.

    Most filesystems have a fairly straightforward fsync operation. They
    call filemap_write_and_wait_range to write back all of the data and
    wait on it, and then (sometimes) sync out the metadata.

    For those filesystems this is a straightforward conversion from calling
    filemap_write_and_wait_range in their fsync operation to calling
    file_write_and_wait_range.

    Acked-by: Jan Kara
    Acked-by: Dave Kleikamp
    Signed-off-by: Jeff Layton

    Jeff Layton
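
The "report errors once for each open file description" behavior mentioned in the commit above can be illustrated with a toy model. This is a sketch under loud assumptions: `Mapping` and `OpenFile` are hypothetical classes, and the real errseq_t encoding is more involved than a plain counter.

```python
import errno

class Mapping:
    """Toy model: a counter bumped on each writeback error, plus the last error."""
    def __init__(self):
        self.seq = 0
        self.err = 0

    def set_error(self, err):
        self.seq += 1
        self.err = err

class OpenFile:
    def __init__(self, mapping):
        self.mapping = mapping
        self.seen = mapping.seq   # sample the sequence at open time

    def fsync(self):
        """Report an error recorded since this file last checked, exactly once."""
        if self.seen == self.mapping.seq:
            return 0
        self.seen = self.mapping.seq
        return self.mapping.err

mapping = Mapping()
a, b = OpenFile(mapping), OpenFile(mapping)
mapping.set_error(-errno.EIO)
print(a.fsync(), a.fsync(), b.fsync())
```

Each open file keeps its own cursor, so `a` sees the error once and then gets 0, while `b` still sees it on its first fsync.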
     

17 Jul, 2017

1 commit

  • Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

    to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

    to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

    to make comparisons against the result of sb_rdonly() (which is a bool)
    work correctly.

    Signed-off-by: David Howells

    David Howells
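
The reason for the final `(bool)` cast in the commit above is that a masked flag word and a boolean do not compare as expected unless both sides are normalized. The sketch below shows the general pitfall in Python; `FLAG` is a hypothetical mask value chosen so the mismatch is visible, and the three functions are illustrative names.

```python
FLAG = 4  # hypothetical mask value chosen so the mismatch is visible

def sb_rdonly(s_flags):
    """Boolean query, in the spirit of sb_rdonly()."""
    return bool(s_flags & FLAG)

def changed_raw(new_flags, s_flags):
    # Compares a masked integer (0 or 4) against a bool (False or True).
    return (new_flags & FLAG) != sb_rdonly(s_flags)

def changed_cast(new_flags, s_flags):
    # Normalizes the masked value first, so both sides are booleans.
    return bool(new_flags & FLAG) != sb_rdonly(s_flags)

print(changed_raw(FLAG, FLAG))   # True: reports a change that did not happen
print(changed_cast(FLAG, FLAG))  # False
```

Casting the masked value first makes the comparison independent of which bit the flag occupies.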
     

07 Jul, 2017

4 commits

  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups provided by <linux/sysfs.h> work with
    const attribute_group. So mark the non-const structs as const.

    File size before:
       text    data     bss     dec     hex filename
       4402    1088      38    5528    1598 fs/ocfs2/stackglue.o

    File size after adding 'const':
       text    data     bss     dec     hex filename
       4442    1024      38    5504    1580 fs/ocfs2/stackglue.o

    Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
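
The size figures quoted in the commit above can be checked with a few lines of arithmetic. The numbers are copied from the commit message; the variable names are just for this sketch.

```python
before = {"text": 4402, "data": 1088, "bss": 38}
after = {"text": 4442, "data": 1024, "bss": 38}

# Per-section change and the totals that size(1) prints as "dec" and "hex".
delta = {name: after[name] - before[name] for name in before}
total_before = sum(before.values())
total_after = sum(after.values())

print(delta)
print(total_before, hex(total_before))
print(total_after, hex(total_after))
```

The data section shrinks by 64 bytes while text grows by 40, for a net reduction of 24 bytes, consistent with the const-qualified structures moving out of writable data.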
     
  • 'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end
    of sc_fop_release().

    Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.com
    Signed-off-by: Jun Piao
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    piaojun
     
  • Filesystems generally use SUPER_MAGIC values from magic.h instead of a
    local definition.

    Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Fix a static code checker warning:

    fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR'

    Fixes: d56a8f32e4c6 ("ocfs2: check/fix inode block for online file check")
    Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Reviewed-by: Eric Ren
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
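
The checker warning in the commit above concerns the error-pointer convention, which can be modeled with plain integers. This is a sketch, not kernel code: pointers are represented as 64-bit unsigned values, and the Python functions only borrow the names of the kernel helpers they imitate.

```python
MAX_ERRNO = 4095
WORD = 2 ** 64  # model pointers as 64-bit unsigned integers

def ERR_PTR(error):
    """Encode a negative errno as a pointer-sized value."""
    return error % WORD

def IS_ERR(ptr):
    """True only for values in the top MAX_ERRNO addresses."""
    return ptr >= WORD - MAX_ERRNO

def IS_ERR_OR_NULL(ptr):
    return ptr == 0 or IS_ERR(ptr)

print(IS_ERR(ERR_PTR(-5)))         # True: a real error is detected
print(IS_ERR(ERR_PTR(0)))          # False: zero encodes to NULL
print(IS_ERR_OR_NULL(ERR_PTR(0)))  # True
```

Passing zero therefore yields a NULL pointer that a caller testing only with `IS_ERR` would treat as valid, which is why the checker flags it.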