26 Sep, 2014

2 commits

  • There is a deadlock case which was reported by Guozhonghua:
    https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html

    This case is caused by &res->spinlock and &dlm->master_lock
    misordering in different threads.

    It was introduced by commit 8d400b81cc83 ("ocfs2/dlm: Clean up refmap
    helpers"). Since the lockres is new, it does not require the
    &res->spinlock. So remove it.

    Fixes: 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers")
    Signed-off-by: Joseph Qi
    Reviewed-by: joyce.xue
    Reported-by: Guozhonghua
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • osb->vol_label is malloced in ocfs2_initialize_super but not freed if
    an error occurs or during umount, thus causing a memory leak.

    Signed-off-by: Joseph Qi
    Reviewed-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

30 Aug, 2014

4 commits

  • For debug use, we can see from the log whether the fence decision is
    made and why it is not fenced.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • When the TCP retransmit timeout (15 minutes) expires, the connection is
    closed, and pending messages may be lost during this time. So we set the
    TCP user timeout to its maximum value to override the retransmit
    timeout. This is OK for ocfs2 since we have the disk heartbeat: if the
    peer crashes, its disk heartbeat times out and it is evicted. If the
    disk heartbeat does not time out and the connection stays idle for a
    long time, the cluster has entered a split-brain state; since fencing
    can't happen, we'd better keep the connection and wait for the network
    to recover.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
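
    As a rough userspace illustration of the option described above (not the
    kernel patch itself), TCP_USER_TIMEOUT is set in milliseconds through
    setsockopt() on Linux. The helper names and the SKETCH_MAX_USER_TIMEOUT_MS
    constant are hypothetical, and treating INT_MAX milliseconds as "the max
    value" is an assumption; it works out to a little under 25 days, which
    matches the figure quoted elsewhere in this series.

```c
#include <limits.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Assumption for this sketch: the "max value" is INT_MAX milliseconds. */
#define SKETCH_MAX_USER_TIMEOUT_MS ((unsigned int)INT_MAX)

/* Hypothetical helper: set TCP_USER_TIMEOUT (in milliseconds) on a TCP
 * socket. Returns 0 on success, -1 on error (errno set by setsockopt). */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}

/* Convert a millisecond timeout to whole days, rounded down. */
unsigned int timeout_ms_to_days(unsigned int timeout_ms)
{
    return timeout_ms / (24u * 60u * 60u * 1000u);
}
```

    With these helpers, set_tcp_user_timeout(fd, SKETCH_MAX_USER_TIMEOUT_MS)
    would ask the stack to keep retransmitting for roughly 24.9 days before
    giving up on unacknowledged data.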
     
  • This patch series fixes a possible message loss bug in ocfs2 when the
    network goes bad. This bug can cause ocfs2 to hang forever even after
    the network becomes good again.

    Messages may be lost in the following case. After the TCP connection is
    established between two nodes, an idle timer is set to check its state
    periodically. If no messages are received during this time, the idle
    timer times out, shuts down the connection and tries to reconnect, so
    pending messages in the TCP queues are lost. These messages may be from
    dlm, and dlm may hang in this case. This may cause the whole ocfs2
    cluster to hang.

    This is very likely to happen when the network state goes bad.
    Reconnecting is useless, since it will fail if the network state is
    still bad. Just waiting for the network to recover may be a better idea:
    no messages will be lost, and some node will be fenced until the cluster
    goes into a split-brain state. For this case, the TCP user timeout is
    used to override the TCP retransmit timeout. It times out after 25 days;
    by then users should have noticed the problem through the provided log
    and fixed the network. If they have not, ocfs2 falls back to the
    original reconnect behavior.

    This patch (of 3):

    Some messages in the TCP queue may be lost if we shut down the
    connection and reconnect on idle timeout. If packets are lost and the
    reconnect succeeds, the ocfs2 cluster may hang.

    To fix this, we can leave the connection in place and make the fence
    decision on idle timeout. If the network recovers before the fence
    decision is made, the connection survives without losing any messages.

    This bug can be seen when the network state goes bad. It may cause ocfs2
    to hang forever if some packets are lost. With this fix, ocfs2 recovers
    from the hang once the network becomes good again.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Srinivas Eeda
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • If we failed to copy from the structure, writing back the flags leaks 31
    bits of kernel memory (the rest of the ir_flags field).

    In any case, if we cannot copy from/to the structure, why should we
    expect putting just the flags to work?

    Also make sure ocfs2_info_handle_freeinode() returns the right error
    code if the copy_to_user() fails.

    Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
    Signed-off-by: Ben Hutchings
    Cc: Joel Becker
    Acked-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     

07 Aug, 2014

4 commits

  • kcalloc manages count*sizeof overflow.

    Signed-off-by: Fabian Frederick
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
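
    The overflow that kcalloc guards against can be pictured with a small
    userspace analogue. This is only a sketch: alloc_array_checked is a
    hypothetical name, and calloc() stands in for the zeroing allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of an overflow-checked array allocation:
 * returns NULL instead of handing back a too-small buffer when
 * count * size would wrap around SIZE_MAX. */
void *alloc_array_checked(size_t count, size_t size)
{
    if (size != 0 && count > SIZE_MAX / size)
        return NULL;              /* count * size would overflow */
    return calloc(count, size);   /* zero-filled, like kcalloc() */
}
```

    An open-coded kzalloc(count * sizeof(x)) has no such check, so a huge
    count could silently produce a short allocation.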
     
  • Orabug: 19074140

    When umount is issued during recovery on the new master that has not
    finished remastering locks, it triggers BUG() in
    dlm_send_mig_lockres_msg(). Here is the situation:

    1) node A has a lock on resource X mastered by node B.

    2) node B dies -> node A sets recovering flag for res X

    3) Node C becomes the new master for resources owned by the
    dead node and is remastering locks of the dead node but
    has not finished the remastering process yet.

    4) umount is issued on node C.

    5) During processing of umount, ignoring unfinished recovery,
    node C attempts to migrate resource X to node A.

    6) node A finds res X in DLM_LOCK_RES_RECOVERING state, considers
    it a logic error and sends back -EFAULT.

    7) node C asserts BUG() upon seeing EFAULT resp from node A.

    Fix is to delay migrating res X till remastering is finished at which
    point recovering flag will be cleared on both A and C.

    Signed-off-by: Tariq Saeed
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • The unit of total_backoff is msecs not jiffies, so no need to do the
    conversion. Otherwise, the join timeout is not 90 sec.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
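
    A small sketch of why the extra conversion changes the timeout, assuming
    the 90 sec limit was passed through a msecs-to-jiffies conversion before
    being compared with the millisecond counter. The function names and the
    HZ values used below are illustrative assumptions, and the conversion is
    simplified.

```c
/* Simplified stand-in for a msecs-to-jiffies conversion; exact only when
 * ms * hz is a multiple of 1000 (true for the values used here). */
unsigned long sketch_msecs_to_jiffies(unsigned long ms, unsigned long hz)
{
    return ms * hz / 1000;
}

/* Effective timeout in milliseconds when a millisecond counter is compared
 * against a limit that was mistakenly converted to jiffies. */
unsigned long buggy_timeout_ms(unsigned long limit_ms, unsigned long hz)
{
    return sketch_msecs_to_jiffies(limit_ms, hz);
}
```

    Under these assumptions, a 90000 ms limit behaves like 22500 ms with
    HZ=250 and 9000 ms with HZ=100; only HZ=1000 happens to give 90 sec.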
     
  • ocfs2_search_extent_list may return -1, so we should check the return
    value in ocfs2_split_and_insert; otherwise it may cause an array index
    out of bounds.

    Also, ocfs2_search_extent_list can only return a value less than
    el->l_next_free_rec, so checking whether it is equal to or larger than
    le16_to_cpu(el->l_next_free_rec) is meaningless.

    Signed-off-by: Yingtai Xie
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yingtai Xie
     

24 Jun, 2014

9 commits

  • When the workqueue is delayed, a lockres may be purged while it is still
    queued for master assert. This may trigger BUG() as follows.

    N1                              N2
    dlm_get_lockres()
    ->dlm_do_master_requery
                                    is the master of lockres,
                                    so queue assert_master work

                                    dlm_thread() start running
                                    and purge the lockres

                                    dlm_assert_master_worker()
                                    send assert master message
                                    to other nodes
    receiving the assert_master
    message, set master to N2

    dlmlock_remote() sends a create_lock message to N2 but receives
    DLM_IVLOCKID; if it is the RECOVERY lockres, this triggers the BUG().

    Another BUG() is triggered when N3 becomes the new master and sends
    assert_master to N1; N1 triggers the BUG() because the owner doesn't
    match. So we should not purge a lockres while it is queued for assert
    master.

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • The following case may lead to an endless loop during umount.

    node A              node B                    node C                            node D
    umount volume,
    migrate lockres1
    to B
                                                                                    want to lock lockres1,
                                                                                    send
                                                                                    MASTER_REQUEST_MSG
                                                                                    to C
                                                  init block mle
                        send
                        MIGRATE_REQUEST_MSG
                        to C
                                                  find a block
                                                  mle, and then
                                                  return
                                                  DLM_MIGRATE_RESPONSE_MASTERY_REF
                                                  to B
                        set C in refmap
    umount successfully
                        try to umount, endless
                        loop occurs when migrate
                        lockres1 since C is in
                        refmap

    So we can fix this endless loop case by only returning
    DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
    MIGRATE_REQUEST_MSG.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: jiangyiwen
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Xue jiufei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • When the call to ocfs2_add_entry() fails in ocfs2_symlink() or
    ocfs2_mknod(), iput() is not called during dput(dentry) because
    d_instantiate() was never called, and this leads to a umount hang.

    Signed-off-by: jiangyiwen
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • When running dirop_fileop_racer we found a deadlock case.

    Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
    /race/16/1 in the filesystem, and let the inode number of dir 16 be less
    than the inode number of dir race.

    Node A                          Node B
    mv /race/16/1 /race/
                                    right after Node A has got the
                                    EX mode of /race/16/, and tries to
                                    get EX mode of /race
                                    ls /race/16/

    In this case, Node A has got the EX mode of /race/16/, and wants to get
    the EX mode of /race/. Node B has got the PR mode of /race/, and wants
    to get the PR mode of /race/16/. Since EX and PR are mutually exclusive,
    a deadlock occurs.

    This patch fixes this case by locking in ancestor order before trying
    inode number order.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
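
    The ordering rule from this fix can be sketched as a small decision
    function. This is a simplified model, not the ocfs2 code: struct
    dir_node, is_ancestor and lock_first are hypothetical names, and both
    "ancestor is locked first" and "lower inode number first for unrelated
    directories" are assumptions drawn from the scenario above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal directory node: an inode number and a parent link. */
struct dir_node {
    unsigned long ino;
    const struct dir_node *parent;   /* NULL for the root */
};

/* Returns true if 'a' is a proper ancestor of 'd'. */
bool is_ancestor(const struct dir_node *a, const struct dir_node *d)
{
    for (d = d->parent; d != NULL; d = d->parent)
        if (d == a)
            return true;
    return false;
}

/* Pick which of two directories to lock first: an ancestor goes before its
 * descendant; unrelated directories fall back to inode-number order. */
const struct dir_node *lock_first(const struct dir_node *x,
                                  const struct dir_node *y)
{
    if (is_ancestor(x, y))
        return x;
    if (is_ancestor(y, x))
        return y;
    return (x->ino <= y->ino) ? x : y;
}
```

    In the /race and /race/16 example, this picks /race first even though
    dir 16 has the smaller inode number, so the rename takes its locks in
    the same order as the ls on the other node.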
     
  • When a lockres is in the purge list but still in use, it should be moved
    to the tail of the purge list so that dlm_thread continues with the next
    lockres in the list. However, the code list_move_tail(&dlm->purge_list,
    &lockres->purge) does *no* movement, so dlm_thread tries to purge the
    same lockres in this loop again and again. If it stays in use for a long
    time, other lockres will not be processed.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
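
    The effect of the swapped arguments can be reproduced with a minimal
    userspace re-implementation of the circular list helpers. The code below
    is a sketch that mirrors the semantics of the kernel's
    list_move_tail(entry, head); it is not the kernel header, and it assumes
    the in-use entry sits at the front of the list when the call is made.

```c
/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Unlink 'entry' from whatever list it is on. */
void list_del_entry(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Insert 'entry' just before 'head', i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Move 'entry' to the tail of the list headed by 'head'. */
void list_move_tail(struct list_head *entry, struct list_head *head)
{
    list_del_entry(entry);
    list_add_tail(entry, head);
}
```

    For a list head -> a -> b -> c, calling list_move_tail(&head, &a)
    unlinks the head and re-inserts it right before a, which leaves the
    order unchanged. Calling list_move_tail(&a, &head) gives
    head -> b -> c -> a, which is the intended behaviour.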
     
  • This patch tries to fix this crash:

    #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
    #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27 RSP: ffff88003c1cd7e8 RFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff88003c1cdaa8 RCX: 0000000000000000
    RDX: 000000000000000c RSI: ffff880027a95000 RDI: ffff88003c79b540
    RBP: ffff88003c1cd858 R8: 0000000000000000 R9: ffffffff815f6ba0
    R10: 00000000000001c9 R11: 00000000000001c9 R12: ffff88002d271500
    R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
    #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
    #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
    #10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
    #11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
    #12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
    #13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
    #14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
    #15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
    #16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
    #17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
    #18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
    #19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
    #20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

    It crashes at
    BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
    in ocfs2_direct_IO_get_blocks.

    ocfs2_direct_IO_get_blocks expects OCFS2_EXT_REFCOUNTED to have been
    removed in ocfs2_prepare_inode_for_write() if it was there. But no
    cluster lock is held across the whole span from
    ocfs2_prepare_inode_for_write() to ocfs2_direct_IO_get_blocks().

    It can happen in this case:

    Node A(which crashes)             Node B
    ------------------------          ---------------------------
    ocfs2_file_aio_write
      ocfs2_prepare_inode_for_write
        ocfs2_inode_lock
        ...
        ocfs2_inode_unlock
      #no refcount found
    ....                              ocfs2_reflink
                                        ocfs2_inode_lock
                                        ...
                                        ocfs2_inode_unlock
                                        #now, refcount flag set on extent

                                      ...
                                      flush change to disk

    ocfs2_direct_IO_get_blocks
      ocfs2_get_clusters
        #extent map miss
        #buffer_head miss
        read extents from disk
      found refcount flag on extent
      crash..

    Fix:
    Take rw_lock in ocfs2_reflink path

    Signed-off-by: Wengang Wang
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wengang Wang
     
  • 75f82eaa502c ("ocfs2: fix NULL pointer dereference when dismount and
    ocfs2rec simultaneously") may cause umount hang while shutting down
    truncate log.

    The situation is as follows:
    ocfs2_dismount_volume
    -> ocfs2_recovery_exit
      -> free osb->recovery_map
    -> ocfs2_truncate_shutdown
      -> lock global bitmap inode
        -> ocfs2_wait_for_recovery
          -> check whether osb->recovery_map->rm_used is zero

    Because osb->recovery_map is already freed, rm_used can have any
    value, so it may cause a umount hang.

    Signed-off-by: joyce.xue
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • …letimeout closes conn

    Orabug: 18639535

    In a two-node cluster, both nodes hold a lock at PR level and both want
    to convert to EX at the same time. Master node 1 has sent a BAST and
    then closes the connection due to idle timeout. Node 0 receives the
    BAST and sends an unlock request with the cancel flag, but gets error
    -ENOTCONN. The problem is that this error is ignored in
    dlm_send_remote_unlock_request() on the **incorrect** assumption that
    the master is dead. See the NOTE in the comment for why it returns
    DLM_NORMAL. Upon getting DLM_NORMAL, node 0 proceeds to send a convert
    (without the cancel flag), which fails with -ENOTCONN. It waits 5
    seconds and resends.

    This time it gets DLM_IVLOCKID from the master since the lock is not
    found in the grant queue; it had been moved to the converting queue in
    response to the PR->EX convert request. No way out.

    Node 1 (master)                     Node 0
    ==============                      ======

    lock mode PR                        PR

    convert PR -> EX
    mv grant -> convert and que BAST
    ...
                     <--------          convert PR -> EX
    convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
    ...
    BAST (want PR -> NL)
                     ------------------>
    ...
    idle timeout, conn closed
    ...
                                        In response to BAST,
                                        sends unlock with cancel convert flag
                                        gets -ENOTCONN. Ignores and
                                        sends remote convert request
                                        gets -ENOTCONN, waits 5 Sec, retries
    ...
    reconnects
                     <----------------- convert req goes through on next try
    does not find lock on grant que
    status DLM_IVLOCKID
                     ------------------>
    ...

    No way out. Fix is to keep retrying unlock with cancel flag until it
    succeeds or the master dies.

    Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
    Reviewed-by: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tariq Saeed
     
  • There are two files a and b in dir /mnt/ocfs2.

    node A                                  node B

    mv a b
    In ocfs2_rename(), after calling
    ocfs2_orphan_add(), the inode of
    file b will be added into orphan
    dir.

    If ocfs2_update_entry() fails,
    ocfs2_rename return error and mv
    operation fails. But file b still
    exists in the parent dir.

    ocfs2_queue_orphan_scan
    -> ocfs2_queue_recovery_completion
    -> ocfs2_complete_recovery
    -> ocfs2_recover_orphans
    The inode of the file b will be
    put with iput().

    ocfs2_evict_inode
    -> ocfs2_delete_inode
    -> ocfs2_wipe_inode
    -> ocfs2_remove_inode
    OCFS2_VALID_FL in the inode
    i_flags will be cleared.

                                            The file b still can be accessed
                                            on node B.
                                            ls /mnt/ocfs2
                                            When first read the file b with
                                            ocfs2_read_inode_block(). It will
                                            validate the inode using
                                            ocfs2_validate_inode_block().
                                            Because OCFS2_VALID_FL not set in
                                            the inode i_flags, so the file
                                            system will be readonly.

    So we should add the inode to the orphan dir only after updating the
    entry in ocfs2_rename().

    Signed-off-by: alex.chen
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits


11 Jun, 2014

1 commit


05 Jun, 2014

12 commits

  • The last in-tree caller of block_write_full_page_endio() was removed in
    January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
    block_write_full_page() as the only caller of
    block_write_full_page_endio(), so inline block_write_full_page_endio()
    into block_write_full_page().

    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • dlm_recovery_ctxt.received is unused.

    ocfs2_should_refresh_lock_res() can only return 0 or 1, so the error
    handling code in ocfs2_super_lock() is unneeded.

    Signed-off-by: joyce.xue
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • The ocfs2 cluster size may be 1MB, which takes 20 bits. On resize, the
    number of new clusters passed in is usually the number of clusters in a
    group descriptor (32256).

    Since the input cluster count is defined as type int, it overflows when
    shifted left by 20 bits, which leads to an incorrect global bitmap
    i_size.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
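
    The arithmetic behind this overflow can be checked with a short sketch.
    The function names are hypothetical, and an unsigned 32-bit type is used
    in place of int so that the wraparound is well defined in C; the point is
    only that 32256 << 20 does not fit in 32 bits.

```c
#include <stdint.h>

/* Byte count computed in 32 bits: the shift wraps for large inputs.
 * (Unsigned is used here so the wraparound is well defined.) */
uint32_t clusters_to_bytes_32(uint32_t clusters, unsigned int bits)
{
    return clusters << bits;
}

/* Byte count computed in 64 bits: widen before shifting. */
uint64_t clusters_to_bytes_64(uint32_t clusters, unsigned int bits)
{
    return (uint64_t)clusters << bits;
}
```

    For 32256 clusters and a 20-bit cluster size, the true size is
    33822867456 bytes (0x7E0000000), while the 32-bit computation keeps only
    the low 32 bits, 0xE0000000.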
     
  • Parameters new_clusters and first_new_cluster are not used in
    ocfs2_update_last_group_and_inode, so remove them.

    Signed-off-by: Joseph Qi
    Reviewed-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • We found a race situation when dlm recovery and node joining occurs
    simultaneously if the network state is bad.

    N1                                  N4

    start joining dlm and send
    query join to all live nodes
                                        set joining node to N1, return OK
    send query join to other
    live nodes and it may take
    a while

    call dlm_send_join_assert()
    to send assert join message
    when N2 is down, so keep
    trying to send message to N2
    until find N2 is down

    send assert join message to
    N3, but connection is down
    with N3, so it may take a
    while
                                        become the recovery master for N2
                                        and send begin reco message to other
                                        nodes in domain map but no N1
    connection with N3 is rebuilt,
    then send assert join to N4
                                        call dlm_assert_joined_handler(),
                                        add N1 to domain_map

                                        dlm recovery done, send finalize message
                                        to nodes in domain map, including N1
    receiving finalize message,
    trigger the BUG() because
    recovery master mismatch.

    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • Revert commit 75f82eaa502c ("ocfs2: fix NULL pointer dereference when
    dismount and ocfs2rec simultaneously") because it may cause a umount
    hang while shutting down the truncate log.

    The situation is as follows:
    ocfs2_dismount_volume
    -> ocfs2_recovery_exit
      -> free osb->recovery_map
    -> ocfs2_truncate_shutdown
      -> lock global bitmap inode
        -> ocfs2_wait_for_recovery
          -> check whether osb->recovery_map->rm_used is zero

    Because osb->recovery_map is already freed, rm_used can have any
    value, so it may cause a umount hang.

    To prevent a NULL pointer dereference while getting sys_root_inode, we
    use an osb_tl_disable flag to stop scheduling osb_truncate_log_wq after
    truncate log shutdown.

    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • ocfs_info_foo() and ocfs2_get_request_ptr functions are only used in ioctl.c

    Signed-off-by: Fabian Frederick
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We found a conversion deadlock that occurs when the owner of a lockres
    happens to crash before sending DLM_PROXY_AST_MSG for a downconverting
    lock. The situation is as follows:

    Node1                           Node2                       Node3
                                    the owner of lockresA
    lock_1 granted at EX mode
    and call ocfs2_cluster_unlock
    to decrease ex_holders.
                                                                converting lock_3 from
                                                                NL to EX
                                    send DLM_PROXY_AST_MSG
                                    to Node1, asking Node 1
                                    to downconvert.
    receiving DLM_PROXY_AST_MSG,
    thread ocfs2dc send
    DLM_CONVERT_LOCK_MSG
    to Node2 to downconvert
    lock_1(EX->NL).
                                    lock_1 can be granted and
                                    put it into pending_asts
                                    list, return DLM_NORMAL.
                                    then something happened
                                    and Node2 crashed.
    received DLM_NORMAL, waiting
    for DLM_PROXY_AST_MSG.
                                                                selected as the recovery
                                                                master, receiving migrate
                                                                lock from Node1, queue
                                                                lock_1 to the tail of
                                                                converting list.

    After dlm recovery, the converting list in the master of lockresA
    (Node3) will be: converting list head <-> lock_3(NL->EX) <->
    lock_1(EX->NL). The requested mode of lock_3 is not compatible with the
    granted mode of lock_1, so it cannot be granted, and lock_1 cannot
    downconvert because the converting queue is strictly FIFO. So a deadlock
    is created. We think function dlm_process_recovery_data() should
    queue_ast for lock_1 or alter the order of lock_1 and lock_3, so
    dlm_thread can process lock_1 first. And if there are multiple
    downconverting locks, they must convert from PR to NL, so there is no
    need to sort them.

    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
    ocfs2_commit_thread. It then gets into a loop that produces a mass of
    log messages. This needlessly consumes a large amount of resources and
    may lead to the system hanging. So limit printk in this case.

    [akpm@linux-foundation.org: document the msleep]
    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • There are two standard techniques for dereferencing structures pointed
    to by void *: cast to the right type each time they're used, or assign
    to local variables of the right type.

    But there's no need to do *both*.

    Signed-off-by: George Spelvin
    Cc: Mark Fasheh
    Acked-by: Joel Becker
    Reviewed-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • Replace strncpy(size 63) by defined value.

    Signed-off-by: Fabian Frederick
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Static values are automatically initialized to NULL.

    Signed-off-by: Fabian Frederick
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

04 Jun, 2014

1 commit

  • …/git/tip/tip into next

    Pull scheduler updates from Ingo Molnar:
    "The main scheduling related changes in this cycle were:

    - various sched/numa updates, for better performance

    - tree wide cleanup of open coded nice levels

    - nohz fix related to rq->nr_running use

    - cpuidle changes and continued consolidation to improve the
    kernel/sched/idle.c high level idle scheduling logic. As part of
    this effort I pulled cpuidle driver changes from Rafael as well.

    - standardized idle polling amongst architectures

    - continued work on preparing better power/energy aware scheduling

    - sched/rt updates

    - misc fixlets and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
    sched/numa: Decay ->wakee_flips instead of zeroing
    sched/numa: Update migrate_improves/degrades_locality()
    sched/numa: Allow task switch if load imbalance improves
    sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
    sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
    sched: Initialize rq->age_stamp on processor start
    sched, nohz: Change rq->nr_running to always use wrappers
    sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
    sched: Use clamp() and clamp_val() to make sys_nice() more readable
    sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
    sched/numa: Fix initialization of sched_domain_topology for NUMA
    sched: Call select_idle_sibling() when not affine_sd
    sched: Simplify return logic in sched_read_attr()
    sched: Simplify return logic in sched_copy_attr()
    sched: Fix exec_start/task_hot on migrated tasks
    arm64: Remove TIF_POLLING_NRFLAG
    metag: Remove TIF_POLLING_NRFLAG
    sched/idle: Make cpuidle_idle_call() void
    sched/idle: Reflow cpuidle_idle_call()
    sched/idle: Delay clearing the polling bit
    ...

    Linus Torvalds
     

24 May, 2014

1 commit

  • In dlm_init, if creating dlm_lockname_cache fails in
    dlm_init_master_caches, dlm_lockres_cache, which was created earlier,
    is destroyed twice. This causes the system to die when loading modules.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

07 May, 2014

3 commits