15 Jan, 2016

40 commits

  • Make memmap_valid_within return bool, because this particular function
    only uses either one or zero as its return value.

    No functional change.
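
    The conversion follows the usual 0/1-to-bool pattern; roughly (a
    paraphrased sketch of the function in mm/mmzone.c, not the exact diff):

      bool memmap_valid_within(unsigned long pfn,
                               struct page *page, struct zone *zone)
      {
              if (page_to_pfn(page) != pfn)
                      return false;

              if (page_zone(page) != zone)
                      return false;

              return true;
      }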

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • To make the intention clearer, use list_{next,first}_entry instead of
    list_entry.
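
    A minimal sketch of the pattern (struct page and its lru member are
    used here purely as an illustrative type):

      /* before: the "first"/"next" intent is hidden inside list_entry() */
      first = list_entry(head->next, struct page, lru);

      /* after: the helper names spell out the intent */
      first = list_first_entry(head, struct page, lru);
      next  = list_next_entry(first, lru);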

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
    __GFP_NOFAIL is requested. This is fragile because we are basically
    relying on somebody else to perform the reclaim (be it the direct
    reclaim or the OOM killer) for us. The caller might be holding
    resources (e.g. locks) which block other reclaimers from making any
    progress, for example. Remove the retry loop and rely on
    __alloc_pages_slowpath to invoke all allowed reclaim steps and retry
    logic.

    We have to be careful about __GFP_NOFAIL allocations from the
    PF_MEMALLOC context, even though this is a very bad idea to begin with,
    because no progress can be guaranteed at all. We shouldn't break the
    __GFP_NOFAIL semantics here though. It could be argued that this is
    essentially a GFP_NOWAIT context, which we do not support, but
    PF_MEMALLOC is much harder to check for: existing users might set the
    flag and only perform the allocation much later, deep down the code
    path, so we cannot really rule out that some kernel path triggers this
    combination.
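
    For reference, the loop being removed was roughly of this shape (a
    paraphrased sketch, not the exact kernel code):

      /* Loop on ALLOC_NO_WATERMARKS for __GFP_NOFAIL requests.  Nothing
       * here performs reclaim, so forward progress depends entirely on
       * somebody else freeing memory. */
      do {
              page = get_page_from_freelist(gfp_mask, order,
                                            ALLOC_NO_WATERMARKS, ac);
              if (!page && (gfp_mask & __GFP_NOFAIL))
                      wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC,
                                         HZ/50);
      } while (!page && (gfp_mask & __GFP_NOFAIL));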

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_high_priority doesn't do anything special other than
    calling get_page_from_freelist and looping around a GFP_NOFAIL
    allocation until it succeeds. It would be better if the first part were
    done in __alloc_pages_slowpath, where we modify the zonelist, because
    this would be easier to read and understand. Open-coding the function
    into its only caller allows us to simplify it a bit as well.

    This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hardcoding the index into the zonelists array in gfp_zonelist() is not
    a good idea; let's use an enum for it to improve readability.

    No functional change.
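
    The result is along these lines (a sketch assuming the mainline
    enumerator names ZONELIST_FALLBACK and ZONELIST_NOFALLBACK):

      enum {
              ZONELIST_FALLBACK,      /* zonelist with fallback */
      #ifdef CONFIG_NUMA
              ZONELIST_NOFALLBACK,    /* zonelist without fallback, for __GFP_THISNODE */
      #endif
              MAX_ZONELISTS
      };

      static inline int gfp_zonelist(gfp_t flags)
      {
      #ifdef CONFIG_NUMA
              if (unlikely(flags & __GFP_THISNODE))
                      return ZONELIST_NOFALLBACK;
      #endif
              return ZONELIST_FALLBACK;
      }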

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
    Signed-off-by: Yaowei Bai
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Since commit a0b8cab3b9b2 ("mm: remove lru parameter from
    __pagevec_lru_add and remove parts of pagevec API") there's no
    user of this function anymore, so remove it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make memblock_is_memory() and memblock_is_reserved() return bool to
    improve readability, because these particular functions only use either
    one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make is_file_hugepages() return bool to improve readability, because
    this particular function only uses either one or zero as its return
    value.

    This patch also removes the if condition so that is_file_hugepages()
    returns the result directly.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Move the calculation of node_id, zone_idx and the shrink flags into the
    trace function, so that we don't need to calculate these arguments if
    tracing is disabled; this also gives the function fewer arguments.
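
    The pattern looks roughly like the sketch below (hypothetical event
    name; the real events live in include/trace/events/vmscan.h): the raw
    pointer is passed through, and the derived fields are computed in
    TP_fast_assign(), which only runs when the event is enabled.

      TRACE_EVENT(mm_shrink_example,

              TP_PROTO(struct zone *zone, unsigned long nr_scanned),

              TP_ARGS(zone, nr_scanned),

              TP_STRUCT__entry(
                      __field(int,            nid)
                      __field(int,            zid)
                      __field(unsigned long,  nr_scanned)
              ),

              TP_fast_assign(
                      /* computed here, i.e. only when tracing is on */
                      __entry->nid        = zone_to_nid(zone);
                      __entry->zid        = zone_idx(zone);
                      __entry->nr_scanned = nr_scanned;
              ),

              TP_printk("nid=%d zid=%d nr_scanned=%lu",
                      __entry->nid, __entry->zid, __entry->nr_scanned)
      );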

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • We now have a tracepoint in test_pages_isolated() to report a pfn that
    cannot be isolated. But in alloc_contig_range(), some error paths
    don't call test_pages_isolated(), so it's still hard to know the exact
    pfn that causes the allocation failure.

    This patch changes the situation by calling test_pages_isolated() in
    almost every error path. In the allocation failure case some overhead
    is added by this change, but allocation failure is a really rare event
    so it does not matter.

    In the fatal-signal-pending case, we don't call test_pages_isolated()
    because that failure is an intentional one.

    There was also a bogus outer_start problem due to an unchecked buddy
    order, and this patch fixes it as well. Before this patch it didn't
    matter, because the end result was the same. But after this patch the
    tracepoint will report the failed pfn, so it should be accurate.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA allocation should be guaranteed to succeed, but sometimes it can
    fail in the current implementation. To track down the problem, we need
    to know which page is problematic, and this new tracepoint will report
    it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step for reporting the pfn that failed the test
    in a new tracepoint, to analyze CMA allocation failures. There is no
    functional change in this patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Michal Nazarewicz
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When running the SPECint_rate gcc benchmark on some very large boxes it
    was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it, and
    it is what I mostly used to chase down the issue, since I found its
    setup to be easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ core) boxes. The
    problem starts showing up at just a few hundred ranks and gets worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.
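
    A sketch of the lookup side after the conversion (paraphrased from
    mm/mempolicy.c; sp_lookup() is the internal rb-tree helper there):

      struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                                  unsigned long idx)
      {
              struct mempolicy *pol = NULL;
              struct sp_node *sn;

              if (!sp->root.rb_node)
                      return NULL;

              read_lock(&sp->lock);   /* was spin_lock(); lookups now run in parallel */
              sn = sp_lookup(sp, idx, idx + 1);
              if (sn) {
                      mpol_get(sn->policy);
                      pol = sn->policy;
              }
              read_unlock(&sp->lock);

              return pol;
      }

    Insertions and removals keep exclusive access via write_lock() and
    write_unlock().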

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • __phys_to_pfn and __pfn_to_phys are symmetric, and PHYS_PFN and
    PFN_PHYS are symmetric:

    - y = (phys_addr_t)x << PAGE_SHIFT

    - y >> PAGE_SHIFT = (phys_addr_t)x

    - (unsigned long)(y >> PAGE_SHIFT) = x
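
    In other words, the definitions in include/linux/pfn.h end up as:

      #define PFN_PHYS(x)     ((phys_addr_t)(x) << PAGE_SHIFT)
      #define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))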

    [akpm@linux-foundation.org: use macro arg name `x']
    [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
    Signed-off-by: Chen Gang
    Cc: Oleg Nesterov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Move trace_reclaim_flags() into the trace function, so that we don't
    need to calculate these flags if the trace is disabled.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Simplify may_expand_vm().

    [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
    Signed-off-by: Chen Gang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • The page pointer is initialized to NULL but is reinitialized by
    follow_page_mask() before it is used. Drop the useless initialization
    of the page pointer at the beginning of the loop.

    Signed-off-by: Alexey Klimov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Make vmalloc family functions allocate vmalloc area pages with
    alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
    to memcg. This is needed, at least, to account alloc_fdmem allocations.
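
    A caller can then request accounting for a user-triggerable vmalloc
    allocation along these lines (an illustrative call site, not taken
    from the patch):

      /* memcg-accounted, zeroed vmalloc allocation */
      void *table = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT | __GFP_ZERO,
                              PAGE_KERNEL);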

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
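
    A usage sketch (hypothetical cache and struct names) contrasting the
    per-cache flag with the per-call one:

      /* every object from this cache is accounted to the allocating memcg */
      ex_cachep = kmem_cache_create("example_cache", sizeof(struct example),
                                    0, SLAB_ACCOUNT, NULL);

      /* without SLAB_ACCOUNT, accounting must be requested per allocation */
      obj = kmem_cache_alloc(other_cachep, GFP_KERNEL | __GFP_ACCOUNT);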

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So this patch switches kmem accounting to the white-policy: now only
    those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
    memcg. Currently, no kmem allocations are marked like this. The
    following patches will mark several kmem allocations that are known to
    be easily triggered from userspace and therefore should be accounted to
    memcg.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch
    reverts bits introducing the black-list policy. The white-list policy
    will be introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures, many of them
    implicitly rely on the fact that kmalloc never fails, while memcg
    makes failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, and this might result in a noticeable
    increase in memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add a new helper function, get_first_slab(), that gets the first slab
    from a kmem_cache_node.
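
    The helper is roughly of this shape (a paraphrased sketch; it simply
    prefers the partial list over the free list):

      static struct page *get_first_slab(struct kmem_cache_node *n)
      {
              struct page *page;

              page = list_first_entry_or_null(&n->slabs_partial,
                                              struct page, lru);
              if (!page)
                      page = list_first_entry_or_null(&n->slabs_free,
                                                      struct page, lru);

              return page;
      }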

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_for_each_entry().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_first_entry_or_null().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • A little cleanup - the invocation site provides the semicolon.

    Cc: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The lksb flags are defined both in dlmapi.h and dlmcommon.h, so remove
    them from dlmcommon.h.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Found this when doing patch review; remove it to make the code clearer
    and save a little CPU time.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Currently, ocfs2_orphan_del finds and deletes the entry first, and then
    accesses the orphan dir dinode. This is a problem if
    ocfs2_journal_access_di fails: the entry will have been removed from
    the orphan dir, but the inode has not actually been deleted. In other
    words, the file is missing but not actually deleted. So we should
    access the orphan dinode first, as unlink and rename do.
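
    In other words, the journal access moves in front of the entry
    removal, roughly (a paraphrased sketch of the corrected ordering in
    ocfs2_orphan_del(); variable names are illustrative):

      /* access the orphan dir dinode first; if this fails, the entry is
       * still present and nothing is lost */
      status = ocfs2_journal_access_di(handle, INODE_CACHE(orphan_dir_inode),
                                       orphan_dir_bh,
                                       OCFS2_JOURNAL_ACCESS_WRITE);
      if (status < 0) {
              mlog_errno(status);
              goto leave;
      }

      /* only now remove the entry from the orphan directory */
      status = ocfs2_delete_entry(handle, orphan_dir_inode, &lookup);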

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When two processes are migrating the same lockres,
    dlm_add_migration_mle() returns -EEXIST but still inserts a new mle
    into the hash list. dlm_migrate_lockres() will then detach the old mle
    and free the new one, which is already in the hash list, and that
    corrupts the list.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • We have found that the migration source will trigger a BUG because the
    refcount of the mle is already zero before the put, when the target
    goes down during migration. The situation is as follows:

    dlm_migrate_lockres
      dlm_add_migration_mle
      dlm_mark_lockres_migrating
      dlm_get_mle_inuse
      <<<<<< Now the refcount of the mle is 2.
      dlm_send_one_lockres and wait for the target to become the
      new master.
      <<<<<< o2hb detects that the target is down and cleans up the
      migration mle. Now the refcount is 1.

    dlm_migrate_lockres is woken, and puts the mle twice when it finds
    that the target has gone down, which triggers the BUG with the
    following message:

    "ERROR: bad mle: ".

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • DLM does not cache locks. So, blocking lock and unlock will only make
    the performance worse where contention over the locks is high.

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • The following case leads to a slot being overwritten:

    N1: mount the ocfs2 volume, find and allocate slot 0, set
        osb->slot_num to 0, and begin to write the slot info to disk.
    N2: mount the ocfs2 volume and wait for the super lock.
    N1: the block write fails because the storage link is down; unlock
        the super lock.
    N2: get the super lock, also allocate slot 0, then unlock the super
        lock.
    N1: the mount fails and the volume is dismounted; since osb->slot_num
        is 0, try to put the invalid slot to disk, and this will succeed
        if the storage link is restored. N2's slot info is now
        overwritten.

    Once another node, say N3, mounts, it will find and allocate slot 0
    again, which will lead to a mount hang because the journal has already
    been locked by N2. So when writing the slot info fails, invalidate the
    slot in advance to avoid overwriting it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • dlm_grab() may return NULL when the node is doing unmount. When doing
    code review, we found that some dlm handlers may return an error to
    the caller when dlm_grab() returns NULL, and make the caller BUG or
    cause other problems. Here is an example:

    Node 1: receives a migration message from node 3, and sends a migrate
            request to the other nodes.
    Node 2: starts unmounting.
    Node 2: receives the migrate request from node 1 and calls
            dlm_migrate_request_handler().
    Node 2: the unmount thread unregisters the domain handlers and removes
            the dlm_context from dlm_domains.
    Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1.
    Node 1: exits the migration, neither clearing the migration state nor
            sending an assert master message to node 3, which causes node 3
            to hang.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • Since iput() takes care of the NULL check itself, a NULL check before
    calling it is redundant. So clean them up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
    refmap bit on lockres") still leaves a race which cannot ensure that
    the ordering is exactly correct:

    Node1: umount, migrate the lockres to Node2.
    Node2: migration finished, send a migrate request to Node3.
    Node3: receive the migrate request, create a migration_mle, respond to
           Node2.
    Node2: set DLM_LOCK_RES_SETREF_INPROG and send an assert master to
           Node3.
    Node3: delete the migration_mle in assert_master_handler; Node3
           umounts without response. dlm_thread purges this lockres and
           sends a drop deref message to Node2.
    Node2: find that DLM_LOCK_RES_SETREF_INPROG is set and dispatch
           dlm_deref_lockres_worker to clear the refmap; but
           dlm_deref_lockres_worker only waits for
           DLM_LOCK_RES_SETREF_INPROG to be cleared if the node is in the
           refmap, so the worker completes successfully.
    Node3: purge the lockres, send the assert master response to Node1,
           and finish the umount.
    Node2: set Node3 in the refmap, and it won't be cleared forever, which
           leads to a hung umount.

    So wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
    dlm_deref_lockres_worker.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Reviewed-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • The ocfs2_extent_tree_operations structures are never modified, so
    declare them as const.

    Done with the help of Coccinelle.
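
    The change amounts to adding const to each static ops table, e.g. (a
    sketch; the member initializers are elided):

      static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
              .eo_set_last_eb_blk     = ocfs2_dinode_set_last_eb_blk,
              .eo_get_last_eb_blk     = ocfs2_dinode_get_last_eb_blk,
              /* ... */
      };

    Any pointer that stores such a table then needs the matching const
    qualifier as well.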

    Signed-off-by: Julia Lawall
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • We found a race between purge and migration when doing code review.
    Node A puts the lockres on the purge list before receiving the migrate
    message from node B, which is the master. Node A then calls
    dlm_mig_lockres_handler to handle this message.

    dlm_mig_lockres_handler
      dlm_lookup_lockres
      >>>>>> race window: dlm_run_purge_list may run and send a deref
             message to the master, waiting for the response
      spin_lock(&res->spinlock);
      res->state |= DLM_LOCK_RES_MIGRATING;
      spin_unlock(&res->spinlock);
    dlm_mig_lockres_handler returns

    >>>>>> dlm_thread receives the response from the master for the deref
    message and triggers the BUG, because the lockres has the state
    DLM_LOCK_RES_MIGRATING, with the following message:

    dlm_purge_lockres:209 ERROR: 6633EB681FA7474A9C280A4E1A836F0F: res
    M0000000000000000030c0300000000 in use after deref

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • When running the multiple xattr test of ocfs2-test on a three-node
    cluster, mount sometimes failed with the following message:

    o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb)

    Stabilizing the heartbeat depends on the order in which the cluster
    nodes mount ocfs2 and on how fast the TCP connections are established.
    So increase the number of unsteady iterations to leave more time for
    it.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi