15 Jan, 2016

40 commits

  • Make memmap_valid_within return bool, because this particular function
    only uses either one or zero as its return value.

    No functional change.
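
    The conversion follows the usual 0/1-to-bool pattern; roughly (a
    paraphrased sketch of the function in mm/mmzone.c, not the exact diff):

      bool memmap_valid_within(unsigned long pfn,
                               struct page *page, struct zone *zone)
      {
              if (page_to_pfn(page) != pfn)
                      return false;

              if (page_zone(page) != zone)
                      return false;

              return true;
      }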

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • To make the intention clearer, use list_{next,first}_entry instead of
    list_entry.
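
    A minimal sketch of the pattern (struct page and its lru member are
    used here purely as an illustrative type):

      /* before: the "first"/"next" intent is hidden inside list_entry() */
      first = list_entry(head->next, struct page, lru);

      /* after: the helper names spell out the intent */
      first = list_first_entry(head, struct page, lru);
      next  = list_next_entry(first, lru);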

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • __alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
    __GFP_NOFAIL is requested. This is fragile because we are basically
    relying on somebody else to perform the reclaim (be it the direct
    reclaim or the OOM killer) for us. The caller might be holding
    resources (e.g. locks) which block other reclaimers from making any
    progress, for example. Remove the retry loop and rely on
    __alloc_pages_slowpath to invoke all allowed reclaim steps and retry
    logic.

    We have to be careful about __GFP_NOFAIL allocations from the
    PF_MEMALLOC context, even though this is a very bad idea to begin with,
    because no progress can be guaranteed at all. We shouldn't break the
    __GFP_NOFAIL semantics here though. It could be argued that this is
    essentially a GFP_NOWAIT context, which we do not support, but
    PF_MEMALLOC is much harder to check for: existing users might set the
    flag and only perform the allocation much later, deep down the code
    path, so we cannot really rule out that some kernel path triggers this
    combination.
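
    For reference, the loop being removed was roughly of this shape (a
    paraphrased sketch, not the exact kernel code):

      /* Loop on ALLOC_NO_WATERMARKS for __GFP_NOFAIL requests.  Nothing
       * here performs reclaim, so forward progress depends entirely on
       * somebody else freeing memory. */
      do {
              page = get_page_from_freelist(gfp_mask, order,
                                            ALLOC_NO_WATERMARKS, ac);
              if (!page && (gfp_mask & __GFP_NOFAIL))
                      wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC,
                                         HZ/50);
      } while (!page && (gfp_mask & __GFP_NOFAIL));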

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_high_priority doesn't do anything special other than
    calling get_page_from_freelist and looping around a GFP_NOFAIL
    allocation until it succeeds. It would be better if the first part were
    done in __alloc_pages_slowpath, where we modify the zonelist, because
    this would be easier to read and understand. Open-coding the function
    into its only caller allows us to simplify it a bit as well.

    This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hardcoding the index into the zonelists array in gfp_zonelist() is not
    a good idea; let's use an enum for it to improve readability.

    No functional change.
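
    The result is along these lines (a sketch assuming the mainline
    enumerator names ZONELIST_FALLBACK and ZONELIST_NOFALLBACK):

      enum {
              ZONELIST_FALLBACK,      /* zonelist with fallback */
      #ifdef CONFIG_NUMA
              ZONELIST_NOFALLBACK,    /* zonelist without fallback, for __GFP_THISNODE */
      #endif
              MAX_ZONELISTS
      };

      static inline int gfp_zonelist(gfp_t flags)
      {
      #ifdef CONFIG_NUMA
              if (unlikely(flags & __GFP_THISNODE))
                      return ZONELIST_NOFALLBACK;
      #endif
              return ZONELIST_FALLBACK;
      }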

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
    Signed-off-by: Yaowei Bai
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Since commit a0b8cab3b9b2 ("mm: remove lru parameter from
    __pagevec_lru_add and remove parts of pagevec API") there's no
    user of this function anymore, so remove it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make memblock_is_memory() and memblock_is_reserved() return bool to
    improve readability, because these particular functions only use either
    one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make is_file_hugepages() return bool to improve readability, because
    this particular function only uses either one or zero as its return
    value.

    This patch also removes the if condition so that is_file_hugepages()
    returns the result directly.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Move the calculation of node_id, zone_idx and the shrink flags into the
    trace function, so that we don't need to calculate these arguments if
    tracing is disabled; this also gives the function fewer arguments.
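
    The pattern looks roughly like the sketch below (hypothetical event
    name; the real events live in include/trace/events/vmscan.h): the raw
    pointer is passed through, and the derived fields are computed in
    TP_fast_assign(), which only runs when the event is enabled.

      TRACE_EVENT(mm_shrink_example,

              TP_PROTO(struct zone *zone, unsigned long nr_scanned),

              TP_ARGS(zone, nr_scanned),

              TP_STRUCT__entry(
                      __field(int,            nid)
                      __field(int,            zid)
                      __field(unsigned long,  nr_scanned)
              ),

              TP_fast_assign(
                      /* computed here, i.e. only when tracing is on */
                      __entry->nid        = zone_to_nid(zone);
                      __entry->zid        = zone_idx(zone);
                      __entry->nr_scanned = nr_scanned;
              ),

              TP_printk("nid=%d zid=%d nr_scanned=%lu",
                      __entry->nid, __entry->zid, __entry->nr_scanned)
      );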

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • We now have a tracepoint in test_pages_isolated() to report a pfn that
    cannot be isolated. But in alloc_contig_range(), some error paths
    don't call test_pages_isolated(), so it's still hard to know the exact
    pfn that causes the allocation failure.

    This patch changes the situation by calling test_pages_isolated() in
    almost every error path. In the allocation failure case some overhead
    is added by this change, but allocation failure is a really rare event
    so it does not matter.

    In the fatal-signal-pending case, we don't call test_pages_isolated()
    because that failure is an intentional one.

    There was also a bogus outer_start problem due to an unchecked buddy
    order, and this patch fixes it as well. Before this patch it didn't
    matter, because the end result was the same. But after this patch the
    tracepoint will report the failed pfn, so it should be accurate.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA allocation should be guaranteed to succeed, but sometimes it can
    fail in the current implementation. To track down the problem, we need
    to know which page is problematic, and this new tracepoint will report
    it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step for reporting the pfn that failed the test
    in a new tracepoint, to analyze CMA allocation failures. There is no
    functional change in this patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Michal Nazarewicz
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When running the SPECint_rate gcc benchmark on some very large boxes it
    was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it, and
    it is what I mostly used to chase down the issue, since I found its
    setup to be easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ core) boxes. The
    problem starts showing up at just a few hundred ranks and gets worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.
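
    A sketch of the lookup side after the conversion (paraphrased from
    mm/mempolicy.c; sp_lookup() is the internal rb-tree helper there):

      struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                                  unsigned long idx)
      {
              struct mempolicy *pol = NULL;
              struct sp_node *sn;

              if (!sp->root.rb_node)
                      return NULL;

              read_lock(&sp->lock);   /* was spin_lock(); lookups now run in parallel */
              sn = sp_lookup(sp, idx, idx + 1);
              if (sn) {
                      mpol_get(sn->policy);
                      pol = sn->policy;
              }
              read_unlock(&sp->lock);

              return pol;
      }

    Insertions and removals keep exclusive access via write_lock() and
    write_unlock().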

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • __phys_to_pfn and __pfn_to_phys are symmetric, and PHYS_PFN and
    PFN_PHYS are symmetric:

    - y = (phys_addr_t)x << PAGE_SHIFT

    - y >> PAGE_SHIFT = (phys_addr_t)x

    - (unsigned long)(y >> PAGE_SHIFT) = x
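
    In other words, the definitions in include/linux/pfn.h end up as:

      #define PFN_PHYS(x)     ((phys_addr_t)(x) << PAGE_SHIFT)
      #define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))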

    [akpm@linux-foundation.org: use macro arg name `x']
    [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
    Signed-off-by: Chen Gang
    Cc: Oleg Nesterov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Move trace_reclaim_flags() into the trace function, so that we don't
    need to calculate these flags if the trace is disabled.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Simplify may_expand_vm().

    [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
    Signed-off-by: Chen Gang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • The page pointer is initialized to NULL but is reinitialized by
    follow_page_mask() before it is used. Drop the useless initialization
    of the page pointer at the beginning of the loop.

    Signed-off-by: Alexey Klimov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Make vmalloc family functions allocate vmalloc area pages with
    alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
    to memcg. This is needed, at least, to account alloc_fdmem allocations.
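
    A caller can then request accounting for a user-triggerable vmalloc
    allocation along these lines (an illustrative call site, not taken
    from the patch):

      /* memcg-accounted, zeroed vmalloc allocation */
      void *table = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT | __GFP_ZERO,
                              PAGE_KERNEL);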

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).
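
    A usage sketch (hypothetical cache and struct names) contrasting the
    per-cache flag with the per-call one:

      /* every object from this cache is accounted to the allocating memcg */
      ex_cachep = kmem_cache_create("example_cache", sizeof(struct example),
                                    0, SLAB_ACCOUNT, NULL);

      /* without SLAB_ACCOUNT, accounting must be requested per allocation */
      obj = kmem_cache_alloc(other_cachep, GFP_KERNEL | __GFP_ACCOUNT);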

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So this patch switches kmem accounting to the white-policy: now only
    those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
    memcg. Currently, no kmem allocations are marked like this. The
    following patches will mark several kmem allocations that are known to
    be easily triggered from userspace and therefore should be accounted to
    memcg.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch
    reverts bits introducing the black-list policy. The white-list policy
    will be introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures, many of them
    implicitly rely on the fact that kmalloc never fails, while memcg
    makes failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, and this might result in a noticeable
    increase in memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
    fragile and difficult to maintain, because there seem to be many more
    allocations that should not be accounted than those that should be.
    Besides, falsely accounting an allocation might result in much worse
    consequences than not accounting at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add a new helper function, get_first_slab(), that gets the first slab
    from a kmem_cache_node.
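
    The helper is roughly of this shape (a paraphrased sketch; it simply
    prefers the partial list over the free list):

      static struct page *get_first_slab(struct kmem_cache_node *n)
      {
              struct page *page;

              page = list_first_entry_or_null(&n->slabs_partial,
                                              struct page, lru);
              if (!page)
                      page = list_first_entry_or_null(&n->slabs_free,
                                                      struct page, lru);

              return page;
      }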

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_for_each_entry().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_first_entry_or_null().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • A little cleanup - the invocation site provides the semicolon.

    Cc: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The lksb flags are defined both in dlmapi.h and dlmcommon.h, so remove
    them from dlmcommon.h.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Found this when doing patch review; remove it to make the code clearer
    and save a little CPU time.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Currently, ocfs2_orphan_del finds and deletes the entry first, and then
    accesses the orphan dir dinode. This is a problem if
    ocfs2_journal_access_di fails: the entry will have been removed from
    the orphan dir, but the inode has not actually been deleted. In other
    words, the file is missing but not actually deleted. So we should
    access the orphan dinode first, as unlink and rename do.
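
    In other words, the journal access moves in front of the entry
    removal, roughly (a paraphrased sketch of the corrected ordering in
    ocfs2_orphan_del(); variable names are illustrative):

      /* access the orphan dir dinode first; if this fails, the entry is
       * still present and nothing is lost */
      status = ocfs2_journal_access_di(handle, INODE_CACHE(orphan_dir_inode),
                                       orphan_dir_bh,
                                       OCFS2_JOURNAL_ACCESS_WRITE);
      if (status < 0) {
              mlog_errno(status);
              goto leave;
      }

      /* only now remove the entry from the orphan directory */
      status = ocfs2_delete_entry(handle, orphan_dir_inode, &lookup);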

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When two processes are migrating the same lockres,
    dlm_add_migration_mle() returns -EEXIST but still inserts a new mle
    into the hash list. dlm_migrate_lockres() will then detach the old mle
    and free the new one, which is already in the hash list, and that
    corrupts the list.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • We have found that the migration source will trigger a BUG because the
    refcount of the mle is already zero before the put, when the target
    goes down during migration. The situation is as follows:

    dlm_migrate_lockres
      dlm_add_migration_mle
      dlm_mark_lockres_migrating
      dlm_get_mle_inuse
      <<<<<< Now the refcount of the mle is 2.
      dlm_send_one_lockres and wait for the target to become the
      new master.
      <<<<<< o2hb detects that the target is down and cleans up the
      migration mle. Now the refcount is 1.

    dlm_migrate_lockres is woken, and puts the mle twice when it finds
    that the target has gone down, which triggers the BUG with the
    following message:

    "ERROR: bad mle: ".

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • DLM does not cache locks. So, blocking lock and unlock will only make
    the performance worse where contention over the locks is high.

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • The following case leads to a slot being overwritten:

    N1: mount the ocfs2 volume, find and allocate slot 0, set
        osb->slot_num to 0, and begin to write the slot info to disk.
    N2: mount the ocfs2 volume and wait for the super lock.
    N1: the block write fails because the storage link is down; unlock
        the super lock.
    N2: get the super lock, also allocate slot 0, then unlock the super
        lock.
    N1: the mount fails and the volume is dismounted; since osb->slot_num
        is 0, try to put the invalid slot to disk, and this will succeed
        if the storage link is restored. N2's slot info is now
        overwritten.

    Once another node, say N3, mounts, it will find and allocate slot 0
    again, which will lead to a mount hang because the journal has already
    been locked by N2. So when writing the slot info fails, invalidate the
    slot in advance to avoid overwriting it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • dlm_grab() may return NULL when the node is doing unmount. When doing
    code review, we found that some dlm handlers may return an error to
    the caller when dlm_grab() returns NULL, and make the caller BUG or
    cause other problems. Here is an example:

    Node 1: receives a migration message from node 3, and sends a migrate
            request to the other nodes.
    Node 2: starts unmounting.
    Node 2: receives the migrate request from node 1 and calls
            dlm_migrate_request_handler().
    Node 2: the unmount thread unregisters the domain handlers and removes
            the dlm_context from dlm_domains.
    Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1.
    Node 1: exits the migration, neither clearing the migration state nor
            sending an assert master message to node 3, which causes node 3
            to hang.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • Since iput() takes care of the NULL check itself, a NULL check before
    calling it is redundant. So clean them up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
    refmap bit on lockres") still leaves a race which cannot ensure that
    the ordering is exactly correct:

    Node1: umount, migrate the lockres to Node2.
    Node2: migration finished, send a migrate request to Node3.
    Node3: receive the migrate request, create a migration_mle, respond to
           Node2.
    Node2: set DLM_LOCK_RES_SETREF_INPROG and send an assert master to
           Node3.
    Node3: delete the migration_mle in assert_master_handler; Node3
           umounts without response. dlm_thread purges this lockres and
           sends a drop deref message to Node2.
    Node2: find that DLM_LOCK_RES_SETREF_INPROG is set and dispatch
           dlm_deref_lockres_worker to clear the refmap; but
           dlm_deref_lockres_worker only waits for
           DLM_LOCK_RES_SETREF_INPROG to be cleared if the node is in the
           refmap, so the worker completes successfully.
    Node3: purge the lockres, send the assert master response to Node1,
           and finish the umount.
    Node2: set Node3 in the refmap, and it won't be cleared forever, which
           leads to a hung umount.

    So wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
    dlm_deref_lockres_worker.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Reviewed-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • The ocfs2_extent_tree_operations structures are never modified, so
    declare them as const.

    Done with the help of Coccinelle.
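
    The change amounts to adding const to each static ops table, e.g. (a
    sketch; the member initializers are elided):

      static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
              .eo_set_last_eb_blk     = ocfs2_dinode_set_last_eb_blk,
              .eo_get_last_eb_blk     = ocfs2_dinode_get_last_eb_blk,
              /* ... */
      };

    Any pointer that stores such a table then needs the matching const
    qualifier as well.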

    Signed-off-by: Julia Lawall
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • We found a race between purge and migration when doing code review.
    Node A puts the lockres on the purge list before receiving the migrate
    message from node B, which is the master. Node A then calls
    dlm_mig_lockres_handler to handle this message.

    dlm_mig_lockres_handler
      dlm_lookup_lockres
      >>>>>> race window: dlm_run_purge_list may run and send a deref
             message to the master, waiting for the response
      spin_lock(&res->spinlock);
      res->state |= DLM_LOCK_RES_MIGRATING;
      spin_unlock(&res->spinlock);
    dlm_mig_lockres_handler returns

    >>>>>> dlm_thread receives the response from the master for the deref
    message and triggers the BUG, because the lockres has the state
    DLM_LOCK_RES_MIGRATING, with the following message:

    dlm_purge_lockres:209 ERROR: 6633EB681FA7474A9C280A4E1A836F0F: res
    M0000000000000000030c0300000000 in use after deref

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • When running the multiple xattr test of ocfs2-test on a three-node
    cluster, mount sometimes failed with the following message:

    o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb)

    Stabilizing the heartbeat depends on the order in which the cluster
    nodes mount ocfs2 and on how fast the TCP connections are established.
    So increase the number of unsteady iterations to leave more time for
    it.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi