15 Jan, 2016
40 commits
-
Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.

The internal accounting currently counts shmem pages together with
regular files. As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES. The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before. The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
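For illustration, a minimal sketch of the counter split described above; the
enum entries follow the names in this changelog, while the summing helper is
hypothetical and only shows how MM_SHMEMPAGES can be added back to
MM_FILEPAGES where the old combined value is expected:

enum {
        MM_FILEPAGES,   /* pages mapped to regular files */
        MM_ANONPAGES,   /* anonymous private pages */
        MM_SWAPENTS,    /* anonymous swap entries */
        MM_SHMEMPAGES,  /* resident shmem/tmpfs pages, counted separately */
        NR_MM_COUNTERS
};

/* Hypothetical helper: keep the exported file-page value unchanged by
 * folding the new shmem counter back in at the reporting sites. */
static unsigned long mm_file_plus_shmem(struct mm_struct *mm)
{
        return get_mm_counter(mm, MM_FILEPAGES) +
               get_mm_counter(mm, MM_SHMEMPAGES);
}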
Signed-off-by: Jerome Marchand
Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Michal Hocko
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function. We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop. This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().

To demonstrate the difference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the previous
patch) and doesn't populate it at all. I time how long it takes to
cat /proc/pid/smaps of this process 100 times.

Before this patch:
real 0m3.831s
user 0m0.180s
sys 0m3.212s

After this patch:
real 0m1.176s
user 0m0.180s
sys 0m0.684s

The time is similar to the case where a radix tree iterator is employed
on the whole mapping.

Signed-off-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Jerome Marchand
Cc: Konstantin Khlebnikov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overall complexity
is O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics that the shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible. Only for writable private
mappings of shmem objects (i.e. tmpfs files) with the shmem object
itself (partially) swapped out we have to resort to the find_get_entry()
approach. Hopefully such mappings are relatively uncommon.
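As a rough sketch of that decision (the function name smaps_shmem_swap() is
hypothetical and the vma-flag checks are paraphrased from this description,
not copied from the patch):

/* Hypothetical helper, paraphrasing the decision described above. */
static unsigned long smaps_shmem_swap(struct vm_area_struct *vma)
{
        if (!vma->vm_file || !shmem_mapping(vma->vm_file->f_mapping))
                return 0;

        /* Read-only or shared mappings cannot contain COWed pages, so
         * the swap usage tracked by the shmem object itself is exact. */
        if (!(vma->vm_flags & VM_WRITE) || (vma->vm_flags & VM_SHARED))
                return shmem_swap_usage(vma);

        /* Writable private mappings may hide COWed copies behind some
         * ptes; those have to be resolved per pte_none slot with
         * find_get_entry() during the page walk instead. */
        return 0;
}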
To demonstrate the difference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
timed how long it takes to cat /proc/pid/smaps of this process 100
times.

Private writable mapping of a /dev/shm/file (the most complex case):
real 0m3.831s
user 0m0.180s
sys 0m3.212s

Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator):

real 0m1.351s
user 0m0.096s
sys 0m0.768s

Same, but with /dev/shm/file not swapped (so no radix tree walk needed):
real 0m0.935s
user 0m0.128s
sys 0m0.344s

Private anonymous mapping:
real 0m0.949s
user 0m0.116s
sys 0m0.348s

The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.

Signed-off-by: Vlastimil Babka
Cc: Hugh Dickins
Cc: Jerome Marchand
Cc: Konstantin Khlebnikov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
shmem-backed mappings, even if the mapped portion does contain pages
that were swapped out. This is because unlike private anonymous
mappings, shmem does not change pte to swap entry, but pte_none when
swapping the page out. In the smaps page walk, such page thus looks
like it was never faulted in.

This patch changes smaps_pte_entry() to determine the swap status for
such pte_none entries for shmem mappings, similarly to how
mincore_page() does it. Swapped out shmem pages are thus accounted for.
For private mappings of tmpfs files that COWed some of the pages, swapped
out status of the original shmem pages is naturally ignored. If some of
the private copies were also swapped out, they are accounted via their
page table swap entries, so the resulting reported swap usage is then a
sum of both swapped out private copies, and swapped out shmem pages that
were not COWed. No double accounting can thus happen.

The accounting is arguably still not as precise as for private anonymous
mappings, since now we will count also pages that the process in
question never accessed, but another process populated them and then let
them become swapped out. I believe it is still less confusing and
subtle than not showing any swap usage by shmem mappings at all.
The swapped-out counter might be of interest to users who would like to
avoid future swapins during a performance-critical operation and pre-fault
the pages at their convenience. Especially for larger swapped out regions
the cost of swapin is much higher than a fresh page allocation. So a
differentiation between pte_none vs. swapped out is important for those
use cases.

One downside of this patch is that it makes /proc/pid/smaps more
expensive for shmem mappings, as we consult the radix tree for each
pte_none entry, so the overall complexity is O(n*log(n)). I have
measured this on a process that creates a 2GB mapping and dirties single
pages with a stride of 2MB, and timed how long it takes to cat
/proc/pid/smaps of this process 100 times.

Private anonymous mapping:
real 0m0.949s
user 0m0.116s
sys 0m0.348s

Mapping of a /dev/shm/file:
real 0m3.831s
user 0m0.180s
sys 0m3.212s

The difference is rather substantial, so the next patch will reduce the
cost for shared or read-only mappings.

In a less controlled experiment, I've gathered pids of processes on my
desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
included the Chrome browser and some KDE processes. Again, I've run cat
/proc/pid/smaps on each 100 times.

Before this patch:
real 0m9.050s
user 0m0.518s
sys 0m8.066s

After this patch:
real 0m9.221s
user 0m0.541s
sys 0m8.187s

This suggests low impact on average systems.
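For illustration, a hedged sketch of the pte_none handling described earlier
in this entry; the helper name is hypothetical, the real patch does this
inline in smaps_pte_entry(), and the exact calls may differ between kernel
versions:

/* Hypothetical helper: a pte_none slot of a shmem mapping may still
 * have a swap entry stored in the mapping's radix tree, which is what
 * mincore_page() already relies on. */
static unsigned long shmem_pte_none_swap(struct vm_area_struct *vma,
                                         unsigned long addr)
{
        struct page *page = find_get_entry(vma->vm_file->f_mapping,
                                           linear_page_index(vma, addr));

        if (!page)
                return 0;
        if (radix_tree_exceptional_entry(page))
                return PAGE_SIZE;       /* a swap entry: count it */
        put_page(page);                 /* a real page: drop the reference */
        return 0;
}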
Note that this patch doesn't attempt to adjust the SwapPss field for
shmem mappings, which would need extra work to determine who else could
have the pages mapped. Thus the value stays zero except for COWed
swapped out pages in a shmem mapping, which are accounted as usual.

Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Jerome Marchand
Acked-by: Michal Hocko
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This series is based on Jerome Marchand's [1] so let me quote the first
paragraph from there:

There are several shortcomings with the accounting of shared memory
(sysV shm, shared anonymous mapping, mapping to a tmpfs file). The
values in /proc/<pid>/status and statm don't allow one to distinguish
between shmem memory and a shared mapping to a regular file, even though
their implications on memory usage are quite different: at reclaim, file
mapping can be dropped or written back on disk while shmem needs a place
in swap. As for shmem pages that are swapped-out or in swap cache, they
aren't accounted at all.

The original motivation for myself is that a customer found (IMHO
rightfully) confusing that e.g. top output for process swap usage is
unreliable with respect to swapped out shmem pages, which are not
accounted for.

The fundamental difference between private anonymous and shmem pages is
that the latter has PTEs converted to pte_none, and not swapents. As
such, they are not accounted to the number of swapents visible e.g. in
/proc/pid/status VmSwap row. It might be theoretically possible to use
swapents when swapping out shmem (without extra cost, as one has to
change all mappers anyway), and on swap in only convert the swapent for
the faulting process, leaving swapents in other processes until they
also fault (so again no extra cost). But I don't know how many
assumptions this would break, and it would be too disruptive a change for
a relatively small benefit.

Instead, my approach is to document the limitation of VmSwap, and
provide means to determine the swap usage for shmem areas for those who
are interested and willing to pay the price, using /proc/pid/smaps.
Because outside of ipcs, I don't think it's currently possible to
determine the usage at all. The previous patchset [1] did introduce new
shmem-specific fields into smaps output, and functions to determine the
values. I take a simpler approach, noting that smaps output already has
a "Swap: X kB" line, where currently X == 0 always for shmem areas. I
think we can just consider this a bug and provide the proper value by
consulting the radix tree, as e.g. mincore_page() does. In the patch
changelog I explain why this is also not perfect (and cannot be without
swapents), but still arguably much better than showing a 0.

The last two patches are adapted from Jerome's patchset and provide a
VmRSS breakdown to RssAnon, RssFile and RssShm in /proc/pid/status.
Hugh noted that this is a welcome addition, and I agree that it might
help e.g. debugging process memory usage at an albeit non-zero, but still
rather low, cost of an extra per-mm counter and some page flag checks.

[1] http://lwn.net/Articles/611966/
This patch (of 6):
The documentation for /proc/pid/status does not mention that the value
of VmSwap counts only swapped out anonymous private pages, and not
swapped out pages of the underlying shmem objects (for shmem mappings).
This is not obvious, so document this limitation.

Signed-off-by: Vlastimil Babka
Acked-by: Konstantin Khlebnikov
Acked-by: Michal Hocko
Acked-by: Jerome Marchand
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make memmap_valid_within return bool due to this particular function
only using either one or zero as its return value.

No functional change.
Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
To make the intention clearer, use list_{next,first}_entry instead of
list_entry.

Signed-off-by: Geliang Tang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if
__GFP_NOFAIL is requested. This is fragile because we are basically
relying on somebody else to make the reclaim (be it the direct reclaim
or OOM killer) for us. The caller might be holding resources (e.g.
locks) which block other reclaimers from making any progress for
example. Remove the retry loop and rely on __alloc_pages_slowpath to
invoke all allowed reclaim steps and retry logic.

We have to be careful about __GFP_NOFAIL allocations from the
PF_MEMALLOC context even though this is a very bad idea to begin with
because no progress can be guaranteed at all. We shouldn't break the
__GFP_NOFAIL semantic here though. It could be argued that this is
essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC
is much harder to check for existing users because they might happen
deep down the code path performed much later after setting the flag, so
we cannot really rule out that some kernel path triggers this
combination.

Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: David Rientjes
Cc: Tetsuo Handa
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__alloc_pages_high_priority doesn't do anything special other than it
calls get_page_from_freelist and loops around GFP_NOFAIL allocation
until it succeeds. It would be better if the first part was done in
__alloc_pages_slowpath where we modify the zonelist because this would
be easier to read and understand. Opencoding the function into its only
caller allows us to simplify it a bit as well.

This patch doesn't introduce any functional changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Acked-by: David Rientjes
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Hardcoding index to zonelists array in gfp_zonelist() is not a good
idea, let's enumerate it to improve readability.

No functional change.
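A sketch of the idea; the enumerator names below are illustrative and may not
match the patch exactly:

enum {
        ZONELIST_FALLBACK,      /* zonelist with fallback to other nodes */
#ifdef CONFIG_NUMA
        ZONELIST_NOFALLBACK,    /* zonelist without fallback (__GFP_THISNODE) */
#endif
        MAX_ZONELISTS
};

static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
        if (unlikely(flags & __GFP_THISNODE))
                return ZONELIST_NOFALLBACK;
#endif
        return ZONELIST_FALLBACK;
}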
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
[n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
Signed-off-by: Yaowei Bai
Cc: Michal Hocko
Cc: David Rientjes
Signed-off-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since commit a0b8cab3b9b2 ("mm: remove lru parameter from
__pagevec_lru_add and remove parts of pagevec API") there's no
user of this function anymore, so remove it.

Signed-off-by: Yaowei Bai
Acked-by: Michal Hocko
Acked-by: Hillf Danton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make memblock_is_memory() and memblock_is_reserved return bool to
improve readability due to these particular functions only using either
one or zero as their return value.

No functional change.
Signed-off-by: Yaowei Bai
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make is_file_hugepages() return bool to improve readability due to this
particular function only using either one or zero as its return value.

This patch also removes the if condition so that is_file_hugepages
returns directly.

No functional change.
Signed-off-by: Yaowei Bai
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move node_id, zone_idx and shrink flags into the trace function, so that we
don't need to calculate these args if the trace is disabled, and it will make
this function have fewer arguments.

Signed-off-by: yalin wang
Reviewed-by: Steven Rostedt
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now, we have tracepoint in test_pages_isolated() to notify pfn which
cannot be isolated. But, in alloc_contig_range(), some error path
doesn't call test_pages_isolated() so it's still hard to know exact pfn
that causes allocation failure.

This patch changes this situation by calling test_pages_isolated() in
almost every error path. In the allocation failure case, some overhead is
added by this change, but allocation failure is a really rare event so it
would not matter.

In the fatal signal pending case, we don't call test_pages_isolated()
because this failure is an intentional one.

There was a bogus outer_start problem due to unchecked buddy order and
this patch also fixes it. Before this patch, it didn't matter, because
the end result is the same. But, after this patch, the tracepoint will
report the failed pfn so it should be accurate.

Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Acked-by: Michal Nazarewicz
Cc: David Rientjes
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
cma allocation should be guaranteed to succeed. But sometimes it can
fail in the current implementation. To track down the problem, we need
to know which page is problematic and this new tracepoint will report
it.

Signed-off-by: Joonsoo Kim
Acked-by: Michal Nazarewicz
Acked-by: David Rientjes
Cc: Minchan Kim
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This is a preparation step to report the pfn that failed the test in a new
tracepoint, to analyze the cma allocation failure problem. There is no
functional change in this patch.

Signed-off-by: Joonsoo Kim
Acked-by: David Rientjes
Acked-by: Michal Nazarewicz
Cc: Minchan Kim
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When running the SPECint_rate gcc on some very large boxes it was
noticed that the system was spending lots of time in
mpol_shared_policy_lookup(). The gamess benchmark can also show it and
is what I mostly used to chase down the issue since I found its setup to
be easier.

To be clear, the binaries were on tmpfs because of disk I/O requirements.
We then used text replication to avoid icache misses and having all the
copies banging on the memory where the instruction code resides. This
results in us hitting a bottleneck in mpol_shared_policy_lookup() since
lookup is serialised by the shared_policy lock.

I have only reproduced this on very large (3k+ cores) boxes. The
problem starts showing up at just a few hundred ranks getting worse
until it threatens to livelock once it gets large enough. For example
on the gamess benchmark at 128 ranks this area consumes only ~1% of
time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
90%.

To alleviate the contention in this area I converted the spinlock to an
rwlock. This allows a large number of lookups to happen simultaneously.
The results were quite good, reducing this consumption at max ranks to
around 2%.

[akpm@linux-foundation.org: tidy up code comments]
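A hedged sketch of the lookup side of that conversion (reconstructed from the
description, not copied from the diff; the insert and delete paths take
write_lock()/write_unlock() in the same way):

/* struct shared_policy now embeds an rwlock_t where it used to have a
 * spinlock_t; lookups take it shared, updates take it exclusive. */
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                            unsigned long idx)
{
        struct mempolicy *pol = NULL;
        struct sp_node *sn;

        if (!sp->root.rb_node)
                return NULL;

        read_lock(&sp->lock);           /* lookups no longer serialise */
        sn = sp_lookup(sp, idx, idx + 1);
        if (sn) {
                mpol_get(sn->policy);
                pol = sn->policy;
        }
        read_unlock(&sp->lock);
        return pol;
}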
Signed-off-by: Nathan Zimmer
Acked-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Nadia Yvette Chambers
Cc: Naoya Horiguchi
Cc: Mel Gorman
Cc: "Aneesh Kumar K.V"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__phys_to_pfn and __pfn_to_phys are symmetric, so PHYS_PFN and PFN_PHYS are
symmetric too:

- y = (phys_addr_t)x << PAGE_SHIFT
- y >> PAGE_SHIFT = (phys_addr_t)x
- (unsigned long)(y >> PAGE_SHIFT) = x
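For reference, the pair looks roughly like this (quoted from memory of
include/linux/pfn.h, so treat it as illustrative):

#define PFN_PHYS(x)     ((phys_addr_t)(x) << PAGE_SHIFT)
#define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))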
[akpm@linux-foundation.org: use macro arg name `x']
[arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
Signed-off-by: Chen Gang
Cc: Oleg Nesterov
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move trace_reclaim_flags() into the trace function, so that we don't need
to calculate these flags if the trace is disabled.

Signed-off-by: yalin wang
Reviewed-by: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Simplify may_expand_vm().
[akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
Signed-off-by: Chen Gang
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Before usage, the page pointer initialized to NULL is reinitialized by
follow_page_mask(). Drop the useless init of the page pointer at the
beginning of the loop.

Signed-off-by: Alexey Klimov
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
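Two hedged examples of what this marking looks like in practice (the cache
and call site below are illustrative, not specific hunks from this patch):

/* Slab objects: request accounting for the whole cache at creation. */
vm_area_cachep = kmem_cache_create("vm_area_struct",
                                   sizeof(struct vm_area_struct), 0,
                                   SLAB_PANIC | SLAB_ACCOUNT, NULL);

/* Individual kmalloc or page allocations: opt in per call site. */
buf = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);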
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make vmalloc family functions allocate vmalloc area pages with
alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
to memcg. This is needed, at least, to account alloc_fdmem allocations.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, if we want to account all objects of a particular kmem cache,
we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
to kmem_cache_create will force accounting for every allocation from
this cache even if __GFP_ACCOUNT is not passed.

This patch does not make any of the existing caches use this flag - it
will be done later in the series.

Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
hence cannot have different sets of SLAB_* flags. Thus using this flag
will probably reduce the number of merged slabs even if kmem accounting
is not used (only compiled in).

Signed-off-by: Vladimir Davydov
Suggested-by: Tejun Heo
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So this patch switches kmem accounting to the white-list policy: now only
those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
memcg. Currently, no kmem allocations are marked like this. The
following patches will mark several kmem allocations that are known to
be easily triggered from userspace and therefore should be accounted to
memcg.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT").
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So it was decided to switch to the white-list policy. This patch
reverts bits introducing the black-list policy. The white-list policy
will be introduced later in the series.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
alloc_kmem_pages call) are accounted to memory cgroup automatically.
Callers have to explicitly opt out if they don't want/need accounting
for some reason. Such a design decision leads to several problems:

- kmalloc users are highly sensitive to failures, many of them
implicitly rely on the fact that kmalloc never fails, while memcg
makes failures quite plausible.

- A lot of objects are shared among different containers by design.
Accounting such objects to one of containers is just unfair.
Moreover, it might lead to pinning a dead memcg along with its kmem
caches, which aren't tiny, which might result in noticeable increase
in memory consumption for no apparent reason in the long run.

- There are tons of short-lived objects. Accounting them to memcg will
only result in slight noise and won't change the overall picture, but
we still have to pay accounting overhead.

For more info, see
- http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
- http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

Therefore this patchset switches to the white-list policy. Now kmalloc
users have to explicitly opt in by passing the __GFP_ACCOUNT flag.

Currently, the list of accounted objects is quite limited and only
includes those allocations that (1) are known to be easily triggered
from userspace and (2) can fail gracefully (for the full list see patch
no. 6) and it still misses many object types. However, accounting only
those objects should be a satisfactory approximation of the behavior we
used to have for most sane workloads.

This patch (of 6):
Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
to memcg").Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So it was decided to switch to the white-list policy. This patch reverts
bits introducing the black-list policy. The white-list policy will be
introduced later in the series.

Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a new helper function get_first_slab() that gets the first slab from
a kmem_cache_node.

Signed-off-by: Geliang Tang
Acked-by: Christoph Lameter
Acked-by: David Rientjes
Cc: Pekka Enberg
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Simplify the code with list_for_each_entry().
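A generic before/after illustration of this kind of cleanup (struct foo,
bucket, node and process() are placeholders, not names from the patch):

/* Before: open-coded list_entry() on every element. */
struct foo *item;
struct list_head *p;

list_for_each(p, &bucket) {
        item = list_entry(p, struct foo, node);
        process(item);
}

/* After: the typed iterator expresses the same loop directly. */
list_for_each_entry(item, &bucket, node)
        process(item);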
Signed-off-by: Geliang Tang
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Simplify the code with list_first_entry_or_null().
Signed-off-by: Geliang Tang
Acked-by: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A little cleanup - the invocation site provides the semicolon.
Cc: Rasmus Villemoes
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
lksb flags are defined both in dlmapi.h and dlmcommon.h. So clean them
up from dlmcommon.h.

Signed-off-by: Joseph Qi
Reviewed-by: Jiufei Xue
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Found this when doing patch review; remove it to make the code clearer and
save a little cpu time.

Signed-off-by: Junxiao Bi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In ocfs2_orphan_del, currently it finds and deletes the entry first, and
then accesses the orphan dir dinode. This will have a problem once
ocfs2_journal_access_di fails. In this case, the entry will be removed from
the orphan dir, but indeed the inode hasn't been deleted successfully. In
other words, the file is missing but not actually deleted. So we should
access the orphan dinode first, like unlink and rename do.

Signed-off-by: Joseph Qi
Reviewed-by: Jiufei Xue
Cc: Mark Fasheh
Cc: Joel Becker
Reviewed-by: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but inserts a new mle in the hash
list. dlm_migrate_lockres() will detach the old mle and free the new
one which is already in the hash list, and that will destroy the list.

Signed-off-by: Jiufei Xue
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Reviewed-by: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We have found that the migration source will trigger a BUG that the refcount
of the mle is already zero before put when the target is down during
migration. The situation is as follows:

dlm_migrate_lockres
  dlm_add_migration_mle
  dlm_mark_lockres_migrating
  dlm_get_mle_inuse
  <<<<<< Now the refcount of the mle is 2.
  dlm_send_one_lockres and wait for the target to become the
  new master.
  <<<<<< o2hb detects the target down and cleans the migration
  mle. Now the refcount is 1.

dlm_migrate_lockres is woken, and puts the mle twice when it finds that the
target has gone down, which triggers the BUG with the following message:

"ERROR: bad mle: ".
Signed-off-by: Jiufei Xue
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
DLM does not cache locks. So, blocking lock and unlock will only make
the performance worse where contention over the locks is high.

Signed-off-by: Goldwyn Rodrigues
Cc: Mark Fasheh
Cc: Joel Becker
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The following case will lead to a slot being overwritten.

N1                                       N2
mount ocfs2 volume, find and
allocate slot 0, then set
osb->slot_num to 0, begin to
write slot info to disk
                                         mount ocfs2 volume, wait for super lock
write block fail because of
storage link down, unlock
super lock
                                         got super lock and also allocate slot 0
                                         then unlock super lock

mount fail and then dismount,
since osb->slot_num is 0, try to
put invalid slot to disk. And it
will succeed if storage link
restores.

N2 slot info is now overwritten

Once another node, say N3, mounts, it will find and allocate slot 0 again,
which will lead to a mount hang because the journal has already been locked
by N2. So when writing the slot info fails, invalidate the slot in advance
to avoid overwriting it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Yiwen Jiang
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
dlm_grab() may return NULL when the node is doing unmount. When doing
code review, we found that some dlm handlers may return an error to the
caller when dlm_grab() returns NULL, which makes the caller BUG or causes
other problems. Here is an example:

Node 1                                   Node 2
receives migration message
from node 3, and send
migrate request to others
                                         start unmounting
                                         receives migrate request
                                         from node 1 and call
                                         dlm_migrate_request_handler()
                                         unmount thread unregisters
                                         domain handlers and removes
                                         dlm_context from dlm_domains
                                         dlm_migrate_request_handler()
                                         returns -EINVAL to node 1

Exit migration, neither clearing the
migration state nor sending an
assert master message to node 3, which
causes node 3 to hang.

Signed-off-by: Jiufei Xue
Reviewed-by: Joseph Qi
Reviewed-by: Yiwen Jiang
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds