15 Jan, 2016

40 commits

  • Currently, looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implications for actual
    memory use are quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As preparation for extending the userspace interfaces,
    this patch adds an MM_SHMEMPAGES counter to mm_rss_stat to account for
    shmem pages separately from MM_FILEPAGES. The next patch will expose it
    to userspace; this patch doesn't change the exported values yet, since
    it adds MM_SHMEMPAGES to MM_FILEPAGES at the places where MM_FILEPAGES
    was used before. The only user-visible change after this patch is in
    the OOM killer message, which now reports "shmem-rss" separately from
    "file-rss".
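
    For illustration, a standalone sketch of the counter split (the struct
    layout and the helper below are simplified mock-ups, not the kernel's
    actual mm_rss_stat definitions):

    /* Standalone illustration (not kernel code): the per-mm RSS counters
     * gain MM_SHMEMPAGES, and the value reported where MM_FILEPAGES alone
     * used to be reported becomes the sum of the two, so the exported
     * numbers do not change yet. */
    #include <stdio.h>

    enum {
            MM_FILEPAGES,   /* pages of regular file mappings */
            MM_ANONPAGES,   /* anonymous pages */
            MM_SWAPENTS,    /* anonymous swap entries */
            MM_SHMEMPAGES,  /* shmem/tmpfs pages, tracked separately now */
            NR_MM_COUNTERS
    };

    struct mm_rss_stat {
            long count[NR_MM_COUNTERS];     /* atomic in the real kernel */
    };

    /* Hypothetical helper mirroring "add up MM_SHMEMPAGES to MM_FILEPAGES
     * at places where MM_FILEPAGES was used before". */
    static long reported_file_rss(const struct mm_rss_stat *rss)
    {
            return rss->count[MM_FILEPAGES] + rss->count[MM_SHMEMPAGES];
    }

    int main(void)
    {
            struct mm_rss_stat rss = { .count = { [MM_FILEPAGES] = 100,
                                                  [MM_SHMEMPAGES] = 40 } };

            printf("file-rss as exported today: %ld pages\n",
                   reported_file_rss(&rss));
            return 0;
    }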

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Following the previous patch, a further reduction of the
    /proc/pid/smaps cost is possible for private writable shmem mappings
    with unpopulated areas, where the page walk invokes the .pte_hole
    function. We can use a radix tree iterator for each such area instead
    of calling find_get_entry() in a loop. This comes at the extra
    maintenance cost of introducing another shmem function,
    shmem_partial_swap_usage().
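
    A rough kernel-context sketch of such a walk (the locking, retry and
    cond_resched() details of the real shmem_partial_swap_usage() are
    omitted; treat this as an illustration, not the actual implementation):

    #include <linux/fs.h>
    #include <linux/radix-tree.h>

    /* Count swapped-out shmem pages in [start, end) with a single radix
     * tree walk instead of one find_get_entry() call per index. */
    static unsigned long swap_usage_sketch(struct address_space *mapping,
                                           pgoff_t start, pgoff_t end)
    {
            struct radix_tree_iter iter;
            void **slot;
            unsigned long swapped = 0;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                    void *entry;

                    if (iter.index >= end)
                            break;
                    entry = radix_tree_deref_slot(slot);
                    /* swapped-out shmem pages are stored as exceptional entries */
                    if (radix_tree_exceptional_entry(entry))
                            swapped++;
            }
            rcu_read_unlock();

            return swapped << PAGE_SHIFT;
    }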

    To demonstrate the difference, I have measured this on a process that
    creates a private writable 2GB mapping of a partially swapped-out
    /dev/shm/file (which cannot employ the optimizations from the previous
    patch) and doesn't populate it at all. I time how long it takes to cat
    /proc/pid/smaps of this process 100 times.

    Before this patch:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    After this patch:

    real 0m1.176s
    user 0m0.180s
    sys 0m0.684s

    The time is similar to the case where a radix tree iterator is employed
    on the whole mapping.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The previous patch improved swap accounting for shmem mappings, but it
    also made /proc/pid/smaps more expensive for them, as we consult the
    radix tree for each pte_none entry, so the overall complexity is
    O(n*log(n)).

    We can reduce this significantly for mappings that cannot contain COWed
    pages, because then we can either use the statistics that the shmem
    object itself tracks (if the mapping contains the whole object, or the
    swap usage of the whole object is zero), or use the radix tree
    iterator, which is much more efficient than repeated find_get_entry()
    calls.

    This patch therefore introduces a function shmem_swap_usage(vma) and
    makes /proc/pid/smaps use it when possible. Only for writable private
    mappings of shmem objects (i.e. tmpfs files) with the shmem object
    itself (partially) swapped out do we have to resort to the
    find_get_entry() approach.

    Hopefully such mappings are relatively uncommon.
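
    A kernel-context sketch of that decision (object_swapped_bytes() and
    partial_swap_usage() are illustrative stand-ins for the internal shmem
    statistic and for the radix tree walk mentioned above; locking and race
    handling are omitted):

    unsigned long shmem_swap_usage(struct vm_area_struct *vma)
    {
            struct inode *inode = file_inode(vma->vm_file);
            unsigned long swapped = object_swapped_bytes(inode);

            /* Nothing of the object is in swap: trivially zero. */
            if (!swapped)
                    return 0;

            /* The vma maps the whole object: reuse the object-wide stat. */
            if (!vma->vm_pgoff &&
                vma->vm_end - vma->vm_start >= i_size_read(inode))
                    return swapped;

            /* Otherwise walk only the relevant part of the radix tree. */
            return partial_swap_usage(inode->i_mapping,
                                      linear_page_index(vma, vma->vm_start),
                                      linear_page_index(vma, vma->vm_end));
    }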

    To demonstrate the difference, I have measured this on a process that
    creates a 2GB mapping and dirties single pages with a stride of 2MB,
    and timed how long it takes to cat /proc/pid/smaps of this process 100
    times.

    Private writable mapping of a /dev/shm/file (the most complex case):

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    Shared mapping covering almost all of a partially swapped /dev/shm/file
    (which needs to employ the radix tree iterator):

    real 0m1.351s
    user 0m0.096s
    sys 0m0.768s

    Same, but with the /dev/shm/file not swapped (so no radix tree walk is
    needed):

    real 0m0.935s
    user 0m0.128s
    sys 0m0.344s

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    The cost is now much closer to the private anonymous mapping case, unless
    the shmem mapping is private and writable.

    Signed-off-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Jerome Marchand
    Cc: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
    shmem-backed mappings, even if the mapped portion does contain pages
    that were swapped out. This is because, unlike private anonymous
    mappings, shmem does not change the pte to a swap entry but to pte_none
    when swapping the page out. In the smaps page walk, such a page thus
    looks like it was never faulted in.

    This patch changes smaps_pte_entry() to determine the swap status of
    such pte_none entries for shmem mappings, similarly to how
    mincore_page() does it. Swapped-out shmem pages are thus accounted
    for. For private mappings of tmpfs files that COWed some of the pages,
    the swapped-out status of the original shmem pages is naturally
    ignored. If some of the private copies were also swapped out, they are
    accounted via their page table swap entries, so the resulting reported
    swap usage is the sum of both the swapped-out private copies and the
    swapped-out shmem pages that were not COWed. No double accounting can
    thus happen.

    The accounting is arguably still not as precise as for private
    anonymous mappings, since we will now also count pages that the process
    in question never accessed, but that another process populated and then
    let become swapped out. I believe it is still less confusing and
    subtle than not showing any swap usage for shmem mappings at all. The
    swapped-out counter might be of interest to users who would like to
    prevent future swapins during a performance-critical operation and
    pre-fault the pages at their convenience. Especially for larger
    swapped-out regions, the cost of swapin is much higher than a fresh
    page allocation, so differentiating between pte_none and swapped out is
    important for those use cases.

    One downside of this patch is that it makes /proc/pid/smaps more
    expensive for shmem mappings, as we consult the radix tree for each
    pte_none entry, so the overall complexity is O(n*log(n)). I have
    measured this on a process that creates a 2GB mapping and dirties
    single pages with a stride of 2MB, and timed how long it takes to cat
    /proc/pid/smaps of this process 100 times.
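
    A userspace sketch of this kind of measurement (the file name, constants
    and the in-process timing loop are mine; the original numbers below were
    obtained by simply timing cat /proc/<pid>/smaps externally):

    /* Map 2GB of a tmpfs file privately, dirty one page every 2MB, then
     * read the process's own smaps 100 times and report how long it took.
     * Build with: cc -O2 -o smaps-bench smaps-bench.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define MAP_SIZE (2UL << 30)            /* 2 GB */
    #define STRIDE   (2UL << 20)            /* dirty one page every 2 MB */

    int main(void)
    {
            int fd = open("/dev/shm/smaps-bench", O_RDWR | O_CREAT, 0600);
            if (fd < 0 || ftruncate(fd, MAP_SIZE) != 0) {
                    perror("setup");
                    return 1;
            }
            char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);  /* private writable */
            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            for (size_t off = 0; off < MAP_SIZE; off += STRIDE)
                    map[off] = 1;           /* dirty a single page per 2MB */

            struct timespec t0, t1;
            char buf[1 << 16];
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < 100; i++) { /* mimic "cat smaps" 100 times */
                    int sfd = open("/proc/self/smaps", O_RDONLY);
                    ssize_t n;

                    if (sfd < 0) {
                            perror("open smaps");
                            return 1;
                    }
                    while ((n = read(sfd, buf, sizeof(buf))) > 0)
                            ;               /* discard; we only time it */
                    close(sfd);
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("100 reads of smaps took %.3f s\n",
                   (t1.tv_sec - t0.tv_sec) +
                   (t1.tv_nsec - t0.tv_nsec) / 1e9);

            munmap(map, MAP_SIZE);
            close(fd);
            unlink("/dev/shm/smaps-bench");
            return 0;
    }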

    Private anonymous mapping:

    real 0m0.949s
    user 0m0.116s
    sys 0m0.348s

    Mapping of a /dev/shm/file:

    real 0m3.831s
    user 0m0.180s
    sys 0m3.212s

    The difference is rather substantial, so the next patch will reduce the
    cost for shared or read-only mappings.

    In a less controlled experiment, I've gathered pids of processes on my
    desktop that have either '/dev/shm/*' or 'SYSV*' in smaps. This
    included the Chrome browser and some KDE processes. Again, I've run cat
    /proc/pid/smaps on each 100 times.

    Before this patch:

    real 0m9.050s
    user 0m0.518s
    sys 0m8.066s

    After this patch:

    real 0m9.221s
    user 0m0.541s
    sys 0m8.187s

    This suggests low impact on average systems.

    Note that this patch doesn't attempt to adjust the SwapPss field for
    shmem mappings, which would need extra work to determine who else could
    have the pages mapped. Thus the value stays zero except for COWed
    swapped out pages in a shmem mapping, which are accounted as usual.

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Jerome Marchand
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This series is based on Jerome Marchand's [1] so let me quote the first
    paragraph from there:

    There are several shortcomings with the accounting of shared memory
    (sysV shm, shared anonymous mapping, mapping to a tmpfs file). The
    values in /proc/<pid>/status and statm don't allow one to distinguish
    between shmem memory and a shared mapping to a regular file, even
    though their implications on memory usage are quite different: at
    reclaim, a file mapping can be dropped or written back to disk while
    shmem needs a place in swap. As for shmem pages that are swapped out
    or in the swap cache, they aren't accounted at all.

    My original motivation is that a customer found it (IMHO rightfully)
    confusing that e.g. top's output for process swap usage is unreliable
    with respect to swapped-out shmem pages, which are not accounted for.

    The fundamental difference between private anonymous and shmem pages is
    that the latter have their PTEs converted to pte_none, not to swapents.
    As such, they are not accounted in the number of swapents visible e.g.
    in the /proc/pid/status VmSwap row. It might be theoretically possible
    to use swapents when swapping out shmem (without extra cost, as one has
    to change all mappers anyway), and on swap-in only convert the swapent
    for the faulting process, leaving swapents in other processes until
    they also fault (so again no extra cost). But I don't know how many
    assumptions this would break, and it would be too disruptive a change
    for a relatively small benefit.

    Instead, my approach is to document the limitation of VmSwap, and to
    provide a means to determine the swap usage of shmem areas for those
    who are interested and willing to pay the price, using
    /proc/pid/smaps. Outside of ipcs, I don't think it's currently
    possible to determine this usage at all. The previous patchset [1] did
    introduce new shmem-specific fields into the smaps output, and
    functions to determine the values. I take a simpler approach, noting
    that the smaps output already has a "Swap: X kB" line, where currently
    X == 0 always for shmem areas. I think we can just consider this a bug
    and provide the proper value by consulting the radix tree, as e.g.
    mincore_page() does. In the patch changelog I explain why this is also
    not perfect (and cannot be without swapents), but still arguably much
    better than showing a 0.

    The last two patches are adapted from Jerome's patchset and provide a
    VmRSS breakdown into RssAnon, RssFile and RssShm in /proc/pid/status.
    Hugh noted that this is a welcome addition, and I agree that it might
    help e.g. debugging process memory usage, at an albeit non-zero but
    still rather low cost of an extra per-mm counter and some page flag
    checks.

    [1] http://lwn.net/Articles/611966/

    This patch (of 6):

    The documentation for /proc/pid/status does not mention that the value
    of VmSwap counts only swapped out anonymous private pages, and not
    swapped out pages of the underlying shmem objects (for shmem mappings).
    This is not obvious, so document this limitation.

    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Jerome Marchand
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Make memmap_valid_within return bool due to this particular function
    only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • To make the intention clearer, use list_{next,first}_entry instead of
    list_entry.
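
    For illustration, a kernel-context sketch of the kind of substitution
    meant here (struct item is a made-up example type):

    #include <linux/list.h>

    struct item {
            int value;
            struct list_head node;
    };

    /* Before: the reader has to spot the "->next" to see what is meant. */
    static struct item *first_item_old(struct list_head *head)
    {
            return list_entry(head->next, struct item, node);
    }

    /* After: list_first_entry() states the intent directly. */
    static struct item *first_item_new(struct list_head *head)
    {
            return list_first_entry(head, struct item, node);
    }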

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • __alloc_pages_slowpath loops over ALLOC_NO_WATERMARKS requests if
    __GFP_NOFAIL is requested. This is fragile because we are basically
    relying on somebody else to do the reclaim (be it direct reclaim or the
    OOM killer) for us. The caller might be holding resources (e.g.
    locks) which block other reclaimers from making any progress, for
    example. Remove the retry loop and rely on __alloc_pages_slowpath to
    invoke all allowed reclaim steps and retry logic.

    We have to be careful about __GFP_NOFAIL allocations from the
    PF_MEMALLOC context, even though this is a very bad idea to begin with
    because no progress can be guaranteed at all. We shouldn't break the
    __GFP_NOFAIL semantics here, though. It could be argued that this is
    essentially a GFP_NOWAIT context, which we do not support, but
    PF_MEMALLOC is much harder to check for in existing users because the
    allocation might happen deep down a code path executed long after the
    flag was set, so we cannot really rule out that some kernel path
    triggers this combination.

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_high_priority doesn't do anything special other than call
    get_page_from_freelist and loop around a GFP_NOFAIL allocation until it
    succeeds. It would be better if the first part was done in
    __alloc_pages_slowpath, where we modify the zonelist, because this
    would be easier to read and understand. Open-coding the function into
    its only caller also allows it to be simplified a bit.

    This patch doesn't introduce any functional changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Hardcoding the index into the zonelists array in gfp_zonelist() is not
    a good idea; let's enumerate it to improve readability.

    No functional change.
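
    A kernel-context sketch of the shape of the change (the enumerator
    names here are illustrative, not verified against the patch):

    enum {
            ZONELIST_FALLBACK,              /* zonelist with fallback */
    #ifdef CONFIG_NUMA
            ZONELIST_NOFALLBACK,            /* zonelist for __GFP_THISNODE */
    #endif
            MAX_ZONELISTS
    };

    static inline int gfp_zonelist(gfp_t flags)
    {
    #ifdef CONFIG_NUMA
            if (unlikely(flags & __GFP_THISNODE))
                    return ZONELIST_NOFALLBACK;
    #endif
            return ZONELIST_FALLBACK;       /* instead of a hardcoded 0/1 */
    }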

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
    Signed-off-by: Yaowei Bai
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Since commit a0b8cab3b9b2 ("mm: remove lru parameter from
    __pagevec_lru_add and remove parts of pagevec API") there's no
    user of this function anymore, so remove it.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make memblock_is_memory() and memblock_is_reserved() return bool to
    improve readability, due to these particular functions only using
    either one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make is_file_hugepages() return bool to improve readability, due to
    this particular function only using either one or zero as its return
    value.

    This patch also removes the if condition so that is_file_hugepages()
    returns directly.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Move the node_id, zone_idx and shrink-flags calculations into the trace
    function, so that we don't need to calculate these arguments when the
    tracepoint is disabled, and the function takes fewer arguments.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Now we have a tracepoint in test_pages_isolated() to report a pfn that
    cannot be isolated. But in alloc_contig_range(), some error paths
    don't call test_pages_isolated(), so it's still hard to know the exact
    pfn that causes the allocation failure.

    This patch changes the situation by calling test_pages_isolated() in
    almost every error path. In the allocation failure case, some overhead
    is added by this change, but allocation failure is a really rare event,
    so it should not matter.

    In the fatal-signal-pending case, we don't call test_pages_isolated(),
    because that failure is an intentional one.

    There was also a bogus outer_start problem due to an unchecked buddy
    order, and this patch fixes it. Before this patch it didn't matter,
    because the end result was the same; but after this patch the
    tracepoint will report the failed pfn, so it should be accurate.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Acked-by: Michal Nazarewicz
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • CMA allocation should be guaranteed to succeed, but sometimes it can
    fail in the current implementation. To track down the problem, we need
    to know which page is problematic, and this new tracepoint will report
    it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation step for reporting the failed pfn in a new
    tracepoint, to analyze CMA allocation failure problems. There is no
    functional change in this patch.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Michal Nazarewicz
    Cc: Minchan Kim
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When running the SPECint_rate gcc benchmark on some very large boxes,
    it was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue, since I found its setup
    to be easier.

    To be clear, the binaries were on tmpfs because of disk I/O
    requirements. We then used text replication to avoid icache misses and
    having all the copies banging on the memory where the instruction code
    resides. This results in us hitting a bottleneck in
    mpol_shared_policy_lookup(), since the lookup is serialised by the
    shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks, getting worse
    until it threatens to livelock once it gets large enough. For example,
    on the gamess benchmark at 128 ranks this area consumes only ~1% of the
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area, I converted the spinlock to
    an rwlock. This allows a large number of lookups to happen
    simultaneously. The results were quite good, reducing this consumption
    at max ranks to around 2%.
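
    A kernel-context sketch of the conversion pattern (the structure and
    functions below are simplified stand-ins, not the actual shared_policy
    code):

    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    struct policy_root {
            rwlock_t lock;                  /* was: spinlock_t */
            struct rb_root tree;
    };

    /* Lookups only take the read side, so many of them can run at once. */
    static struct rb_node *policy_lookup(struct policy_root *root)
    {
            struct rb_node *n;

            read_lock(&root->lock);
            n = root->tree.rb_node;         /* ...walk the tree here... */
            read_unlock(&root->lock);
            return n;
    }

    /* Inserts and removals still take the write side exclusively. */
    static void policy_insert(struct policy_root *root, struct rb_node *new)
    {
            write_lock(&root->lock);
            /* ...rb_link_node()/rb_insert_color() here... */
            write_unlock(&root->lock);
    }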

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     
  • __phys_to_pfn and __pfn_to_phys are symmetric, and PHYS_PFN and
    PFN_PHYS are symmetric as well:

    - if y = PFN_PHYS(x) = (phys_addr_t)x << PAGE_SHIFT,

    - then y >> PAGE_SHIFT = (phys_addr_t)x,

    - and PHYS_PFN(y) = (unsigned long)(y >> PAGE_SHIFT) = x.
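
    A self-contained userspace check of that round trip (the PAGE_SHIFT
    value and the phys_addr_t typedef are stand-ins for the kernel
    definitions; the macro bodies follow the relations above):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t phys_addr_t;           /* stand-in for the kernel type */
    #define PAGE_SHIFT 12                   /* stand-in for the kernel value */

    #define PFN_PHYS(x)     ((phys_addr_t)(x) << PAGE_SHIFT)
    #define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))

    int main(void)
    {
            unsigned long pfn = 0x12345;
            phys_addr_t phys = PFN_PHYS(pfn);   /* y = (phys_addr_t)x << PAGE_SHIFT */

            assert(PHYS_PFN(phys) == pfn);      /* (unsigned long)(y >> PAGE_SHIFT) == x */
            printf("pfn 0x%lx <-> phys 0x%llx\n",
                   pfn, (unsigned long long)phys);
            return 0;
    }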

    [akpm@linux-foundation.org: use macro arg name `x']
    [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
    Signed-off-by: Chen Gang
    Cc: Oleg Nesterov
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Move trace_reclaim_flags() into the trace function, so that we don't
    need to calculate these flags when the tracepoint is disabled.

    Signed-off-by: yalin wang
    Reviewed-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Simplify may_expand_vm().

    [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
    Signed-off-by: Chen Gang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • The page pointer is initialized to NULL but is reinitialized by
    follow_page_mask() before use. Drop the useless initialization of the
    page pointer at the beginning of the loop.

    Signed-off-by: Alexey Klimov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).
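
    As an illustration of what such a conversion looks like at a single
    allocation site (kernel-context sketch; the structure is a made-up
    example, not one of the objects listed above):

    #include <linux/gfp.h>
    #include <linux/slab.h>

    struct per_user_thing {
            int id;
            char name[32];
    };

    /* A userspace-triggerable allocation opts in to memcg accounting by
     * adding __GFP_ACCOUNT to its gfp mask; nothing else changes. */
    static struct per_user_thing *alloc_thing(void)
    {
            return kmalloc(sizeof(struct per_user_thing),
                           GFP_KERNEL | __GFP_ACCOUNT);
    }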

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Make vmalloc family functions allocate vmalloc area pages with
    alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
    to memcg. This is needed, at least, to account alloc_fdmem allocations.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, if we want to account all objects of a particular kmem cache,
    we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
    inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed
    to kmem_cache_create will force accounting for every allocation from
    this cache even if __GFP_ACCOUNT is not passed.
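
    A kernel-context sketch of using the new flag (the cache name and
    object type are made up):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/slab.h>

    struct widget {
            unsigned long id;
            char payload[64];
    };

    static struct kmem_cache *widget_cachep;

    static int __init widget_cache_init(void)
    {
            /* Every allocation from this cache is accounted to memcg,
             * without passing __GFP_ACCOUNT to each kmem_cache_alloc(). */
            widget_cachep = kmem_cache_create("widget_cache",
                                              sizeof(struct widget), 0,
                                              SLAB_ACCOUNT, NULL);
            return widget_cachep ? 0 : -ENOMEM;
    }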

    This patch does not make any of the existing caches use this flag - it
    will be done later in the series.

    Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
    SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
    hence cannot have different sets of SLAB_* flags. Thus using this flag
    will probably reduce the number of merged slabs even if kmem accounting
    is not used (only compiled in).

    Signed-off-by: Vladimir Davydov
    Suggested-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So this patch switches kmem accounting to the white-list policy: now
    only those kmem allocations that are marked as __GFP_ACCOUNT are
    accounted to memcg. Currently, no kmem allocations are marked like
    this. The following patches will mark several kmem allocations that
    are known to be easily triggered from userspace and therefore should be
    accounted to memcg.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This reverts commit 8f4fc071b192 ("gfp: add __GFP_NOACCOUNT").

    The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch
    reverts bits introducing the black-list policy. The white-list policy
    will be introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
    alloc_kmem_pages call) are accounted to memory cgroup automatically.
    Callers have to explicitly opt out if they don't want/need accounting
    for some reason. Such a design decision leads to several problems:

    - kmalloc users are highly sensitive to failures; many of them
    implicitly rely on the fact that kmalloc never fails, while memcg makes
    failures quite plausible.

    - A lot of objects are shared among different containers by design.
    Accounting such objects to one of the containers is just unfair.
    Moreover, it might lead to pinning a dead memcg along with its kmem
    caches, which aren't tiny, and might result in a noticeable increase in
    memory consumption for no apparent reason in the long run.

    - There are tons of short-lived objects. Accounting them to memcg will
    only result in slight noise and won't change the overall picture, but
    we still have to pay the accounting overhead.

    For more info, see

    - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
    - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza

    Therefore this patchset switches to the white list policy. Now kmalloc
    users have to explicitly opt in by passing __GFP_ACCOUNT flag.

    Currently, the list of accounted objects is quite limited and only
    includes those allocations that (1) are known to be easily triggered
    from userspace and (2) can fail gracefully (for the full list see patch
    no. 6) and it still misses many object types. However, accounting only
    those objects should be a satisfactory approximation of the behavior we
    used to have for most sane workloads.

    This patch (of 6):

    Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations
    to memcg").

    The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
    to be fragile and difficult to maintain, because there seem to be many
    more allocations that should not be accounted than those that should
    be. Besides, falsely accounting an allocation might have much worse
    consequences than not accounting it at all, namely increased memory
    consumption due to pinned dead kmem caches.

    So it was decided to switch to the white-list policy. This patch reverts
    bits introducing the black-list policy. The white-list policy will be
    introduced later in the series.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add a new helper function, get_first_slab(), that gets the first slab
    from a kmem_cache_node.

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_for_each_entry().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Simplify the code with list_first_entry_or_null().

    Signed-off-by: Geliang Tang
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • A little cleanup - the invocation site provides the semicolon.

    Cc: Rasmus Villemoes
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • lksb flags are defined in both dlmapi.h and dlmcommon.h, so remove the
    duplicate definitions from dlmcommon.h.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Found this during patch review; remove it to make the code clearer and
    save a little CPU time.

    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • In ocfs2_orphan_del, the code currently finds and deletes the entry
    first, and then accesses the orphan dir dinode. This is a problem once
    ocfs2_journal_access_di fails: the entry will have been removed from
    the orphan dir, but the inode hasn't actually been deleted
    successfully. In other words, the file is missing but not actually
    deleted. So we should access the orphan dinode first, as unlink and
    rename do.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • When two processes are migrating the same lockres,
    dlm_add_migration_mle() returns -EEXIST but inserts a new mle into the
    hash list. dlm_migrate_lockres() will then detach the old mle and free
    the new one, which is already in the hash list, and that will corrupt
    the list.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • We have found that the migration source will trigger a BUG when the
    target goes down during migration, because the refcount of the mle is
    already zero before the put. The situation is as follows:

    dlm_migrate_lockres
      dlm_add_migration_mle
      dlm_mark_lockres_migrating
      dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
      dlm_send_one_lockres, then wait for the target to become the new
      master.
        <<<<<< o2hb detects that the target is down and cleans up the
        migration mle. Now the refcount is 1.

    dlm_migrate_lockres is then woken and puts the mle twice when it finds
    that the target has gone down, which triggers the BUG with the
    following message: "ERROR: bad mle: ".

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    xuejiufei
     
  • DLM does not cache locks, so blocking lock and unlock calls will only
    make performance worse where contention over the locks is high.

    Signed-off-by: Goldwyn Rodrigues
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • The following case will lead to a slot being overwritten:

    1. N1 mounts the ocfs2 volume, finds and allocates slot 0, sets
       osb->slot_num to 0, and begins to write the slot info to disk.
    2. N2 mounts the same volume and waits for the super lock.
    3. On N1 the block write fails because the storage link is down, and
       N1 unlocks the super lock.
    4. N2 gets the super lock, also allocates slot 0, and then unlocks the
       super lock.
    5. The N1 mount fails and the volume is dismounted. Since
       osb->slot_num is 0, N1 tries to put the invalid slot to disk, and
       this will succeed once the storage link is restored. N2's slot info
       is now overwritten.

    Once another node, say N3, mounts, it will find and allocate slot 0
    again, which will lead to a mount hang because the journal has already
    been locked by N2. So when writing the slot info fails, invalidate the
    slot in advance to avoid overwriting it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Yiwen Jiang
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • dlm_grab() may return NULL when the node is unmounting. During code
    review, we found that some dlm handlers return an error to the caller
    when dlm_grab() returns NULL, which makes the caller BUG or causes
    other problems. Here is an example:

    1. Node 1 receives a migration message from node 3 and sends a migrate
       request to the other nodes.
    2. Node 2 starts unmounting.
    3. Node 2 receives the migrate request from node 1 and calls
       dlm_migrate_request_handler().
    4. Node 2's unmount thread unregisters the domain handlers and removes
       the dlm_context from dlm_domains.
    5. dlm_migrate_request_handler() returns -EINVAL to node 1.
    6. Node 1 exits migration without clearing the migration state or
       sending an assert master message to node 3, which causes node 3 to
       hang.

    Signed-off-by: Jiufei Xue
    Reviewed-by: Joseph Qi
    Reviewed-by: Yiwen Jiang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei