23 Aug, 2018

40 commits

  • The number of CPUs is never high enough to force 64-bit arithmetic.
    Save a couple of bytes on x86_64.

    Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • ->latency_record is defined as

    struct latency_record[LT_SAVECOUNT];

    so use the same macro while iterating.

    Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The code checks whether the write is done by current to its own attributes.
    For that, a get/put pair is unnecessary as the check can be done under RCU.

    Note: rcu_read_unlock() can be done even earlier since the pointer to the
    task is not dereferenced. It depends on whether /proc code should look
    scary or not:

    rcu_read_lock();
    task = pid_task(...);
    rcu_read_unlock();
    if (!task)
            return -ESRCH;
    if (task != current)
            return -EACCES;

    P.S.: rename the "length" variable. Code like this

    length = -EINVAL;

    should not exist.

    Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Readdir context is thread-local, so ->pos is thread-local;
    move it outside of the read lock.

    Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Same story: I have a WIP patch to make it faster, so it's better to have a
    test as well.

    Link: http://lkml.kernel.org/r/20180627195209.GC18113@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There are plans to change how the /proc/self result is calculated;
    for that, a test is necessary.

    Use a direct system call because of this whole getpid() caching story (a
    minimal sketch of such a test follows at the end of this entry).

    Link: http://lkml.kernel.org/r/20180627195103.GB18113@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
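
    A minimal sketch of what such a test could look like, assuming only the
    standard syscall(2) and readlink(2) interfaces (this is an illustration,
    not the actual selftest code):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
            char link[64], expected[64];
            ssize_t len;
            /* Bypass any libc getpid() caching by issuing the syscall directly. */
            pid_t pid = (pid_t)syscall(SYS_getpid);

            /* /proc/self is a symlink to the tgid of the calling process. */
            len = readlink("/proc/self", link, sizeof(link) - 1);
            if (len < 0) {
                    perror("readlink");
                    return 1;
            }
            link[len] = '\0';

            snprintf(expected, sizeof(expected), "%d", (int)pid);
            if (strcmp(link, expected) != 0) {
                    fprintf(stderr, "/proc/self = %s, expected %s\n", link, expected);
                    return 1;
            }
            return 0;
    }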
     
  • get_monotonic_boottime() is deprecated and uses the old timespec type.
    Let's convert /proc/uptime to use ktime_get_boottime_ts64().

    Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Deepa Dinamani
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • 24074a35c5c975 ("proc: Make inline name size calculation automatic")
    started to put PDE allocations into kmalloc-256, which is unnecessary as
    ~40-character names are very rare.

    Put the allocation back into the kmalloc-192 cache for 64-bit non-debug
    builds.

    Add a BUILD_BUG_ON to know when the PDE size has gotten out of control.

    [adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
    Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
    Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: Al Viro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
    NODES_SHIFT is higher than 8; otherwise it declares it on the stack.

    The comment says that the reasoning behind this is that nodemask_t will be
    256 bytes when NODES_SHIFT is higher than 8, but this is not true. For
    example, NODES_SHIFT = 9 will give us a 64-byte nodemask_t (a sketch of
    the arithmetic follows at the end of this entry). Let us fix up the
    comment for that.

    Another thing is that it might make sense to let values lower than
    128 bytes be allocated on the stack. Although this all depends on the
    depth of the stack (and this changes from function to function), I think
    that 64 bytes is something we can easily afford. So we could even bump
    the limit by 1 (from > 8 to > 9).

    Link: http://lkml.kernel.org/r/20180820085516.9687-1-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
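
    As a quick check of the arithmetic above, a tiny userspace sketch assuming
    the usual bitmap layout of one unsigned long per BITS_PER_LONG bits (which
    is how nodemask_t is laid out):

    #include <stdio.h>
    #include <limits.h>

    /* Bytes taken by a bitmap of (1 << nodes_shift) bits, rounded up to
     * whole unsigned longs. */
    static unsigned long nodemask_bytes(unsigned int nodes_shift)
    {
            unsigned long bits = 1UL << nodes_shift;
            unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;

            return (bits + bits_per_long - 1) / bits_per_long * sizeof(unsigned long);
    }

    int main(void)
    {
            /* On 64-bit: shift 8 -> 32 bytes, 9 -> 64 bytes, 11 -> 256 bytes. */
            for (unsigned int shift = 8; shift <= 11; shift++)
                    printf("NODES_SHIFT=%u -> %lu bytes\n", shift, nodemask_bytes(shift));
            return 0;
    }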
     
  • The call to strlcpy in backing_dev_store is incorrect. It should take
    the size of the destination buffer instead of the size of the source
    buffer. Additionally, ignore the newline character (\n) when reading
    the new file_name buffer. This makes it possible to set the backing_dev
    as follows:

    echo /dev/sdX > /sys/block/zram0/backing_dev

    The reason it worked before was the fact that strlcpy() copies 'len - 1'
    bytes, which is strlen(buf) - 1 in our case, so it accidentally didn't
    copy the trailing newline symbol. That also means that "echo -n
    /dev/sdX" was most likely broken (a small userspace demonstration follows
    at the end of this entry).

    Signed-off-by: Peter Kalauskas
    Link: http://lkml.kernel.org/r/20180813061623.GC64836@rodete-desktop-imager.corp.google.com
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Kalauskas
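
    For illustration, a small userspace re-implementation of the strlcpy()
    semantics (copy at most size - 1 bytes, always NUL-terminate) showing why
    passing the source length "worked" with a trailing newline and breaks with
    "echo -n"; this is a standalone sketch, not the kernel code:

    #include <stdio.h>
    #include <string.h>

    /* Minimal strlcpy(): copies at most size - 1 bytes and NUL-terminates. */
    static size_t my_strlcpy(char *dst, const char *src, size_t size)
    {
            size_t srclen = strlen(src);

            if (size) {
                    size_t n = srclen < size - 1 ? srclen : size - 1;

                    memcpy(dst, src, n);
                    dst[n] = '\0';
            }
            return srclen;
    }

    int main(void)
    {
            char file_name[64];
            const char *buf = "/dev/sdX\n";  /* plain echo: trailing newline */

            /* Passing the source length copies len - 1 bytes, i.e. everything
             * but the '\n' -- accidentally correct. */
            my_strlcpy(file_name, buf, strlen(buf));
            printf("'%s'\n", file_name);     /* '/dev/sdX' */

            buf = "/dev/sdX";                /* echo -n: no newline */
            my_strlcpy(file_name, buf, strlen(buf));
            printf("'%s'\n", file_name);     /* '/dev/sd' -- last byte lost */
            return 0;
    }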
     
  • Currently, percpu memory only exposes allocation and utilization
    information via debugfs. This more or less is only really useful for
    understanding the fragmentation and allocation information at a per-chunk
    level with a few global counters. This is also gated behind a config.
    BPF and cgroup, for example, have seen an increase in use causing
    increased use of percpu memory. Let's make it easier for someone to
    identify how much memory is being used.

    This patch adds the "Percpu" stat to meminfo to more easily look up how
    much percpu memory is in use (a minimal reader sketch follows at the end
    of this entry). This number includes the cost for all allocated backing
    pages, not just insight at the per-unit, per-chunk level. Metadata is
    excluded. I think excluding metadata is fair because the backing memory
    scales with the number of CPUs and can quickly outweigh the metadata. It
    also makes this calculation light.

    Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Zhou (Facebook)
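
    A minimal sketch of reading the new counter from userspace (the "Percpu"
    field name comes from this patch; everything else here is illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/meminfo", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                    /* Printed in kB, like the other meminfo counters. */
                    if (strncmp(line, "Percpu:", 7) == 0) {
                            fputs(line, stdout);
                            break;
                    }
            }
            fclose(f);
            return 0;
    }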
     
  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill all outstanding
    processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for the cgroup v2 memory controller:
    memory.oom.group (a short usage sketch follows at the end of this entry).
    The knob determines whether the cgroup should be treated as an indivisible
    workload by the OOM killer. If set, all tasks belonging to the cgroup or
    to its descendants (if the memory cgroup is not a leaf cgroup) are killed
    together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
    looking for the highest-level cgroup with memory.oom.group set.

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
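
    A short usage sketch for the new knob, assuming a cgroup v2 hierarchy
    mounted at /sys/fs/cgroup and an already-created cgroup named "workload"
    (both of which are assumptions for illustration):

    #include <stdio.h>

    int main(void)
    {
            /* Assumed path: cgroup v2 at /sys/fs/cgroup, cgroup "workload". */
            const char *knob = "/sys/fs/cgroup/workload/memory.oom.group";
            FILE *f = fopen(knob, "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* "1": on OOM, kill every task in this cgroup (and its
             * descendants) together, or none at all. */
            fputs("1\n", f);
            fclose(f);
            return 0;
    }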
     
  • Patch series "introduce memory.oom.group", v2.

    This is a tiny implementation of cgroup-aware OOM killer, which adds an
    ability to kill a cgroup as a single unit and so guarantee the integrity
    of the workload.

    Although it has only limited functionality in comparison to what now
    resides in the mm tree (it doesn't change the victim task selection
    algorithm, doesn't look at memory stats on the cgroup level, etc), it's
    also much simpler and more straightforward. So, hopefully, we can avoid
    having long debates here, as we had with the full implementation.

    As it doesn't prevent any further development, and implements a useful and
    complete feature, it looks like a sane way forward.

    This patch (of 2):

    oom_kill_process() consists of two logical parts: the first one is
    responsible for considering the task's children as a potential victim and
    for printing the debug information. The second half is responsible for
    sending SIGKILL to all tasks sharing the mm struct with the given victim.

    This commit splits oom_kill_process() with the intention to reuse the
    second half: __oom_kill_process().

    The cgroup-aware OOM killer will kill multiple tasks belonging to the
    victim cgroup. We don't need to print the debug information for each
    task, nor play with task selection (considering a task's children), so we
    can't use the existing oom_kill_process().

    Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
    Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • As with many other projects, we use some shmalloc allocator. At some
    point we need to make a part of the allocated pages private to the process
    again, and they should be populated straight away. Check that
    (MAP_PRIVATE | MAP_POPULATE) actually copies the private page (a
    simplified sketch follows at the end of this entry).

    [akpm@linux-foundation.org: change message, per review discussion]
    Link: http://lkml.kernel.org/r/20180801233636.29354-1-dima@arista.com
    Signed-off-by: Dmitry Safonov
    Reviewed-by: Andrew Morton
    Cc: Dmitry Safonov
    Cc: Hua Zhong
    Cc: Shuah Khan
    Cc: Stuart Ritchie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
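
    A single-process simplification of the idea (the actual selftest uses
    fork() and a parent/child pair; the values here are illustrative):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t size = sysconf(_SC_PAGESIZE);
            FILE *f = tmpfile();
            unsigned int *shared, *priv;

            if (!f || ftruncate(fileno(f), size)) {
                    perror("tmpfile/ftruncate");
                    return 1;
            }

            /* Shared view, standing in for the shmalloc arena. */
            shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fileno(f), 0);
            if (shared == MAP_FAILED) {
                    perror("mmap shared");
                    return 1;
            }
            *shared = 0xdeadbabe;

            /* Private, pre-populated view of the same pages. */
            priv = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_POPULATE, fileno(f), 0);
            if (priv == MAP_FAILED) {
                    perror("mmap private");
                    return 1;
            }

            /* Change the backing page through the shared view... */
            *shared = 0xabcdefab;

            /* ...the populated private copy must keep the old value. */
            if (*priv != 0xdeadbabe) {
                    fprintf(stderr, "MAP_PRIVATE | MAP_POPULATE did not copy the page\n");
                    return 1;
            }
            return 0;
    }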
     
  • Currently, whenever a new node is created/re-used from the memhotplug
    path, we call free_area_init_node()->free_area_init_core(). But there is
    some code that we do not really need to run when we are coming from such
    path.

    free_area_init_core() performs the following actions:

    1) Initializes pgdat internals, such as spinlock, waitqueues and more.
    2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
    when creating hash tables.
    3) Account number of managed_pages per zone, subtracting dma_reserved and
    memmap pages.
    4) Initializes some fields of the zone structure data
    5) Calls init_currently_empty_zone to initialize all the freelists
    6) Calls memmap_init to initialize all pages belonging to certain zone

    When called from memhotplug path, free_area_init_core() only performs
    actions #1 and #4.

    Action #2 is pointless as the zones do not have any pages since either the
    node was freed, or we are re-using it; either way, all zones belonging to
    this node should have 0 pages. For the same reason, action #3 always
    results in managed_pages being 0.

    Action #5 and #6 are performed later on when onlining the pages:
    online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
    online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

    This patch does two things:

    First, it moves the node/zone initialization to their own functions, which
    allows us to create a small version of free_area_init_core, where we only
    perform:

    1) Initialization of pgdat internals, such as spinlock, waitqueues and more
    4) Initialization of some fields of the zone structure data

    These two functions are: pgdat_init_internals() and zone_init_internals().

    The second thing this patch does, is to introduce
    free_area_init_core_hotplug(), the memhotplug version of
    free_area_init_core():

    Currently, we call free_area_init_node() from the memhotplug path. In
    there, we set some pgdat's fields, and call calculate_node_totalpages().
    calculate_node_totalpages() calculates the # of pages the node has.

    Since the node is either new, or we are re-using it, the zones belonging
    to this node should not have any pages, so there is no point to calculate
    this now.

    Actually, we re-set these values to 0 later on with the calls to:

    reset_node_managed_pages()
    reset_node_present_pages()

    The # of pages per node and the # of pages per zone will be calculated when
    onlining the pages:

    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

    Also, since free_area_init_core/free_area_init_node will now only get
    called during early init, let us replace __paginginit with __init, so
    their code gets freed up.

    [osalvador@techadventures.net: fix section usage]
    Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
    [osalvador@suse.de: v6]
    Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
    Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • Let us move the code guarded by CONFIG_DEFERRED_STRUCT_PAGE_INIT into an
    inline function. Not having an ifdef in the function makes the code more
    readable.

    Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • __paginginit is the same thing as __meminit, except on platforms without
    sparsemem, where it is defined as __init.

    Remove __paginginit and use __meminit. Use __ref in one single function
    that merges __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
    have inline functions to access this field in order to avoid ifdefs in .c
    files.

    Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "Refactor free_area_init_core and add
    free_area_init_core_hotplug", v6.

    This patchset does three things:

    1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
    2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
    3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from the memhotplug code path.
    In this way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

    This patch (of 5):

    Moving the #ifdefs out of the function makes it easier to follow.

    Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • is_dev_zone() is using zone_id() to check if the zone is ZONE_DEVICE.
    zone_id() looks pretty much the same as zone_idx(), and while the use of
    zone_idx() is quite widespread in the kernel, zone_id() is only being used
    by is_dev_zone().

    This patch removes zone_id() and makes is_dev_zone() use zone_idx() to
    check the zone, so we do not have two things with the same functionality
    around.

    Link: http://lkml.kernel.org/r/20180730133718.28683-1-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • __vm_enough_memory has moved to mm/util.c.

    Link: http://lkml.kernel.org/r/E18EDF4A4FA4A04BBFA824B6D7699E532A7E5913@EXMBX-SZMAIL013.tencent.com
    Signed-off-by: Juvi Liu
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    juviliu
     
  • Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
    to collect the stats while cgroup-v2's memory_stat_show traverses the
    memcg tree thrice. On a large machine, a couple thousand memcgs is very
    normal, and if the churn is high and memcgs stick around due to several
    reasons, tens of thousands of nodes can exist in the memcg tree. This
    patch refactors and shares the stat collection code between cgroup-v1 and
    cgroup-v2 and reduces the tree traversal to just one.

    I ran a simple benchmark which reads the root_mem_cgroup's stat file
    1000 times in the presence of 2500 memcgs on cgroup-v1 (a rough sketch of
    such a reader follows at the end of this entry). The results are:

    Without the patch:
    $ time ./read-root-stat-1000-times

    real 0m1.663s
    user 0m0.000s
    sys 0m1.660s

    With the patch:
    $ time ./read-root-stat-1000-times

    real 0m0.468s
    user 0m0.000s
    sys 0m0.467s

    Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Bruce Merry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
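
    The benchmark program is not part of the patch; a rough sketch of what
    such a reader could look like (the mount point /sys/fs/cgroup/memory is an
    assumption):

    #include <stdio.h>

    int main(void)
    {
            /* Assumed cgroup-v1 memory controller mount point. */
            const char *path = "/sys/fs/cgroup/memory/memory.stat";
            char buf[8192];

            for (int i = 0; i < 1000; i++) {
                    FILE *f = fopen(path, "r");

                    if (!f) {
                            perror("fopen");
                            return 1;
                    }
                    /* Read and discard; each full read triggers the stat
                     * collection in the kernel. */
                    while (fread(buf, 1, sizeof(buf), f) > 0)
                            ;
                    fclose(f);
            }
            return 0;
    }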
     
  • page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments were not updated
    accordingly.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     
  • The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
    be enabled explicitly. This updates the documentation for that and adds a
    note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
    While here, change the description of CONFIG_PAGE_POISONING_ZERO too, as
    it's not "random" data, but rather the fixed debugging value that would be
    used when not zeroing. Additionally, remove a stray "bool" from the
    Kconfig.

    Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
    Signed-off-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Laura Abbott
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Rather than in vm_area_alloc(), to ensure that the various oddball
    stack-based vmas are in a good state. Some of the callers were zeroing
    them out, others were not.

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The kernel-doc for mempool_init function is missing the description of the
    pool parameter. Add it.

    Link: http://lkml.kernel.org/r/1532336274-26228-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The /proc/pid/smaps_rollup file is currently implemented via the
    m_start/m_next/m_stop seq_file iterators shared with the other maps files
    that iterate over vmas. However, the rollup file doesn't print anything
    for each vma; it only accumulates the stats.

    There are some issues with the current code as reported in [1] - the
    accumulated stats can get skewed if seq_file start()/stop() op is called
    multiple times, if show() is called multiple times, and after seeks to
    non-zero position.

    Patch [1] fixed those within existing design, but I believe it is
    fundamentally wrong to expose the vma iterators to the seq_file mechanism
    when smaps_rollup shows logically a single set of values for the whole
    address space.

    This patch thus refactors the code to provide a single "value" at offset
    0, with vma iteration to gather the stats done internally. This fixes the
    situations where results are skewed, and simplifies the code, especially
    in show_smap(), at the expense of somewhat less code reuse.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    [vbabka@suse.cz: use seq_file infrastructure]
    Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Daniel Colascione
    Reviewed-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps, factor out from show_smap() the printing of the parts of
    the output that are common to both variants, which is the bulk of the
    gathered memory stats.

    [vbabka@suse.cz: add const, per Alexey]
    Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps, factor out the vma memory stats gathering from
    show_smap() - it will be used by both.

    Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "cleanups and refactor of /proc/pid/smaps*".

    The recent regression in /proc/pid/smaps made me look more into the code.
    Especially the issues with smaps_rollup reported in [1] as explained in
    Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
    preparations for that. Patch 1 is me realizing that there's a lot of
    boilerplate left from the times when we tried (unsuccessfully) to mark
    thread stacks in the output.

    Originally I had also plans to rework the translation from
    /proc/pid/*maps* file offsets to the internal structures. Now the offset
    means "vma number", which is not really stable (vma's can come and go
    between read() calls) and there's an extra caching of last vma's address.
    My idea was that offsets would be interpreted directly as addresses, which
    would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
    tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
    long so that might be insufficient somewhere for the unsigned long
    addresses.

    So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
    simpler smaps code, and a lot of unused code removed.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    This patch (of 4):

    Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") introduced differences between /proc/PID/maps and
    /proc/PID/task/TID/maps to mark thread stacks properly, and this was
    also done for smaps and numa_maps. However it didn't work properly and
    was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to
    report thread stacks").

    Now the is_pid parameter for the related show_*() functions is unused
    and we can remove it together with wrapper functions and ops structures
    that differ for PID and TID cases only in this parameter.

    Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably

    - Undocumented return value.

    - comment "failed to reap part..." is misleading - sounds like it's
    referring to something which happened in the past, is in fact
    referring to something which might happen in the future.

    - fails to call trace_finish_task_reaping() in one case

    - code duplication.

    - Increases mmap_sem hold time a little by moving
    trace_finish_task_reaping() inside the locked region. So sue me ;)

    - Sharing the finish: path means that the trace event won't
    distinguish between the two sources of finishing.

    Add a short explanation for the return value and fix the rest by
    reorganizing the function a bit to have unified function exit paths.

    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Suggested-by: Andrew Morton
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The default page memory unit of OOM task dump events might not be
    intuitive and potentially misleading for the non-initiated when debugging
    OOM events: these are pages, not kBs. Add a small printk prior to the
    task dump informing that the memory units are actually memory _pages_.

    Also extend the PID field to align on up to 7 characters.
    Reference https://lkml.org/lkml/2018/7/3/1201

    Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
    Signed-off-by: Rodrigo Freire
    Acked-by: David Rientjes
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rodrigo Freire
     
  • oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
    close race with exiting task"). We do not really need the lock anymore
    though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
    concurrently") has removed serialization with the exit path based on the
    mm reference count and so we do not really rely on the oom_lock anymore.

    Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
    to prevent races when the page allocator didn't manage to get the freed
    (reaped) memory in __alloc_pages_may_oom but sees the flag later on and
    moves on to another victim. Although this is possible in principle, let's
    wait for it to actually happen in real life before we make the locking
    more complex again.

    Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
    oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
    MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
    now.

    [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks,
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space,
    and there is absolutely zero reason to fail when we are unmapping an
    unrelated range. Many notifiers do really block and wait for HW, which is
    harder to handle, and we have to bail out, though.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the OOM. This can be done e.g. after the test faults in all
    the mmu-notifier-managed memory and sets the hard limit to something
    really small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In this patch, locking-related code is shared between the huge/normal code
    paths in put_swap_page() to reduce code duplication. The `free_entries == 0`
    case is merged into the more general `free_entries != SWAPFILE_CLUSTER`
    case, because the new locking method makes it easy.

    The number of added lines is the same as the number of removed lines, but
    the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n.

    text data bss dec hex filename
    base: 24123 2004 340 26467 6763 mm/swapfile.o
    unified: 24485 2004 340 26829 68cd mm/swapfile.o

    Digging one step deeper with `size -A mm/swapfile.o` for the base and
    unified kernels and comparing the results yields:

    -.text 17723 0
    +.text 17835 0
    -.orc_unwind_ip 1380 0
    +.orc_unwind_ip 1480 0
    -.orc_unwind 2070 0
    +.orc_unwind 2220 0
    -Total 26686
    +Total 27048

    The total difference is the same. The text segment difference is much
    smaller: 112. More difference comes from the ORC unwinder segments:
    (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is
    used, this costs nothing.

    Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The part of __swap_entry_free() executed with the lock held is separated
    into a new function, __swap_entry_free_locked(), because we want to reuse
    that piece of code in some other places.

    This is just mechanical code refactoring; there is no functional change in
    this function.

    Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • As suggested by Matthew Wilcox, it is better to use "int entry_size"
    instead of "bool cluster" as the parameter to specify whether to operate
    on huge or normal swap entries, because this improves the flexibility to
    support other swap entry sizes. And Dave Hansen thinks that this improves
    code readability too.

    So in this patch, the "bool cluster" parameter of get_swap_pages() is
    replaced by "int entry_size".

    The nr_swap_entries() trick is used to reduce the binary size when
    !CONFIG_TRANSPARENT_HUGEPAGE.

    text data bss dec hex filename
    base 24215 2028 340 26583 67d7 mm/swapfile.o
    head 24123 2004 340 26467 6763 mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Matthew Wilcox
    Acked-by: Dave Hansen
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In this patch, the normal/huge code path in put_swap_page() and several
    helper functions are unified to avoid duplicated code, bugs, etc. and
    make it easier to review the code.

    More lines are removed than added, and the binary size is kept exactly
    the same when CONFIG_TRANSPARENT_HUGEPAGE=n.

    Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Dave Hansen
    Acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying