23 Aug, 2018

40 commits

  • The number of CPUs is never high enough to force 64-bit arithmetic.
    Save a couple of bytes on x86_64.

    Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • ->latency_record is defined as

    struct latency_record[LT_SAVECOUNT];

    so use the same macro while iterating.

    Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The code checks whether the write is done by current to its own attributes.
    For that, a get/put pair is unnecessary as the check can be done under RCU.

    Note: rcu_read_unlock() can be done even earlier since the pointer to the
    task is not dereferenced. It depends on whether /proc code should look
    scary or not:

    rcu_read_lock();
    task = pid_task(...);
    rcu_read_unlock();
    if (!task)
            return -ESRCH;
    if (task != current)
            return -EACCES;

    P.S.: rename the "length" variable. Code like this

    length = -EINVAL;

    should not exist.

    Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Readdir context is thread-local, so ->pos is thread-local;
    move it outside of the read lock.

    Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Same story: I have a WIP patch to make it faster, so it's better to have a
    test as well.

    Link: http://lkml.kernel.org/r/20180627195209.GC18113@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There are plans to change how the /proc/self result is calculated;
    for that, a test is necessary.

    Use a direct system call because of this whole getpid() caching story (a
    minimal sketch of such a test follows at the end of this entry).

    Link: http://lkml.kernel.org/r/20180627195103.GB18113@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
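
    A minimal sketch of what such a test could look like, assuming only the
    standard syscall(2) and readlink(2) interfaces (this is an illustration,
    not the actual selftest code):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
            char link[64], expected[64];
            ssize_t len;
            /* Bypass any libc getpid() caching by issuing the syscall directly. */
            pid_t pid = (pid_t)syscall(SYS_getpid);

            /* /proc/self is a symlink to the tgid of the calling process. */
            len = readlink("/proc/self", link, sizeof(link) - 1);
            if (len < 0) {
                    perror("readlink");
                    return 1;
            }
            link[len] = '\0';

            snprintf(expected, sizeof(expected), "%d", (int)pid);
            if (strcmp(link, expected) != 0) {
                    fprintf(stderr, "/proc/self = %s, expected %s\n", link, expected);
                    return 1;
            }
            return 0;
    }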
     
  • get_monotonic_boottime() is deprecated and uses the old timespec type.
    Let's convert /proc/uptime to use ktime_get_boottime_ts64().

    Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Deepa Dinamani
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • 24074a35c5c975 ("proc: Make inline name size calculation automatic")
    started to put PDE allocations into kmalloc-256, which is unnecessary as
    ~40-character names are very rare.

    Put the allocation back into the kmalloc-192 cache for 64-bit non-debug
    builds.

    Add a BUILD_BUG_ON to know when the PDE size has gotten out of control.

    [adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
    Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
    Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: Al Viro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
    NODES_SHIFT is higher than 8; otherwise it declares it on the stack.

    The comment says that the reasoning behind this is that nodemask_t will be
    256 bytes when NODES_SHIFT is higher than 8, but this is not true. For
    example, NODES_SHIFT = 9 will give us a 64-byte nodemask_t (a sketch of
    the arithmetic follows at the end of this entry). Let us fix up the
    comment for that.

    Another thing is that it might make sense to let values lower than
    128 bytes be allocated on the stack. Although this all depends on the
    depth of the stack (and this changes from function to function), I think
    that 64 bytes is something we can easily afford. So we could even bump
    the limit by 1 (from > 8 to > 9).

    Link: http://lkml.kernel.org/r/20180820085516.9687-1-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
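
    As a quick check of the arithmetic above, a tiny userspace sketch assuming
    the usual bitmap layout of one unsigned long per BITS_PER_LONG bits (which
    is how nodemask_t is laid out):

    #include <stdio.h>
    #include <limits.h>

    /* Bytes taken by a bitmap of (1 << nodes_shift) bits, rounded up to
     * whole unsigned longs. */
    static unsigned long nodemask_bytes(unsigned int nodes_shift)
    {
            unsigned long bits = 1UL << nodes_shift;
            unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;

            return (bits + bits_per_long - 1) / bits_per_long * sizeof(unsigned long);
    }

    int main(void)
    {
            /* On 64-bit: shift 8 -> 32 bytes, 9 -> 64 bytes, 11 -> 256 bytes. */
            for (unsigned int shift = 8; shift <= 11; shift++)
                    printf("NODES_SHIFT=%u -> %lu bytes\n", shift, nodemask_bytes(shift));
            return 0;
    }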
     
  • The call to strlcpy in backing_dev_store is incorrect. It should take
    the size of the destination buffer instead of the size of the source
    buffer. Additionally, ignore the newline character (\n) when reading
    the new file_name buffer. This makes it possible to set the backing_dev
    as follows:

    echo /dev/sdX > /sys/block/zram0/backing_dev

    The reason it worked before was the fact that strlcpy() copies 'len - 1'
    bytes, which is strlen(buf) - 1 in our case, so it accidentally didn't
    copy the trailing newline symbol. That also means that "echo -n
    /dev/sdX" was most likely broken (a small userspace demonstration follows
    at the end of this entry).

    Signed-off-by: Peter Kalauskas
    Link: http://lkml.kernel.org/r/20180813061623.GC64836@rodete-desktop-imager.corp.google.com
    Acked-by: Minchan Kim
    Reviewed-by: Sergey Senozhatsky
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Kalauskas
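
    For illustration, a small userspace re-implementation of the strlcpy()
    semantics (copy at most size - 1 bytes, always NUL-terminate) showing why
    passing the source length "worked" with a trailing newline and breaks with
    "echo -n"; this is a standalone sketch, not the kernel code:

    #include <stdio.h>
    #include <string.h>

    /* Minimal strlcpy(): copies at most size - 1 bytes and NUL-terminates. */
    static size_t my_strlcpy(char *dst, const char *src, size_t size)
    {
            size_t srclen = strlen(src);

            if (size) {
                    size_t n = srclen < size - 1 ? srclen : size - 1;

                    memcpy(dst, src, n);
                    dst[n] = '\0';
            }
            return srclen;
    }

    int main(void)
    {
            char file_name[64];
            const char *buf = "/dev/sdX\n";  /* plain echo: trailing newline */

            /* Passing the source length copies len - 1 bytes, i.e. everything
             * but the '\n' -- accidentally correct. */
            my_strlcpy(file_name, buf, strlen(buf));
            printf("'%s'\n", file_name);     /* '/dev/sdX' */

            buf = "/dev/sdX";                /* echo -n: no newline */
            my_strlcpy(file_name, buf, strlen(buf));
            printf("'%s'\n", file_name);     /* '/dev/sd' -- last byte lost */
            return 0;
    }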
     
  • Currently, percpu memory only exposes allocation and utilization
    information via debugfs. This more or less is only really useful for
    understanding the fragmentation and allocation information at a per-chunk
    level with a few global counters. This is also gated behind a config.
    BPF and cgroup, for example, have seen an increase in use causing
    increased use of percpu memory. Let's make it easier for someone to
    identify how much memory is being used.

    This patch adds the "Percpu" stat to meminfo to more easily look up how
    much percpu memory is in use (a minimal reader sketch follows at the end
    of this entry). This number includes the cost for all allocated backing
    pages, not just insight at the per-unit, per-chunk level. Metadata is
    excluded. I think excluding metadata is fair because the backing memory
    scales with the number of CPUs and can quickly outweigh the metadata. It
    also makes this calculation light.

    Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Zhou (Facebook)
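
    A minimal sketch of reading the new counter from userspace (the "Percpu"
    field name comes from this patch; everything else here is illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/meminfo", "r");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                    /* Printed in kB, like the other meminfo counters. */
                    if (strncmp(line, "Percpu:", 7) == 0) {
                            fputs(line, stdout);
                            break;
                    }
            }
            fclose(f);
            return 0;
    }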
     
  • For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    Historically, there are two common solutions for this problem:
    1) enabling panic_on_oom,
    2) using a userspace daemon to monitor OOMs and kill all outstanding
    processes.

    Both approaches have their downsides: rebooting on each OOM is an obvious
    waste of capacity, and handling all in userspace is tricky and requires a
    userspace agent, which will monitor all cgroups for OOMs.

    In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
    the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
    management for userspace applications.

    This commit introduces a new knob for the cgroup v2 memory controller:
    memory.oom.group (a short usage sketch follows at the end of this entry).
    The knob determines whether the cgroup should be treated as an indivisible
    workload by the OOM killer. If set, all tasks belonging to the cgroup or
    to its descendants (if the memory cgroup is not a leaf cgroup) are killed
    together or not at all.

    To determine which cgroup has to be killed, we traverse the cgroup
    hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
    looking for the highest-level cgroup with memory.oom.group set.

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
    an exception and are never killed.

    This patch doesn't change the OOM victim selection algorithm.

    Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
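
    A short usage sketch for the new knob, assuming a cgroup v2 hierarchy
    mounted at /sys/fs/cgroup and an already-created cgroup named "workload"
    (both of which are assumptions for illustration):

    #include <stdio.h>

    int main(void)
    {
            /* Assumed path: cgroup v2 at /sys/fs/cgroup, cgroup "workload". */
            const char *knob = "/sys/fs/cgroup/workload/memory.oom.group";
            FILE *f = fopen(knob, "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* "1": on OOM, kill every task in this cgroup (and its
             * descendants) together, or none at all. */
            fputs("1\n", f);
            fclose(f);
            return 0;
    }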
     
  • Patch series "introduce memory.oom.group", v2.

    This is a tiny implementation of cgroup-aware OOM killer, which adds an
    ability to kill a cgroup as a single unit and so guarantee the integrity
    of the workload.

    Although it has only limited functionality in comparison to what now
    resides in the mm tree (it doesn't change the victim task selection
    algorithm, doesn't look at memory stats on the cgroup level, etc), it's
    also much simpler and more straightforward. So, hopefully, we can avoid
    having long debates here, as we had with the full implementation.

    As it doesn't prevent any further development, and implements a useful and
    complete feature, it looks like a sane way forward.

    This patch (of 2):

    oom_kill_process() consists of two logical parts: the first one is
    responsible for considering the task's children as a potential victim and
    for printing the debug information. The second half is responsible for
    sending SIGKILL to all tasks sharing the mm struct with the given victim.

    This commit splits oom_kill_process() with the intention to reuse the
    second half: __oom_kill_process().

    The cgroup-aware OOM killer will kill multiple tasks belonging to the
    victim cgroup. We don't need to print the debug information for each
    task, nor play with task selection (considering a task's children), so we
    can't use the existing oom_kill_process().

    Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
    Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Cc: Vladimir Davydov
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • As with many other projects, we use some shmalloc allocator. At some
    point we need to make a part of the allocated pages private to the process
    again, and they should be populated straight away. Check that
    (MAP_PRIVATE | MAP_POPULATE) actually copies the private page (a
    simplified sketch follows at the end of this entry).

    [akpm@linux-foundation.org: change message, per review discussion]
    Link: http://lkml.kernel.org/r/20180801233636.29354-1-dima@arista.com
    Signed-off-by: Dmitry Safonov
    Reviewed-by: Andrew Morton
    Cc: Dmitry Safonov
    Cc: Hua Zhong
    Cc: Shuah Khan
    Cc: Stuart Ritchie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
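
    A single-process simplification of the idea (the actual selftest uses
    fork() and a parent/child pair; the values here are illustrative):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t size = sysconf(_SC_PAGESIZE);
            FILE *f = tmpfile();
            unsigned int *shared, *priv;

            if (!f || ftruncate(fileno(f), size)) {
                    perror("tmpfile/ftruncate");
                    return 1;
            }

            /* Shared view, standing in for the shmalloc arena. */
            shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fileno(f), 0);
            if (shared == MAP_FAILED) {
                    perror("mmap shared");
                    return 1;
            }
            *shared = 0xdeadbabe;

            /* Private, pre-populated view of the same pages. */
            priv = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_POPULATE, fileno(f), 0);
            if (priv == MAP_FAILED) {
                    perror("mmap private");
                    return 1;
            }

            /* Change the backing page through the shared view... */
            *shared = 0xabcdefab;

            /* ...the populated private copy must keep the old value. */
            if (*priv != 0xdeadbabe) {
                    fprintf(stderr, "MAP_PRIVATE | MAP_POPULATE did not copy the page\n");
                    return 1;
            }
            return 0;
    }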
     
  • Currently, whenever a new node is created/re-used from the memhotplug
    path, we call free_area_init_node()->free_area_init_core(). But there is
    some code that we do not really need to run when we are coming from such
    path.

    free_area_init_core() performs the following actions:

    1) Initializes pgdat internals, such as spinlock, waitqueues and more.
    2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
    when creating hash tables.
    3) Account number of managed_pages per zone, subtracting dma_reserved and
    memmap pages.
    4) Initializes some fields of the zone structure data
    5) Calls init_currently_empty_zone to initialize all the freelists
    6) Calls memmap_init to initialize all pages belonging to certain zone

    When called from memhotplug path, free_area_init_core() only performs
    actions #1 and #4.

    Action #2 is pointless as the zones do not have any pages since either the
    node was freed, or we are re-using it; either way, all zones belonging to
    this node should have 0 pages. For the same reason, action #3 always
    results in managed_pages being 0.

    Action #5 and #6 are performed later on when onlining the pages:
    online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
    online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

    This patch does two things:

    First, it moves the node/zone initialization to their own functions, which
    allows us to create a small version of free_area_init_core, where we only
    perform:

    1) Initialization of pgdat internals, such as spinlock, waitqueues and more
    4) Initialization of some fields of the zone structure data

    These two functions are: pgdat_init_internals() and zone_init_internals().

    The second thing this patch does, is to introduce
    free_area_init_core_hotplug(), the memhotplug version of
    free_area_init_core():

    Currently, we call free_area_init_node() from the memhotplug path. In
    there, we set some pgdat's fields, and call calculate_node_totalpages().
    calculate_node_totalpages() calculates the # of pages the node has.

    Since the node is either new, or we are re-using it, the zones belonging
    to this node should not have any pages, so there is no point to calculate
    this now.

    Actually, we re-set these values to 0 later on with the calls to:

    reset_node_managed_pages()
    reset_node_present_pages()

    The # of pages per node and the # of pages per zone will be calculated when
    onlining the pages:

    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

    Also, since free_area_init_core/free_area_init_node will now only get
    called during early init, let us replace __paginginit with __init, so
    their code gets freed up.

    [osalvador@techadventures.net: fix section usage]
    Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
    [osalvador@suse.de: v6]
    Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
    Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • Let us move the code guarded by CONFIG_DEFERRED_STRUCT_PAGE_INIT into an
    inline function. Not having an ifdef in the function makes the code more
    readable.

    Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • __paginginit is the same thing as __meminit, except on platforms without
    sparsemem, where it is defined as __init.

    Remove __paginginit and use __meminit. Use __ref in one single function
    that merges __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
    have inline functions to access this field in order to avoid ifdefs in .c
    files.

    Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "Refactor free_area_init_core and add
    free_area_init_core_hotplug", v6.

    This patchset does three things:

    1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
    2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
    3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from the memhotplug code path.
    In this way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

    This patch (of 5):

    Moving the #ifdefs out of the function makes it easier to follow.

    Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • is_dev_zone() is using zone_id() to check if the zone is ZONE_DEVICE.
    zone_id() looks pretty much the same as zone_idx(), and while the use of
    zone_idx() is quite widespread in the kernel, zone_id() is only being used
    by is_dev_zone().

    This patch removes zone_id() and makes is_dev_zone() use zone_idx() to
    check the zone, so we do not have two things with the same functionality
    around.

    Link: http://lkml.kernel.org/r/20180730133718.28683-1-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • __vm_enough_memory has moved to mm/util.c.

    Link: http://lkml.kernel.org/r/E18EDF4A4FA4A04BBFA824B6D7699E532A7E5913@EXMBX-SZMAIL013.tencent.com
    Signed-off-by: Juvi Liu
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    juviliu
     
  • Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
    to collect the stats while cgroup-v2's memory_stat_show traverses the
    memcg tree thrice. On a large machine, a couple thousand memcgs is very
    normal, and if the churn is high and memcgs stick around due to several
    reasons, tens of thousands of nodes can exist in the memcg tree. This
    patch refactors and shares the stat collection code between cgroup-v1 and
    cgroup-v2 and reduces the tree traversal to just one.

    I ran a simple benchmark which reads the root_mem_cgroup's stat file
    1000 times in the presence of 2500 memcgs on cgroup-v1 (a rough sketch of
    such a reader follows at the end of this entry). The results are:

    Without the patch:
    $ time ./read-root-stat-1000-times

    real 0m1.663s
    user 0m0.000s
    sys 0m1.660s

    With the patch:
    $ time ./read-root-stat-1000-times

    real 0m0.468s
    user 0m0.000s
    sys 0m0.467s

    Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Bruce Merry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
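
    The benchmark program is not part of the patch; a rough sketch of what
    such a reader could look like (the mount point /sys/fs/cgroup/memory is an
    assumption):

    #include <stdio.h>

    int main(void)
    {
            /* Assumed cgroup-v1 memory controller mount point. */
            const char *path = "/sys/fs/cgroup/memory/memory.stat";
            char buf[8192];

            for (int i = 0; i < 1000; i++) {
                    FILE *f = fopen(path, "r");

                    if (!f) {
                            perror("fopen");
                            return 1;
                    }
                    /* Read and discard; each full read triggers the stat
                     * collection in the kernel. */
                    while (fread(buf, 1, sizeof(buf), f) > 0)
                            ;
                    fclose(f);
            }
            return 0;
    }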
     
  • page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments were not updated
    accordingly.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     
  • The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
    be enabled explicitly. This updates the documentation for that and adds a
    note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
    While here, change the description of CONFIG_PAGE_POISONING_ZERO too, as
    it's not "random" data, but rather the fixed debugging value that would be
    used when not zeroing. Additionally, remove a stray "bool" from the
    Kconfig.

    Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
    Signed-off-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Laura Abbott
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Rather than in vm_area_alloc(), to ensure that the various oddball
    stack-based vmas are in a good state. Some of the callers were zeroing
    them out, others were not.

    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The kernel-doc for mempool_init function is missing the description of the
    pool parameter. Add it.

    Link: http://lkml.kernel.org/r/1532336274-26228-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The /proc/pid/smaps_rollup file is currently implemented via the
    m_start/m_next/m_stop seq_file iterators shared with the other maps files
    that iterate over vmas. However, the rollup file doesn't print anything
    for each vma; it only accumulates the stats.

    There are some issues with the current code as reported in [1] - the
    accumulated stats can get skewed if seq_file start()/stop() op is called
    multiple times, if show() is called multiple times, and after seeks to
    non-zero position.

    Patch [1] fixed those within existing design, but I believe it is
    fundamentally wrong to expose the vma iterators to the seq_file mechanism
    when smaps_rollup shows logically a single set of values for the whole
    address space.

    This patch thus refactors the code to provide a single "value" at offset
    0, with vma iteration to gather the stats done internally. This fixes the
    situations where results are skewed, and simplifies the code, especially
    in show_smap(), at the expense of somewhat less code reuse.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    [vbabka@suse.cz: use seq_file infrastructure]
    Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Daniel Colascione
    Reviewed-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps, factor out from show_smap() the printing of the parts of
    the output that are common to both variants, which is the bulk of the
    gathered memory stats.

    [vbabka@suse.cz: add const, per Alexey]
    Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps, factor out the vma memory stats gathering from
    show_smap() - it will be used by both.

    Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "cleanups and refactor of /proc/pid/smaps*".

    The recent regression in /proc/pid/smaps made me look more into the code.
    Especially the issues with smaps_rollup reported in [1] as explained in
    Patch 4, which fixes them by refactoring the code. Patches 2 and 3 are
    preparations for that. Patch 1 is me realizing that there's a lot of
    boilerplate left from the times when we tried (unsuccessfully) to mark
    thread stacks in the output.

    Originally I had also plans to rework the translation from
    /proc/pid/*maps* file offsets to the internal structures. Now the offset
    means "vma number", which is not really stable (vma's can come and go
    between read() calls) and there's an extra caching of last vma's address.
    My idea was that offsets would be interpreted directly as addresses, which
    would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
    tools/testing/selftests/vm/mlock2.h). However loff_t is (signed) long
    long so that might be insufficient somewhere for the unsigned long
    addresses.

    So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
    simpler smaps code, and a lot of unused code removed.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    This patch (of 4):

    Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") introduced differences between /proc/PID/maps and
    /proc/PID/task/TID/maps to mark thread stacks properly, and this was
    also done for smaps and numa_maps. However it didn't work properly and
    was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to
    report thread stacks").

    Now the is_pid parameter for the related show_*() functions is unused
    and we can remove it together with wrapper functions and ops structures
    that differ for PID and TID cases only in this parameter.

    Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably

    - Undocumented return value.

    - comment "failed to reap part..." is misleading - sounds like it's
    referring to something which happened in the past, is in fact
    referring to something which might happen in the future.

    - fails to call trace_finish_task_reaping() in one case

    - code duplication.

    - Increases mmap_sem hold time a little by moving
    trace_finish_task_reaping() inside the locked region. So sue me ;)

    - Sharing the finish: path means that the trace event won't
    distinguish between the two sources of finishing.

    Add a short explanation for the return value and fix the rest by
    reorganizing the function a bit to have unified function exit paths.

    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Suggested-by: Andrew Morton
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The default page memory unit of OOM task dump events might not be
    intuitive and potentially misleading for the non-initiated when debugging
    OOM events: these are pages, not kBs. Add a small printk prior to the
    task dump informing that the memory units are actually memory _pages_.

    Also extend the PID field to align on up to 7 characters.
    Reference https://lkml.org/lkml/2018/7/3/1201

    Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
    Signed-off-by: Rodrigo Freire
    Acked-by: David Rientjes
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rodrigo Freire
     
  • oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
    close race with exiting task"). We do not really need the lock anymore
    though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
    concurrently") has removed serialization with the exit path based on the
    mm reference count and so we do not really rely on the oom_lock anymore.

    Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
    to prevent races when the page allocator didn't manage to get the freed
    (reaped) memory in __alloc_pages_may_oom but sees the flag later on and
    moves on to another victim. Although this is possible in principle, let's
    wait for it to actually happen in real life before we make the locking
    more complex again.

    Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
    oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
    MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
    now.

    [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks,
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space,
    and there is absolutely zero reason to fail when we are unmapping an
    unrelated range. Many notifiers do really block and wait for HW, which is
    harder to handle, and we have to bail out, though.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the OOM. This can be done e.g. after the test faults in all
    the mmu-notifier-managed memory and sets the hard limit to something
    really small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In this patch, locking-related code is shared between the huge/normal code
    paths in put_swap_page() to reduce code duplication. The `free_entries == 0`
    case is merged into the more general `free_entries != SWAPFILE_CLUSTER`
    case, because the new locking method makes it easy.

    The number of added lines is the same as the number of removed lines, but
    the code size is increased when CONFIG_TRANSPARENT_HUGEPAGE=n.

    text data bss dec hex filename
    base: 24123 2004 340 26467 6763 mm/swapfile.o
    unified: 24485 2004 340 26829 68cd mm/swapfile.o

    Digging one step deeper with `size -A mm/swapfile.o` for the base and
    unified kernels and comparing the results yields:

    -.text 17723 0
    +.text 17835 0
    -.orc_unwind_ip 1380 0
    +.orc_unwind_ip 1480 0
    -.orc_unwind 2070 0
    +.orc_unwind 2220 0
    -Total 26686
    +Total 27048

    The total difference is the same. The text segment difference is much
    smaller: 112. More difference comes from the ORC unwinder segments:
    (1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is
    used, this costs nothing.

    Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The part of __swap_entry_free() executed with the lock held is separated
    into a new function, __swap_entry_free_locked(), because we want to reuse
    that piece of code in some other places.

    This is just mechanical code refactoring; there is no functional change in
    this function.

    Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Daniel Jordan
    Acked-by: Dave Hansen
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • As suggested by Matthew Wilcox, it is better to use "int entry_size"
    instead of "bool cluster" as the parameter to specify whether to operate
    on huge or normal swap entries, because this improves the flexibility to
    support other swap entry sizes. And Dave Hansen thinks that this improves
    code readability too.

    So in this patch, the "bool cluster" parameter of get_swap_pages() is
    replaced by "int entry_size".

    The nr_swap_entries() trick is used to reduce the binary size when
    !CONFIG_TRANSPARENT_HUGEPAGE.

    text data bss dec hex filename
    base 24215 2028 340 26583 67d7 mm/swapfile.o
    head 24123 2004 340 26467 6763 mm/swapfile.o

    Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Matthew Wilcox
    Acked-by: Dave Hansen
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In this patch, the normal/huge code path in put_swap_page() and several
    helper functions are unified to avoid duplicated code, bugs, etc. and
    make it easier to review the code.

    More lines are removed than added, and the binary size is kept exactly
    the same when CONFIG_TRANSPARENT_HUGEPAGE=n.

    Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Suggested-by: Dave Hansen
    Acked-by: Dave Hansen
    Reviewed-by: Daniel Jordan
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying