24 Feb, 2013

40 commits

  • We were deferring the kmemcg static branch increment to a later time,
    due to a nasty dependency between the cpu_hotplug lock, taken by the
    jump label update, and the cgroup_lock.

    Now we no longer take the cgroup lock, and we can save ourselves the
    trouble.
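
    A minimal sketch of the idea (the wrapper name memcg_activate_kmem_accounting
    is hypothetical; the static key and helper are the real kernel primitives):

    /* declared by the kmemcg code */
    extern struct static_key memcg_kmem_enabled_key;

    static void memcg_activate_kmem_accounting(struct mem_cgroup *memcg)
    {
            /*
             * Without cgroup_lock held here, bumping the jump label no longer
             * risks the cpu_hotplug lock vs. cgroup_lock ordering, so the
             * increment can happen immediately instead of being deferred.
             */
            static_key_slow_inc(&memcg_kmem_enabled_key);
    }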

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • After the preparation work done in earlier patches, the cgroup_lock can
    be trivially replaced with a memcg-specific lock. This is an automatic
    translation at every site where the values involved were queried.

    The sites where values are written, however, used to run naturally under
    cgroup_lock. This is the case, for instance, in the css_online
    callback. For those, we now need to take the memcg lock explicitly.

    With this, all the calls to cgroup_lock outside cgroup core are gone.
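
    A rough sketch of the shape of the change in a write-side handler (the
    lock name memcg_mutex and the memcg_has_children() helper are illustrative
    stand-ins for what the patch introduces):

    static DEFINE_MUTEX(memcg_mutex);   /* hypothetical name for the memcg-specific lock */

    static int mem_cgroup_hierarchy_write(struct cgroup *cont,
                                          struct cftype *cft, u64 val)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
            int ret = 0;

            mutex_lock(&memcg_mutex);       /* was: cgroup_lock() */
            if (memcg_has_children(memcg))  /* writes still require "no children" */
                    ret = -EBUSY;
            else
                    memcg->use_hierarchy = val;
            mutex_unlock(&memcg_mutex);     /* was: cgroup_unlock() */
            return ret;
    }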

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Currently, we use cgroups' provided list of children to verify if it is
    safe to proceed with any value change that is dependent on the cgroup
    being empty.

    This is less than ideal, because it enforces a dependency over cgroup
    core that we would be better off without. The solution proposed here is
    to iterate over the child cgroups and if any is found that is already
    online, we bounce and return: we don't really care how many children we
    have, only if we have any.

    This is also made to be hierarchy aware. IOW, cgroups with hierarchy
    disabled, while they still exist, will be considered for the purpose of
    this interface as having no children.
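
    A sketch of such a check, assuming the cgroup child iterator available at
    the time (the helper names are illustrative):

    static inline bool __memcg_has_children(struct mem_cgroup *memcg)
    {
            struct cgroup *pos;

            /*
             * Caller holds the memcg-specific lock introduced earlier, which
             * keeps the child list stable. Bounce at the first child found;
             * we only care whether there is any.
             */
            cgroup_for_each_child(pos, memcg->css.cgroup)
                    return true;
            return false;
    }

    static inline bool memcg_has_children(struct mem_cgroup *memcg)
    {
            /* with use_hierarchy disabled, children are invisible to this interface */
            return memcg->use_hierarchy && __memcg_has_children(memcg);
    }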

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • This patch is a preparatory work for later locking rework to get rid of
    big cgroup lock from memory controller code.

    The memory controller uses some tunables to adjust its operation. Those
    tunables are inherited from parent to children upon child
    initialization. For most of them, the value cannot be changed once the
    parent has any children.

    cgroup core splits initialization in two phases: css_alloc and css_online.
    After css_alloc, the memory allocation and basic initialization are done.
    But the new group is not yet visible anywhere, not even for cgroup core
    code. It is only somewhere between css_alloc and css_online that it is
    inserted into the internal children lists. Copying tunable values in
    css_alloc will lead to inconsistent values: the children will copy the old
    parent values, which can change between the copy and the moment in which
    the group is linked to any data structure that can indicate the presence
    of children.

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • In memcg, we use the cgroup_lock basically to synchronize against
    attaching new children to a cgroup. We do this because we rely on
    cgroup core to provide us with this information.

    We need to guarantee that upon child creation, our tunables are
    consistent. For those, the calls to cgroup_lock() all live in handlers
    like mem_cgroup_hierarchy_write(), where we change a tunable in the
    group that is hierarchy-related. For instance, the use_hierarchy flag
    cannot be changed if the cgroup already has children.

    Furthermore, those values are propagated from the parent to the child
    when a new child is created. So if we don't lock like this, we can end
    up with the following situation:

    A                                     B
    memcg_css_alloc()                     mem_cgroup_hierarchy_write()
      copy use hierarchy from parent        change use hierarchy in parent
    finish creation.

    This is mainly because during creation we are still not fully connected
    to the css tree, so all the iterators and such that we could use will
    fail to show that the group has children.

    My observation is that all of creation can proceed in parallel with
    those tasks, except value assignment. So what this patch series does is
    to first move all value assignment that is dependent on parent values
    from css_alloc to css_online, where the iterators all work, and then we
    lock only the value assignment. This will guarantee that parent and
    children always have consistent values. Together with an online test,
    which can be derived from the observation that the refcount of an online
    memcg can be made to be always positive, we should be able to
    synchronize our side without the cgroup lock.

    This patch:

    Currently, we rely on the cgroup_lock() to prevent changes to
    move_charge_at_immigrate during task migration. However, this is only
    needed because the current strategy keeps checking this value throughout
    the whole process. Since all we need is serialization, one needs only
    to guarantee that whatever decision we made in the beginning of a
    specific migration is respected throughout the process.

    We can achieve this by just saving it in mc. By doing this, no kind of
    locking is needed.
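
    The shape of the fix, sketched (the field name in the move-charge context
    "mc" is an approximation, not necessarily the exact name in the patch):

    /* global move-charge state used during a single task migration */
    static struct move_charge_struct {
            /* ... existing fields ... */
            unsigned long immigrate_flags;  /* snapshot taken when migration starts */
    } mc;

    static int mem_cgroup_can_attach(struct cgroup *cgroup,
                                     struct cgroup_taskset *tset)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);

            /*
             * Snapshot the tunable once; later stages consult
             * mc.immigrate_flags instead of re-reading
             * memcg->move_charge_at_immigrate, so no cgroup_lock is needed
             * to keep the value stable mid-migration.
             */
            mc.immigrate_flags = memcg->move_charge_at_immigrate;
            /* ... rest of can_attach ... */
            return 0;
    }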

    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Hiroyuki Kamezawa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • In order to maintain all the memcg bookkeeping, we need per-node
    descriptors, which will in turn contain a per-zone descriptor.

    Because we want to statically allocate those, this array ends up being
    very big. Part of the reason is that we allocate something large enough
    to hold MAX_NUMNODES, the compile time constant that holds the maximum
    number of nodes we would ever consider.

    However, we can do better in some cases if the firmware helps us. This
    is true for modern x86 machines; coincidentally, one of the architectures
    in which MAX_NUMNODES tends to be very big.

    By using the firmware-provided maximum number of nodes instead of
    MAX_NUMNODES, we can reduce the memory footprint of struct memcg
    considerably. In the extreme case in which we have only one node, this
    reduces the size of the structure from ~64k to ~2k. This is
    particularly important because it means that we will no longer resort to
    the vmalloc area for the struct memcg on defconfigs. We also have
    enough room for an extra node and still stay outside the vmalloc area.

    One also has to keep in mind that with the industry's ability to fit
    more processors in a die as fast as the FED prints money, a nodes = 2
    configuration is already respectably big.
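
    A sketch of the sizing trick (member and helper names are illustrative;
    nr_node_ids is the firmware-derived possible-node count):

    struct mem_cgroup {
            /* ... fixed-size members ... */
            /*
             * Sized at runtime from the firmware-reported node count rather
             * than MAX_NUMNODES; must remain the last member.
             */
            struct mem_cgroup_per_node *nodeinfo[0];
    };

    static size_t memcg_size(void)
    {
            return sizeof(struct mem_cgroup) +
                    nr_node_ids * sizeof(struct mem_cgroup_per_node *);
    }

    /* allocations then use memcg_size() instead of a MAX_NUMNODES-sized array */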

    [akpm@linux-foundation.org: add check for invalid nid, remove inline]
    Signed-off-by: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Greg Thelen
    Cc: Hugh Dickins
    Cc: Ying Han
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Answering the question "how much space remains in the page->flags" is
    time-consuming. mminit_loglevel can help answer the question but it
    does not take last_nid information into account. This patch corrects it
    and, while there, also corrects the messages related to page flag usage,
    pgshifts and node/zone id. When applied, the relevant output looks
    something like this, though it will depend on the kernel configuration:

    mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
    mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
    mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
    mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
    mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Andrew Morton pointed out that page_xchg_last_nid() and
    reset_page_last_nid() were "getting nuttily large" and asked that it be
    investigated.

    reset_page_last_nid() is on the page free path and it would be
    unfortunate to make that path more expensive than it needs to be. Due
    to the internal use of page_xchg_last_nid() it is already too expensive
    but fortunately, it should also be impossible for the page->flags to be
    updated in parallel when we call reset_page_last_nid(). Instead of
    uninlining the function, it uses a simpler implementation that assumes no
    parallel updates and should now be sufficiently short for inlining.

    page_xchg_last_nid() is called in paths that are already quite expensive
    (splitting huge page, fault handling, migration) and it is reasonable to
    uninline. There was not really a good place to place the function but
    mm/mmzone.c was the closest fit IMO.

    This patch saved 128 bytes of text in the vmlinux file for the kernel
    configuration I used for testing automatic NUMA balancing.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg swap accounting is currently enabled by enable_swap_cgroup when
    the root cgroup is created. mem_cgroup_init acts as a memcg subsystem
    initializer which sounds like a much better place for enable_swap_cgroup
    as well. We already register memsw files from there so it makes a lot
    of sense to merge those two into a single enable_swap_cgroup function.

    This patch doesn't introduce any semantic changes.

    Signed-off-by: Michal Hocko
    Cc: Zhouping Liu
    Cc: Kamezawa Hiroyuki
    Cc: David Rientjes
    Cc: Li Zefan
    Cc: CAI Qian
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Zhouping Liu has reported that memsw files are exported even though swap
    accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior
    has been introduced by commit af36f906c0f4 ("memcg: always create memsw
    files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
    the file to return EOPNOTSUPP. Although EOPNOTSUPP should make it clear
    that memsw operations are not supported in the given configuration, it is
    fair to say that this behavior could be quite confusing.

    Let's tear memsw files out of default cgroup files and add them only if
    the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
    swapaccount=1 boot parameter). We can hook into mem_cgroup_init which
    is called when the memcg subsystem is initialized and which happens
    after boot command line is processed.

    Signed-off-by: Michal Hocko
    Reported-by: Zhouping Liu
    Tested-by: Zhouping Liu
    Cc: Kamezawa Hiroyuki
    Cc: David Rientjes
    Cc: Li Zefan
    Cc: CAI Qian
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When calculating amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.
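
    A sketch of the adjustment in the dirtyable-memory calculation (the
    surrounding function is abridged; the key step is converting
    min_free_kbytes from kilobytes to pages before removing it from the pool):

    static unsigned long global_dirtyable_memory(void)
    {
            unsigned long x;

            x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
            x -= min(x, dirty_balance_reserve);

            /*
             * min_free_kbytes is kept for the allocator, not for dirty
             * pages: convert it from kilobytes to pages and take it out.
             */
            x -= min(x, (unsigned long)(min_free_kbytes >> (PAGE_SHIFT - 10)));

            if (!vm_highmem_is_dirtyable)
                    x -= min(x, highmem_dirtyable_memory(x));

            return x + 1;   /* make sure it is never zero */
    }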

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     
  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renamed only anon_vma_lock(); this one renames
    anon_vma_unlock() to anon_vma_unlock_write() as well.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it is
    even slightly slower than swapping to 2 such SSDs). The main contention
    comes from swap_info_get(). This patch tries to close the gap by adding a
    new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory, it's possible that get_swap_page() finds no swap pages while
    free swap pages actually exist, but that doesn't sound like a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading the
    flags is OK with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always take
    the former first to avoid deadlock.

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity get_swap_page() still holds swap_lock. But in practice,
    swap_lock isn't heavily contended in my test with this patch (or I can
    say there are other, much heavier bottlenecks like TLB flush). And BTW,
    it looks like get_swap_page() doesn't really need the lock: we never free
    swap_info[] and we check the SWP_WRITEOK flag. The only risk without the
    lock is that we could swap out to some low-priority swap device, but we
    can quickly recover after several rounds of swapping, so it doesn't sound
    like a big deal to me. But I'd prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.
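
    A sketch of the resulting locking rule (the wrapper function is
    hypothetical and abridged; swap_info_struct gains its own spinlock, and
    the global swap_lock is always taken first when both are needed):

    /* illustrative only: how a flag update on partition "type" takes the locks */
    static void mark_swap_writeok(int type)
    {
            struct swap_info_struct *p = swap_info[type];

            spin_lock(&swap_lock);          /* global: swap_list, priorities, ... */
            spin_lock(&p->lock);            /* per-partition: scan_swap_map() data */

            p->flags |= SWP_WRITEOK;        /* changing flags needs both locks held */

            spin_unlock(&p->lock);
            spin_unlock(&swap_lock);        /* readers of p->flags may hold either lock */
    }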

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch makes each swap partition have its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap; the swap entry type is the index into the array.

    In my test with 3 SSDs, this increases swapout throughput by 20%.
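
    A sketch of the mapping from swap entry to its address_space (close to
    what the patch introduces; the lookup helper is illustrative):

    /* one address_space (and thus one radix tree + tree_lock) per swap partition */
    extern struct address_space swapper_spaces[MAX_SWAPFILES];

    #define swap_address_space(entry) \
            (&swapper_spaces[swp_type(entry)])

    static struct address_space *lookup_swap_mapping(swp_entry_t entry)
    {
            /* the swap entry type indexes the per-partition array */
            return swap_address_space(entry);
    }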

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • According to akpm, this saves 1/2k text and makes things simple for the
    next patch.

    Numbers from Minchan:

    add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
    function old new delta
    page_mapping - 48 +48
    do_task_stat 2292 2308 +16
    page_remove_rmap 240 248 +8
    load_elf_binary 4500 4508 +8
    update_queue 532 536 +4
    scsi_probe_and_add_lun 2892 2896 +4
    lookup_fast 644 648 +4
    vcs_read 1040 1036 -4
    __ip_route_output_key 1904 1900 -4
    ip_route_input_noref 2508 2500 -8
    shmem_file_aio_read 784 772 -12
    __isolate_lru_page 272 256 -16
    shmem_replace_page 708 688 -20
    mark_buffer_dirty 228 208 -20
    __set_page_dirty_buffers 240 220 -20
    __remove_mapping 276 256 -20
    update_mmu_cache 500 476 -24
    set_page_dirty_balance 92 68 -24
    set_page_dirty 172 148 -24
    page_evictable 88 64 -24
    page_cache_pipe_buf_steal 248 224 -24
    clear_page_dirty_for_io 340 316 -24
    test_set_page_writeback 400 372 -28
    test_clear_page_writeback 516 488 -28
    invalidate_inode_page 156 128 -28
    page_mkclean 432 400 -32
    flush_dcache_page 360 328 -32
    __set_page_dirty_nobuffers 324 280 -44
    shrink_page_list 2412 2356 -56

    Signed-off-by: Shaohua Li
    Suggested-by: Andrew Morton
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
    THP before migrating") Hugh Dickins noted that the control flow for
    transhuge migration was difficult to follow. Unconditionally calling
    put_page() in numamigrate_isolate_page() made the failure paths of both
    migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
    complex than they should be. Further, he was extremely wary of an
    unlock_page() ever happening after a put_page(), even if the
    put_page() should never be the final put_page.

    Hugh implemented the following cleanup to simplify the path by calling
    putback_lru_page() inside numamigrate_isolate_page() if it failed to
    isolate and always calling unlock_page() within
    migrate_misplaced_transhuge_page().

    There is no functional change after this patch is applied but the code
    is easier to follow and unlock_page() always happens before put_page().

    [mgorman@suse.de: changelog only]
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
    NUMA configuration with NUMA Balancing will still need an extra page
    field. As Peter notes "Completely dropping 32bit support for
    CONFIG_NUMA_BALANCING would simplify things, but it would also remove
    the warning if we grow enough 64bit only page-flags to push the last-cpu
    out."

    [mgorman@suse.de: minor modifications]
    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This is a preparation patch for moving page->_last_nid into page->flags
    that moves page flag layout information to a separate header. This
    patch is necessary because otherwise there would be a circular
    dependency between mm_types.h and mm.h.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The current definition of count_vm_numa_events() is wrong for
    !CONFIG_NUMA_BALANCING, as the following would miss the side-effect:

    count_vm_numa_events(NUMA_FOO, bar++);

    There are no such users of count_vm_numa_events() but this patch fixes
    it as it is a potential pitfall. Ideally both would be converted to
    static inline but NUMA_PTE_UPDATES is not defined if
    !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
    static inline would be similarly clumsy.
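
    A sketch of the pitfall and the fix for the !CONFIG_NUMA_BALANCING stub
    (the exact macro body in the patch may differ slightly):

    #ifdef CONFIG_NUMA_BALANCING
    #define count_vm_numa_events(x, y) count_vm_events(x, y)
    #else
    /* old stub swallowed "y", so count_vm_numa_events(NUMA_FOO, bar++)
     * silently dropped the bar++ side-effect:
     *
     *      #define count_vm_numa_events(x, y) do {} while (0)
     *
     * fixed stub: still counts nothing, but evaluates "y" exactly once */
    #define count_vm_numa_events(x, y) do { (void)(y); } while (0)
    #endif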

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
    one base page is being migrated when in fact it can also be checking
    THP.

    The consequence is that a migration may be attempted when a target
    node is nearly full, only to fail later. It's unlikely to be user-visible
    but it should be fixed. While we are there, migrate_balanced_pgdat()
    should treat nr_migrate_pages as an unsigned long as it is treated as a
    watermark.

    Signed-off-by: Mel Gorman
    Suggested-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/me/be/ and clarify the comment a bit when we're changing it anyway.

    Signed-off-by: Mel Gorman
    Suggested-by: Simon Jeons
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If a storage interface or a USB network interface (the iSCSI case) exists
    in the current configuration, memory allocation with GFP_KERNEL during
    usb_device_reset() might trigger I/O on the storage interface itself and
    cause a deadlock: 'us->dev_mutex' is held in .pre_reset(), so the storage
    interface can't do I/O while the reset is triggered by another interface,
    and the error handling can't be completed if the reset is triggered by
    the storage interface itself (error handling path).

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Reviewed-by: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
    force I/O-less memory allocation during the runtime_resume/runtime_suspend
    callbacks of devices with the 'memalloc_noio' flag set.
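
    A sketch of how the callback invocation is wrapped (abridged from the
    runtime PM core; the helper names match the series, the surrounding code
    is simplified):

    static int rpm_callback(int (*cb)(struct device *), struct device *dev)
    {
            int retval;

            if (dev->power.memalloc_noio) {
                    unsigned int noio_flag;

                    /*
                     * Force GFP_NOIO behaviour for any allocation done
                     * inside the runtime_suspend/runtime_resume callback.
                     */
                    noio_flag = memalloc_noio_save();
                    retval = cb(dev);
                    memalloc_noio_restore(noio_flag);
            } else {
                    retval = cb(dev);
            }
            return retval;
    }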

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • A deadlock might be caused by allocating memory with GFP_KERNEL in the
    runtime_resume and runtime_suspend callbacks of network devices in the
    iSCSI situation, so mark network devices and their ancestors as
    'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio().
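
    A minimal usage sketch (the wrapper is hypothetical; exactly where the
    call sits in the netdev registration/unregistration paths is simplified):

    static void mark_netdev_noio(struct net_device *ndev, bool enable)
    {
            /*
             * Flag the device and its ancestors; their runtime PM callbacks
             * will then run with I/O-less memory allocation enforced.
             */
            pm_runtime_set_memalloc_noio(&ndev->dev, enable);
    }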

    Signed-off-by: Ming Lei
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Apply the introduced pm_runtime_set_memalloc_noio() to block devices so
    that the PM core will teach mm not to allocate memory with GFP_IOFS when
    calling the runtime_resume and runtime_suspend callbacks of block
    devices and their ancestors.

    Signed-off-by: Ming Lei
    Cc: Jens Axboe
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
    core teach mm not to allocate memory with the GFP_KERNEL flag, in order
    to avoid a probable deadlock.

    As explained in the comment, any GFP_KERNEL allocation inside
    runtime_resume() or runtime_suspend() on any one of the devices in the
    path from a block or network device to the root device in the device tree
    may cause a deadlock; the introduced pm_runtime_set_memalloc_noio() sets
    or clears the flag on the devices in that path recursively.
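
    A sketch of the flag and of the propagation idea (simplified: the real
    helper also avoids clearing the flag on an ancestor that another
    block/network descendant still needs; that refinement is omitted here):

    /* new bit in struct dev_pm_info (illustrative):
     *      unsigned int    memalloc_noio:1;    force GFP_NOIO in runtime PM callbacks
     */

    /* simplified propagation: walk from the device up to the root */
    void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
    {
            while (dev) {
                    spin_lock_irq(&dev->power.lock);
                    dev->power.memalloc_noio = enable;
                    spin_unlock_irq(&dev->power.lock);
                    dev = dev->parent;
            }
    }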

    Signed-off-by: Ming Lei
    Cc: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • This patch introduces the PF_MEMALLOC_NOIO process flag (in the 'flags'
    field of 'struct task_struct'), so that the flag can be set by a task to
    avoid doing I/O inside memory allocation in that task's context.

    The patch tries to solve a deadlock problem caused by block devices; the
    problem may happen at least in the situations below:

    - during block device runtime resume: if memory allocation with
    GFP_KERNEL is called inside the runtime resume callback of any one of its
    ancestors (or the block device itself), the deadlock may be triggered
    inside the memory allocation, since it might not complete until the block
    device becomes active and the involved page I/O finishes. The situation
    was first pointed out by Alan Stern. It is not a good approach to
    convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
    subsystems may be involved (for example, PCI, USB and SCSI may be
    involved for a USB mass storage device, and network devices are involved
    too in the iSCSI case).

    - during block device runtime suspend, because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a USB mass storage device, a USB bus reset
    will be put on the device, so there shouldn't be any memory allocation
    with GFP_KERNEL during the USB bus reset, otherwise a deadlock similar to
    the above may be triggered. Unfortunately, any USB device may include a
    mass storage interface in theory, so it would require all USB interface
    drivers to handle the situation. In fact, most USB drivers don't know
    how to handle a bus reset on the device and don't provide .pre_reset()
    and .post_reset() callbacks at all, so the USB core has to unbind and
    rebind the driver for these devices. So it is still not practical to
    resort to GFP_NOIO for solving the problem.

    The introduced solution can also be used by the block subsystem or block
    drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing an
    actual I/O transfer.

    It is not a good idea to convert all these GFP_KERNEL allocations in the
    affected path into GFP_NOIO, because the functions doing the allocations
    may be implemented as library code called in many other contexts.

    In fact, memalloc_noio_flags() can convert some of the current static
    GFP_NOIO allocations back into GFP_KERNEL in other, non-affected
    contexts: at least almost all GFP_NOIO in the USB subsystem can be
    converted into GFP_KERNEL after applying this approach, making GFP_NOIO
    allocation generally happen only in runtime resume/bus reset/block I/O
    transfer contexts.

    [1], several GFP_KERNEL allocation examples in runtime resume path

    - pci subsystem
    acpi_os_allocate
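
    The core of the mechanism, sketched (close to the helpers the series adds
    to sched.h; treat details as illustrative):

    /* strip __GFP_IO/__GFP_FS when the task asked for I/O-less allocations */
    static inline gfp_t memalloc_noio_flags(gfp_t flags)
    {
            if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                    flags &= ~(__GFP_IO | __GFP_FS);
            return flags;
    }

    static inline unsigned int memalloc_noio_save(void)
    {
            unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

            current->flags |= PF_MEMALLOC_NOIO;
            return flags;
    }

    static inline void memalloc_noio_restore(unsigned int flags)
    {
            current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
    }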

    Signed-off-by: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for the throttling and
    additional heuristics in balance_pgdat(). So, let's remove them and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • num_poisoned_pages counts the number of pages isolated by memory
    errors. But for THP, only one subpage is isolated because the memory
    error handler splits it, so it's wrong to add (1 << compound_trans_order).

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently soft_offline_page() is hard to maintain because it has many
    return points and goto statements. All of this mess comes from
    get_any_page().

    This function should only get the page refcount, as the name implies, but
    it does some page-isolating actions like SetPageHWPoison() and dequeuing
    a hugepage. This patch corrects it and introduces some internal
    subroutines to make the soft offlining code more readable and maintainable.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • There are too many return points randomly intermingled with some "goto
    done" return points. So adjust the function structure, one for the
    success path, the other for the failure path. Use atomic_long_inc
    instead of atomic_long_add.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • When doing

    $ echo paddr > /sys/devices/system/memory/soft_offline_page

    to offline a *free* page, the value of mce_bad_pages is incremented and
    the HWPoison flag is set on the page, but the page is still managed by
    the buddy allocator.

    $ cat /proc/meminfo | grep HardwareCorrupted

    shows the value.

    If we offline the same page again, the value of mce_bad_pages is
    incremented *again*, which means the value is now incorrect. Assume the
    page is still free during this short time:

    soft_offline_page()
        get_any_page()
            "else if (is_free_buddy_page(p))" branch return 0
        "goto done";
        "atomic_long_add(1, &mce_bad_pages);"

    This patch:

    Move the poisoned-page check to the beginning of the function in order
    to fix the error.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Tested-by: Naoya Horiguchi
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Several functions test MIGRATE_ISOLATE and some of them are on hot paths,
    but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION
    (i.e. CMA, memory-hotplug and memory-failure), which is not a common
    config option. So let's not add unnecessary overhead and code when we
    don't enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Function put_page_bootmem() is used to free pages allocated by bootmem
    allocator, so it should increase totalram_pages when freeing pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now all users of "number of pages managed by the buddy system" have been
    converted to use zone->managed_pages, so set zone->present_pages to what
    it should be:

    present_pages = spanned_pages - absent_pages;
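
    For reference, the relationship between the three zone page counters in
    the terms this series uses (a summary, not a quote from the patch):

    /*
     * spanned_pages = zone_end_pfn - zone_start_pfn   (range incl. holes)
     * present_pages = spanned_pages - absent_pages    (physical pages)
     * managed_pages = present_pages - reserved_pages  (pages handed to the
     *                                                  buddy allocator)
     */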

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages if
    what the user really wants is number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • …emblock_overlaps_region().

    The definition of struct movablecore_map is protected by
    CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
    is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
    movablecore_map in memblock_overlaps_region().

    Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Tang Chen
     
  • We now provide an option for users who don't want to specify physical
    memory addresses on the kernel command line.

    /*
     * For movablemem_map=acpi:
     *
     * SRAT:            |_____| |_____| |_________| |_________| ......
     * node id:            0       1        1            2
     * hotpluggable:       n       y        y            n
     * movablemem_map:             |_____| |_________|
     *
     * Using movablemem_map, we can prevent memblock from allocating memory
     * on ZONE_MOVABLE at boot time.
     */

    So the user just specifies movablemem_map=acpi, and the kernel will use
    the hotpluggable info in SRAT to determine which memory ranges should be
    set as ZONE_MOVABLE.

    If all the memory ranges in SRAT are hotpluggable, then no memory can be
    used by the kernel. But before parsing SRAT, memblock has already
    reserved some memory ranges for other purposes, such as the kernel image.
    We cannot prevent the kernel from using this memory, so we need to
    exclude these ranges even if the memory is hotpluggable.

    Furthermore, there could be several memory ranges in the single node
    which the kernel resides in. We may skip one range that has memory
    reserved by memblock, but if the rest of the memory is too small, then
    the kernel will fail to boot. So, make the whole node which the kernel
    resides in un-hotpluggable; then the kernel has enough memory to use.

    NOTE: Using this option will degrade NUMA performance, because the
    whole node will be set as ZONE_MOVABLE and the kernel cannot use memory
    on it. If users don't want to lose NUMA performance, they should just
    not use it.

    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: use strcmp()]
    Signed-off-by: Tang Chen
    Cc: KOSAKI Motohiro
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Wu Jianguo
    Cc: Yasuaki Ishimatsu
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Len Brown
    Cc: "Brown, Len"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen