26 Sep, 2014

7 commits

  • commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream.

    The fair zone allocation policy round-robins allocations between zones
    within a node to avoid age inversion problems during reclaim. If the
    first allocation fails, the batch counts are reset and a second attempt
    made before entering the slow path.

    One assumption made with this scheme is that batches expire at roughly
    the same time and the resets each time are justified. This assumption
    does not hold when zones reach their low watermark as the batches will
    be consumed at uneven rates. Allocation failures due to watermark
    depletion result in additional zonelist scans for the reset and another
    watermark check before hitting the slowpath.

    On UMA, the benefit is negligible -- around 0.25%. On a 4-socket NUMA
    machine it's variable due to the variability of measuring overhead with
    the vmstat changes. The system CPU overhead comparison looks like

               3.16.0-rc3   3.16.0-rc3     3.16.0-rc3
                  vanilla    vmstat-v5   lowercost-v5
    User           746.94       774.56         802.00
    System       65336.22     32847.27       40852.33
    Elapsed      27553.52     27415.04       27368.46

    However it is worth noting that the overall benchmark still completed
    faster and intuitively it makes sense to take as few passes as possible
    through the zonelists.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream.

    zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

               3.16.0-rc3     3.16.0-rc3   3.16.0-rc3
                  vanilla   rearrange-v5    vmstat-v5
    User           746.94         759.78       774.56
    System       65336.22       58350.98     32847.27
    Elapsed      27553.52       27282.02     27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream.

    The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64 for example

    o The zone->node field is shared with the zone lock and zone->node is
    accessed frequently from the page allocator due to the fair zone
    allocation policy.

    o span_seqlock is almost never used but shares a cache line with free_area

    o Some zone statistics share a cache line with the LRU lock so
    reclaim-intensive and allocator-intensive workloads can bounce the cache
    line on a stat update

    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines. Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.

    On the test configuration I used the overall size of struct zone shrunk
    by one cache line. On smaller machines, this is not likely to be
    noticeable. However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.

               3.16.0-rc3       3.16.0-rc3
                  vanilla   rearrange-v5r9
    User           746.94           759.78
    System       65336.22         58350.98
    Elapsed      27553.52         27282.02
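
    As a rough, self-contained illustration of the layout idea described above
    (this is not the upstream struct zone; the field names and the 64-byte line
    size are assumptions), grouping read-mostly fields together and pushing the
    write-intensive groups onto their own cache lines looks like this:

    /* Illustrative sketch only -- not the real struct zone. */
    #define CACHE_LINE 64

    struct zone_sketch {
            /* Read-mostly fields used by the allocator fast path. */
            unsigned long watermark[3];
            long lowmem_reserve[4];   /* signed, matching the signed watermark math */
            int node;

            /* Page-allocator intensive fields: their own cache line. */
            struct {
                    unsigned long lock;       /* stand-in for the zone spinlock */
                    unsigned long free_area;  /* stand-in for the free lists */
            } alloc __attribute__((aligned(CACHE_LINE)));

            /* Page-reclaim intensive fields and statistics: another line. */
            struct {
                    unsigned long lru_lock;   /* stand-in for the LRU lock */
                    unsigned long nr_scanned;
            } reclaim __attribute__((aligned(CACHE_LINE)));
    };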

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream.

    In the free path we calculate page_to_pfn multiple times. Reduce that.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream.

    X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page order.
    This converts a number of sites in mm/page_alloc.c to use unsigned int for
    order where possible.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream.

    Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction. This
    creates a dependency on calling sync compaction to avoid having subsequent
    calls to async compaction scan an enormous number of non-MOVABLE
    pageblocks each time it is called. On large machines, this could be
    potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for async
    compaction. It is updated every time a pageblock has been scanned in its
    entirety and when no pages from it were successfully isolated. The cached
    migration scanner pfn for sync compaction is updated only when called for
    sync compaction.
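
    A minimal sketch of that scheme, with illustrative type and field names
    (the real cache lives in struct zone and is maintained by the compaction
    code):

    /* Two cached migration-scanner positions, indexed by compaction mode. */
    enum compact_mode { COMPACT_ASYNC = 0, COMPACT_SYNC = 1 };

    struct compact_cache {
            unsigned long cached_migrate_pfn[2];    /* [async, sync] */
    };

    static void update_cached_migrate_pfn(struct compact_cache *cc,
                                          unsigned long pfn,
                                          enum compact_mode mode)
    {
            /* Async compaction always advances the async position; the sync
             * position is advanced only by a sync pass, so async passes no
             * longer depend on sync compaction to make forward progress. */
            cc->cached_migrate_pfn[COMPACT_ASYNC] = pfn;
            if (mode == COMPACT_SYNC)
                    cc->cached_migrate_pfn[COMPACT_SYNC] = pfn;
    }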

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.

    Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_ on
    a 9TB memory machine because onlining memory sections is too slow, and we
    found that setup_zone_migrate_reserve spent >90% of the time.

    The problem is, setup_zone_migrate_reserve scans all pageblocks
    unconditionally, but it is only necessary if the number of reserved
    blocks was reduced (i.e. memory hot-remove).

    Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means
    that the number of reserved pageblocks is almost always unchanged.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks and it reduces the overhead of
    setup_zone_migrate_reserve dramatically. The following table shows time
    of onlining a memory section.

    Amount of memory      | 128GB | 192GB | 256GB |
    ----------------------+-------+-------+-------+
    linux-3.12            |  23.9 |  31.4 |  44.5 |
    This patch            |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch  |  10.9 |  19.2 |  31.3 |
    ----------------------+-------+-------+-------+
    (milliseconds)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272
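
    A minimal sketch of the early exit this patch adds (the field and function
    names are illustrative and the surrounding pageblock walk is elided):

    struct zone_reserve {
            unsigned long nr_migrate_reserve_block; /* current MIGRATE_RESERVE blocks */
    };

    static void setup_zone_migrate_reserve(struct zone_reserve *zone,
                                           unsigned long reserve)
    {
            /* Memory hot-add almost never changes the reserve (it is capped
             * at 2 per zone), so skip the full pageblock scan in that case. */
            if (reserve == zone->nr_migrate_reserve_block)
                    return;
            zone->nr_migrate_reserve_block = reserve;

            /* ... otherwise fall through to the existing scan over every
             * pageblock in the zone ... */
    }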

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     

02 Jul, 2014

1 commit

  • commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

    The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to isolate
    the bits of interest. Similarly, masks are used to set a local copy of the
    bitmap and then cmpxchg is used to update the bitmap if there have been no
    other changes made in parallel.

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported the case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore during development of
    unrelated compaction patches, it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had an invalid value of 6, causing out-of-bounds access to the
    free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and is currently only fatal where page isolation is performed due to
    memory hot-remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race is no longer observed in testing.
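
    A standalone sketch of the word-based access pattern, assuming 4 flag bits
    per pageblock that never straddle a word boundary (the upstream helpers and
    bit layout differ in detail):

    #include <stdatomic.h>

    #define BITS_PER_BLOCK  4UL
    #define BITS_PER_LONG   (8 * sizeof(unsigned long))

    static unsigned long get_pageblock_bits(_Atomic unsigned long *bitmap,
                                            unsigned long block)
    {
            unsigned long bitidx = block * BITS_PER_BLOCK;
            unsigned long word = atomic_load(&bitmap[bitidx / BITS_PER_LONG]);

            /* One word read, then shift and mask -- no per-bit test_bit(). */
            return (word >> (bitidx % BITS_PER_LONG)) &
                   ((1UL << BITS_PER_BLOCK) - 1);
    }

    static void set_pageblock_bits(_Atomic unsigned long *bitmap,
                                   unsigned long block, unsigned long flags)
    {
            unsigned long bitidx = block * BITS_PER_BLOCK;
            _Atomic unsigned long *word = &bitmap[bitidx / BITS_PER_LONG];
            unsigned long shift = bitidx % BITS_PER_LONG;
            unsigned long mask = ((1UL << BITS_PER_BLOCK) - 1) << shift;
            unsigned long old = atomic_load(word), new;

            /* Retry with compare-and-exchange so updates racing on other bits
             * in the same word are never lost. */
            do {
                    new = (old & ~mask) | ((flags << shift) & mask);
            } while (!atomic_compare_exchange_weak(word, &old, new));
    }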

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mel Gorman
     

12 Sep, 2013

2 commits

  • This patch is based on KOSAKI's work, with a little more description added;
    please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently the system can enter a state where a zone has lots of free pages
    but only order-0 and order-1 pages, which means the zone is heavily
    fragmented. A high-order allocation can then cause long stalls (e.g. 60
    seconds) in the direct reclaim path, especially in environments with no
    swap and no compaction. This problem happened on v3.4, but the issue still
    lives in the current tree; the reason is that do_try_to_free_pages() enters
    a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, as kswapd thinks there's little point in trying all over
    again and wants to avoid an infinite loop. Instead it changes the order
    from high-order to order-0 because kswapd thinks order-0 is the most
    important. Look at commit 73ce02e9 for the details. If the watermarks are
    ok, kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill, so direct
    reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that approach
    would be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy. Commit 929bea7c71 ("vmscan:
    all_unreclaimable() use zone->all_unreclaimable as a name") describes the
    details.
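
    A minimal standalone sketch of the recalculated check (the fields and the
    heuristic multiplier are assumptions drawn from the description, not the
    exact upstream code):

    struct zone_counters {
            unsigned long pages_scanned;        /* since the last page was freed */
            unsigned long reclaimable_pages;    /* file + anon LRU pages, roughly */
    };

    /* Replaces the sticky zone->all_unreclaimable flag: the state is derived
     * from counters every time it is needed, so direct reclaim no longer
     * depends on kswapd flipping a flag to break out of its loop. */
    static int zone_reclaimable(const struct zone_counters *z)
    {
            return z->pages_scanned < z->reclaimable_pages * 6;
    }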

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice and
    activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
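
    A standalone sketch of the batching scheme (the names and the batch-size
    formula are illustrative, not the upstream implementation):

    #include <stddef.h>

    struct fair_zone {
            long alloc_batch;               /* remaining round-robin budget */
            unsigned long managed_pages;
    };

    /* Pick the first zone in preference order that still has budget left. */
    static struct fair_zone *pick_zone(struct fair_zone *zones, int nr, int pages)
    {
            for (int i = 0; i < nr; i++) {
                    if (zones[i].alloc_batch > 0) {
                            zones[i].alloc_batch -= pages;  /* charge the allocation */
                            return &zones[i];
                    }
            }
            return NULL;    /* every batch exhausted: enter the slowpath */
    }

    /* Reset each budget proportionally to zone size once all are exhausted;
     * the slowpath also kicks off kswapd so reclaim is spread out the same way. */
    static void reset_alloc_batches(struct fair_zone *zones, int nr)
    {
            for (int i = 0; i < nr; i++)
                    zones[i].alloc_batch = zones[i].managed_pages >> 10;
    }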

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Jul, 2013

1 commit


04 Jul, 2013

6 commits

  • Instead of leaving a hidden trap for the next person who comes along and
    wants to add something to mem_section, add a big fat warning about it
    needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init()
    to catch mistakes.

    Right now non-power-of-2 mem_sections cause a number of WARNs at boot
    (which don't clearly point to the size of mem_section as an issue), but
    the system limps on (temporarily, at least).

    This is based upon Dave Hansen's earlier RFC where he ran into the same
    issue:
    "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2"
    http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html

    Signed-off-by: Cody P Schafer
    Acked-by: Dave Hansen
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
    protect totalram_pages and zone->managed_pages. Other than the memory
    hotplug driver, totalram_pages and zone->managed_pages may also be
    modified at runtime by other drivers, such as Xen balloon,
    virtio_balloon etc. For those cases, memory hotplug lock is a little
    too heavy, so introduce a dedicated lock to protect totalram_pages and
    zone->managed_pages.

    Now we have simplified locking rules for totalram_pages and
    zone->managed_pages:

    1) no locking for read accesses because they are unsigned long.
    2) no locking for write accesses at boot time in single-threaded context.
    3) serialize write accesses at runtime by acquiring the dedicated
    managed_page_count_lock.

    Also adjust zone->managed_pages when freeing reserved pages into the
    buddy system, to keep totalram_pages and zone->managed_pages consistent.
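
    A standalone sketch of rule 3 above, using a pthread mutex in place of the
    kernel spinlock (the names follow the description; treat it as illustrative):

    #include <pthread.h>

    static pthread_mutex_t managed_page_count_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long totalram_pages;

    struct zone_counts {
            unsigned long managed_pages;
    };

    /* Runtime writers (balloon drivers, hotplug, ...) serialize on the
     * dedicated lock; readers just load the unsigned long counters. */
    static void adjust_managed_page_count(struct zone_counts *zone, long count)
    {
            pthread_mutex_lock(&managed_page_count_lock);
            zone->managed_pages += count;
            totalram_pages += count;
            pthread_mutex_unlock(&managed_page_count_lock);
    }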

    [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Signed-off-by: Cody P Schafer
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Historically, kswapd used to congestion_wait() at higher priorities if
    it was not making forward progress. This made no sense as the failure
    to make progress could be completely independent of IO. It was later
    replaced by wait_iff_congested() and removed entirely by commit 258401a6
    (mm: don't wait on congested zones in balance_pgdat()) as it was
    duplicating logic in shrink_inactive_list().

    This is problematic. If kswapd encounters many pages under writeback
    and it continues to scan until it reaches the high watermark then it
    will quickly skip over the pages under writeback and reclaim clean young
    pages or push applications out to swap.

    The use of wait_iff_congested() is not suited to kswapd as it will only
    stall if the underlying BDI is really congested or a direct reclaimer
    was unable to write to the underlying BDI. kswapd bypasses the BDI
    congestion as it sets PF_SWAPWRITE but even if this was taken into
    account then it would cause direct reclaimers to stall on writeback
    which is not desirable.

    This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
    encountering too many pages under writeback. If this flag is set and
    kswapd encounters a PageReclaim page under writeback then it'll assume
    that the LRU lists are being recycled too quickly before IO can complete
    and block waiting for some IO to complete.

    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently kswapd queues dirty pages for writeback if scanning at an
    elevated priority but the priority kswapd scans at is not related to the
    number of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten
    kswapd priority loop", the priority is related to the size of the LRU
    and the zone watermark which is no indication as to whether kswapd
    should write pages or not.

    This patch tracks if an excessive number of unqueued dirty pages are
    being encountered at the end of the LRU. If so, it indicates that dirty
    pages are being recycled before flusher threads can clean them and flags
    the zone so that kswapd will start writing pages until the zone is
    balanced.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Tested-by: Zlatko Calusic
    Cc: dormando
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 May, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
    code cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
    mm: Convert print_symbol to %pSR
    gfs2: Convert print_symbol to %pSR
    m32r: Convert print_symbol to %pSR
    iostats.txt: add easy-to-find description for field 6
    x86 cmpxchg.h: fix wrong comment
    treewide: Fix typo in printk and comments
    doc: devicetree: Fix various typos
    docbook: fix 8250 naming in device-drivers
    pata_pdc2027x: Fix compiler warning
    treewide: Fix typo in printks
    mei: Fix comments in drivers/misc/mei
    treewide: Fix typos in kernel messages
    pm44xx: Fix comment for "CONFIG_CPU_IDLE"
    doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
    mmzone: correct "pags" to "pages" in comment.
    kernel-parameters: remove outdated 'noresidual' parameter
    Remove spurious _H suffixes from ifdef comments
    sound: Remove stray pluses from Kconfig file
    radio-shark: Fix printk "CONFIG_LED_CLASS"
    doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
    ...

    Linus Torvalds
     

27 Mar, 2013

1 commit


23 Mar, 2013

1 commit

  • Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output
    below).

    The key hint is "page 4294967296 outside zone".
    4294967296 = 0x100000000 (bit 32 is set).

    The problem is in include/linux/mmzone.h:

    530 static inline unsigned zone_end_pfn(const struct zone *zone)
    531 {
    532 return zone->zone_start_pfn + zone->spanned_pages;
    533 }

    zone_end_pfn is "unsigned" (32 bits). Changing it to "unsigned long"
    (64 bits) fixes the problem.
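
    With that one type change, the helper from the quoted snippet becomes:

    static inline unsigned long zone_end_pfn(const struct zone *zone)
    {
            return zone->zone_start_pfn + zone->spanned_pages;
    }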

    zone_end_pfn() was added recently in commit 108bcc96ef70 ("mm: add & use
    zone_end_pfn() and zone_spans_pfn()")

    Output from the failure.

    No AGP bridge found
    page 4294967296 outside zone [ 4294967296 - 4327469056 ]
    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:552!
    invalid opcode: 0000 [#1] SMP
    Modules linked in:
    CPU 0
    Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10
    RIP: free_one_page+0x382/0x430
    Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420)
    Call Trace:
    __free_pages_ok+0x96/0xb0
    __free_pages+0x25/0x50
    __free_pages_bootmem+0x8a/0x8c
    __free_memory_core+0xea/0x131
    free_low_memory_core_early+0x4a/0x98
    free_all_bootmem+0x45/0x47
    mem_init+0x7b/0x14c
    start_kernel+0x216/0x433
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x144/0x153
    Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49

    Signed-off-by: Russ Anderson
    Reported-by: George Beshers
    Acked-by: Hedi Berriche
    Cc: Cody P Schafer
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russ Anderson
     

24 Feb, 2013

5 commits

  • Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
    zone_*() functions.

    Change node_end_pfn() to be a wrapper of pgdat_end_pfn().

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Factoring out these 2 checks makes it more clear what we are actually
    checking for.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn &
    spanned_pages will need fixing, either by actually using the seqlock or
    clever memory barrier usage.

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • This is a preparation patch for moving page->_last_nid into page->flags
    that moves page flag layout information to a separate header. This
    patch is necessary because otherwise there would be a circular
    dependency between mm_types.h and mm.h.

    Signed-off-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Several functions test MIGRATE_ISOLATE and some of those are hotpath but
    MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.,
    CMA, memory-hotplug and memory-failure), which are not common config
    options. So let's not add unnecessary overhead and code when we don't
    enable CONFIG_MEMORY_ISOLATION.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

05 Jan, 2013

1 commit

  • Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever
    problem") added an isolated pageblocks counter (nr_pageblock_isolate in
    struct zone) and used it to adjust free pages counter in
    zone_watermark_ok_safe() to prevent kswapd looping forever problem.

    Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages")
    fixed accounting of isolated pages in global free pages counter. It
    made the previous zone_watermark_ok_safe() fix unnecessary and
    potentially harmful (because now isolated pages may be accounted twice,
    making the free pages counter incorrect).

    This patch removes the special isolated pageblocks counter altogether
    which fixes zone_watermark_ok_safe() free pages check.

    Reported-by: Tomasz Stanislawski
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

    Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble to memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    During fixing bugs caused by inaccurate zone->present_pages, we found
    zone->present_pages has been abused. The field zone->present_pages may
    have different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset tries to introduce a new field named "managed_pages" to
    struct zone, which counts "pages managed by the buddy system". And revert
    zone->present_pages to count "physical pages existing in a zone", which
    also keeps it consistent with pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

12 Dec, 2012

1 commit

  • Commits 2139cbe627b8 ("cma: fix counting of isolated pages") and
    d95ea5d18e69 ("cma: fix watermark checking") introduced a reliable
    method of free page accounting when memory is being allocated from CMA
    regions, so the workaround introduced earlier by commit 49f223a9cd96
    ("mm: trigger page reclaim in alloc_contig_range() to stabilise
    watermarks") can be finally removed.

    Signed-off-by: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Arnd Bergmann
    Cc: Mel Gorman
    Acked-by: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     

11 Dec, 2012

1 commit


17 Nov, 2012

1 commit

  • When MEMCG is configured on (even when it's disabled by boot option),
    when adding or removing a page to/from its lru list, the zone pointer
    used for stats updates is nowadays taken from the struct lruvec. (On
    many configurations, calculating zone from page is slower.)

    But we have no code to update all the lruvecs (per zone, per memcg) when
    a memory node is hotadded. Here's an extract from the oops which
    results when running numactl to bind a program to a newly onlined node:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
    IP: __mod_zone_page_state+0x9/0x60
    Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
    Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
    Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
    ...

    The natural solution might be to use a memcg callback whenever memory is
    hotadded; but that solution has not been scoped out, and it happens that
    we do have an easy location at which to update lruvec->zone. The lruvec
    pointer is discovered either by mem_cgroup_zone_lruvec() or by
    mem_cgroup_page_lruvec(), and both of those do know the right zone.

    So check and set lruvec->zone in those; and remove the inadequate
    attempt to set lruvec->zone from lruvec_init(), which is called before
    NODE_DATA(node) has been allocated in such cases.

    Ah, there was one exception. For no particularly good reason,
    mem_cgroup_force_empty_list() has its own code for deciding lruvec.
    Change it to use the standard mem_cgroup_zone_lruvec() and
    mem_cgroup_get_lru_size() too. In fact it was already safe against such
    an oops (the lru lists in danger could only be empty), but we're better
    proofed against future changes this way.

    I've marked this for stable (3.6) since we introduced the problem in 3.5
    (now closed to stable); but I have no idea if this is the only fix
    needed to get memory hotadd working with memcg in 3.6, and received no
    answer when I enquired twice before.

    Reported-by: Tang Chen
    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Wen Congyang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Oct, 2012

7 commits

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes, but in the CMA use case, a contiguous memory allocation
    failing is far worse than the access latency to an mlocked page being
    variable while CMA is running. If someone wants to make the system realtime,
    he shouldn't enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.
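
    A standalone sketch of the idea (the node count, distance callback and
    names are illustrative):

    #define MAX_NODES        64
    #define RECLAIM_DISTANCE 30

    static unsigned long reclaim_nodes[MAX_NODES];  /* one bitmask per node */

    /* Precompute, per node, the set of nodes close enough to reclaim from. */
    static void build_reclaim_masks(int nr_nodes, int (*distance)(int, int))
    {
            for (int n = 0; n < nr_nodes; n++)
                    for (int m = 0; m < nr_nodes; m++)
                            if (distance(n, m) <= RECLAIM_DISTANCE)
                                    reclaim_nodes[n] |= 1UL << m;
    }

    /* During the zonelist walk: try zone reclaim only if the zone's node is
     * in the local node's mask; otherwise skip straight to the next zone. */
    static int should_try_zone_reclaim(int local_node, int zone_node)
    {
            return (reclaim_nodes[local_node] >> zone_node) & 1;
    }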

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have on the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism is
    clumsy but a better option is not obvious.
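
    A standalone sketch of the skip hint itself, one bit per pageblock (the
    sizes and names are illustrative):

    #include <string.h>

    #define NR_PAGEBLOCKS   1024
    #define BITS_PER_LONG   (8 * sizeof(unsigned long))

    struct skip_cache {
            unsigned long bits[NR_PAGEBLOCKS / BITS_PER_LONG];
    };

    static int get_pageblock_skip(const struct skip_cache *c, unsigned long block)
    {
            return (c->bits[block / BITS_PER_LONG] >> (block % BITS_PER_LONG)) & 1;
    }

    /* Set when a full scan of the pageblock isolated nothing; checked by both
     * scanners before any scanning of that block takes place. */
    static void set_pageblock_skip(struct skip_cache *c, unsigned long block)
    {
            c->bits[block / BITS_PER_LONG] |= 1UL << (block % BITS_PER_LONG);
    }

    /* The whole cache is discarded when judged stale, e.g. when the scanners
     * meet and at least ~5 seconds have passed, or after repeated failures. */
    static void reset_skip_cache(struct skip_cache *c)
    {
            memset(c->bits, 0, sizeof(c->bits));
    }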

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplements the concept of compact_cached_free_pfn on top for both
    migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
    __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
    counter always available (not only when CONFIG_CMA=y).

    [akpm@linux-foundation.org: use conventional migratetype naming]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

01 Aug, 2012

2 commits

  • If swap is backed by network storage such as NBD, there is a risk that a
    large number of reclaimers can hang the system by consuming all
    PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
    min_free_kbytes in advance which is a bit fragile.

    This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
    are in use. If the system is routinely getting throttled the system
    administrator can increase min_free_kbytes so degradation is smoother but
    the system will keep running.
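
    A minimal sketch of the throttling test described above (the per-zone
    accounting is elided and the names are illustrative):

    /* Throttle direct reclaimers once free memory falls below half of the
     * min-watermark reserves that back PF_MEMALLOC allocations. */
    static int pfmemalloc_watermark_low(unsigned long nr_free_pages,
                                        unsigned long total_min_watermark)
    {
            return nr_free_pages < total_min_watermark / 2;
    }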

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When hotplug offlining happens on zone A, it starts to mark freed pages as
    MIGRATE_ISOLATE type in the buddy allocator to prevent further allocation.
    (MIGRATE_ISOLATE is a rather ironic type because the pages are apparently
    in the buddy allocator but we can't allocate them.)

    When a memory shortage happens during hotplug offlining, the current task
    starts to reclaim and then wakes up kswapd. Kswapd checks the watermark and
    then goes to sleep because the current zone_watermark_ok_safe doesn't
    consider the MIGRATE_ISOLATE freed page count. The current task continues
    to reclaim in the direct reclaim path without kswapd's help. The problem is
    that zone->all_unreclaimable is set only by kswapd, so the current task
    would loop forever like below.

    __alloc_pages_slowpath
    restart:
    wake_all_kswapd
    rebalance:
    __alloc_pages_direct_reclaim
    do_try_to_free_pages
    if global_reclaim && !all_unreclaimable
    return 1; /* It means we did did_some_progress */
    skip __alloc_pages_may_oom
    should_alloc_retry
    goto rebalance;

    If we apply KOSAKI's patch [1], which doesn't depend on kswapd for
    setting zone->all_unreclaimable, we can solve this problem by killing some
    task in the direct reclaim path. But it doesn't wake up kswapd, still. It
    could still be a problem if another subsystem needs a GFP_ATOMIC request. So
    kswapd should consider MIGRATE_ISOLATE when it calculates free pages BEFORE
    going to sleep.

    This patch counts the number of MIGRATE_ISOLATE page blocks and
    zone_watermark_ok_safe will consider it if the system has such blocks
    (fortunately, this is very rare, so there is no overhead problem and kswapd
    is never a hotpath).

    Copy/modify from Mel's quote
    "
    Ideal solution would be "allocating" the pageblock.
    It would keep the free space accounting as it is but historically,
    memory hotplug didn't allocate pages because it would be difficult to
    detect if a pageblock was isolated or if part of some balloon.
    Allocating just full pageblocks would work around this; however,
    it would play very badly with CMA.
    "

    [1] http://lkml.org/lkml/2012/6/14/74

    [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
    [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
    Signed-off-by: Minchan Kim
    Suggested-by: KOSAKI Motohiro
    Tested-by: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim