14 Jan, 2011

40 commits

  • Used by both the paravirt and non-paravirt versions of set_pmd_at.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Clear the compound mapping for anonymous compound pages, as already
    happens for regular anonymous pages. But crash if mapping is set for any
    tail page. Also, the PageAnon check is meaningless for tail pages: it
    only makes sense for the head page; for a tail page it can only hide
    bugs, and we definitely don't want to hide bugs.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Futex code is smarter than most other gup_fast O_DIRECT code and knows
    about the compound internals. However now doing a put_page(head_page)
    will not release the pin on the tail page taken by gup-fast, leading to
    all sorts of refcounting bugchecks. Getting a stable head_page is a little
    tricky.

    page_head = page is there because if this is not a tail page it's also the
    page_head. Only in case this is a tail page, compound_head is called,
    otherwise it's guaranteed unnecessary. And if it's a tail page
    compound_head has to run atomically inside irq disabled section
    __get_user_pages_fast before returning. Otherwise ->first_page won't be a
    stable pointer.

    Disabling irqs before __get_user_pages_fast and re-enabling them after
    running compound_head is needed because if __get_user_pages_fast returns
    1, it means the huge pmd is established and cannot go away from under us.
    pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
    wait for local_irq_enable before the IPI delivery can return. This means
    __split_huge_page_refcount can't be running from under us, and in turn
    when we run compound_head(page) we're not reading a dangling pointer from
    tailpage->first_page. Then, once we have a stable head page, it is
    always safe to call compound_lock, and after taking the compound lock on
    the head page we can finally re-check whether the page returned by
    gup-fast is still a tail page, in which case we're set and we didn't need
    to split the hugepage in order to take a futex on it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • After releasing the compound_lock, split_huge_page can still run and
    release the page before put_page_testzero runs.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Alter compound get_page/put_page to keep references on subpages too, in
    order to allow __split_huge_page_refcount to split a hugepage even while
    subpages have been pinned by one of the get_user_pages() variants.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a new compound_lock() needed to serialize put_page against
    __split_huge_page_refcount().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Define MADV_HUGEPAGE.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
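
From userspace, the MADV_HUGEPAGE advice defined by the entry above is passed to madvise(2). The following is a minimal sketch, not code from the patch: the helper names hint_hugepages() and alloc_hinted() are hypothetical, and the call may fail (typically with EINVAL) on kernels built without transparent hugepage support, so the return value is reported rather than treated as fatal.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and MADV_HUGEPAGE on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: ask the kernel to back [addr, addr + len) with
 * transparent hugepages. Returns 0 on success, -1 with errno set otherwise.
 */
static int hint_hugepages(void *addr, size_t len)
{
#ifdef MADV_HUGEPAGE
    return madvise(addr, len, MADV_HUGEPAGE);
#else
    (void)addr;
    (void)len;
    errno = ENOSYS; /* headers too old to know about MADV_HUGEPAGE */
    return -1;
#endif
}

/*
 * Hypothetical helper: map len bytes of anonymous memory and apply the hint.
 * The mapping is returned even if the hint is rejected; *hint_rc receives
 * the madvise() result so the caller can decide what to do.
 */
static void *alloc_hinted(size_t len, int *hint_rc)
{
    void *p;

    assert(hint_rc != NULL);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *hint_rc = hint_hugepages(p, len);
    return p;
}
```

The advice is only a hint: the region remains usable whether or not the kernel accepts it, which is why the sketch keeps the mapping when madvise() fails.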
     
  • Documentation/vm/transhuge.txt

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • page_count shows the count of the head page, but the actual check is done
    on the tail page, so show what is really being checked.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When a swapcache page is replaced by a ksm page, it's best to free that
    swap immediately.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I think determine_dirtyable_memory() is a rather costly function since it
    needs many atomic reads to gather zone/global page state. But when we
    use both vm_dirty_bytes and dirty_background_bytes, we don't need that
    costly calculation.

    This patch eliminates such unnecessary overhead.

    NOTE: the newly added if condition might add overhead in the normal path.
    But it should be _really_ small because we need to access both
    vm_dirty_bytes and dirty_background_bytes anyway, so they are likely to
    hit the cache.

    [akpm@linux-foundation.org: fix used-uninitialised warning]
    Signed-off-by: Minchan Kim
    Cc: Wu Fengguang
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
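
The shortcut described in the entry above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: the struct, the page size, the made-up page count and the call counter are all illustrative, and dirtyable_pages() merely stands in for determine_dirtyable_memory().

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL /* illustrative page size */

/* Illustrative copy of the four tunables; a *_bytes value of 0 means "use the ratio". */
struct dirty_tunables {
    unsigned long dirty_bytes;
    unsigned long dirty_background_bytes;
    unsigned int dirty_ratio;            /* percent */
    unsigned int dirty_background_ratio; /* percent */
};

static unsigned long expensive_calls; /* how often the costly helper ran */

/* Stand-in for determine_dirtyable_memory(): pretend this is expensive. */
static unsigned long dirtyable_pages(void)
{
    expensive_calls++;
    return 1000000UL; /* made-up number of pages */
}

/* Convert a byte limit to pages, rounding up. */
static unsigned long bytes_to_pages(unsigned long bytes)
{
    return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

static void dirty_limits(const struct dirty_tunables *t,
                         unsigned long *background, unsigned long *dirty)
{
    unsigned long available = 0;

    /* Only gather page state if at least one limit is expressed as a ratio. */
    if (!t->dirty_bytes || !t->dirty_background_bytes)
        available = dirtyable_pages();

    *dirty = t->dirty_bytes ? bytes_to_pages(t->dirty_bytes)
                            : t->dirty_ratio * available / 100;
    *background = t->dirty_background_bytes
                      ? bytes_to_pages(t->dirty_background_bytes)
                      : t->dirty_background_ratio * available / 100;
}
```

With both byte limits set, the counter stays at zero; as soon as either limit falls back to a ratio, the costly helper runs once.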
     
  • When the numa_zonelist_order parameter is set to "node" or "zone" on the
    command line, it still shows as "default" in sysctl. That's because the
    early_param parsing function changes only the user_zonelist_order
    variable. Fix this by copying the user-provided string to
    numa_zonelist_order if it was successfully parsed.

    Signed-off-by: Volodymyr G Lukiianyk
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Volodymyr G. Lukiianyk
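
A userspace sketch of the fix described above: parse the option, and only on success copy the string into the buffer that is later displayed. The variable names mirror those in the commit message, but the enum, the buffer length and the full-word matching are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum zonelist_order { ORDER_DEFAULT, ORDER_NODE, ORDER_ZONE };

#define ORDER_NAME_LEN 16 /* illustrative buffer size */

static enum zonelist_order user_zonelist_order = ORDER_DEFAULT;
static char numa_zonelist_order[ORDER_NAME_LEN] = "default"; /* what sysctl would show */

/* Parse the option; on failure nothing is modified and -1 is returned. */
static int parse_order(const char *s)
{
    if (strcmp(s, "default") == 0)
        user_zonelist_order = ORDER_DEFAULT;
    else if (strcmp(s, "node") == 0)
        user_zonelist_order = ORDER_NODE;
    else if (strcmp(s, "zone") == 0)
        user_zonelist_order = ORDER_ZONE;
    else
        return -1;
    return 0;
}

/* Command-line hook: keep the displayed string in sync with the parsed value. */
static int setup_order(const char *s)
{
    int ret;

    if (s == NULL)
        return -1;
    ret = parse_order(s);
    if (ret == 0) {
        snprintf(numa_zonelist_order, sizeof(numa_zonelist_order), "%s", s);
        /* Accepted names are short, so the copy is never truncated. */
        assert(strcmp(numa_zonelist_order, s) == 0);
    }
    return ret;
}
```

Without the copy in setup_order(), the parsed enum would change while the displayed string kept reporting "default", which is the symptom the commit describes.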
     
  • When kswapd is woken up for a high-order allocation, it takes account of
    the highest usable zone by the caller (the classzone idx). During
    allocation, this index is used to select the lowmem_reserve[] that should
    be applied to the watermark calculation in zone_watermark_ok().

    When balancing a node, kswapd considers the highest unbalanced zone to be
    the classzone index. This will always be at least the caller's
    classzone_idx and can be higher. However, sleeping_prematurely() always
    considers the lowest zone (e.g. ZONE_DMA) to be the classzone index.
    This means that sleeping_prematurely() can consider a zone to be balanced
    that is unusable by the allocation request that originally woke kswapd.
    This patch changes sleeping_prematurely() to use a classzone_idx matching
    the value it used in balance_pgdat().

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Eric B Munson
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
    be balanced but sleeping_prematurely does not. This can force kswapd to
    stay awake longer than it should. This patch fixes it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Eric B Munson
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When kswapd wakes up, it reads its order and classzone from pgdat and
    calls balance_pgdat. While it's awake, it potentially reclaims at a high
    order and a low classzone index. This might have been a once-off that was
    not required by subsequent callers. However, because the pgdat values
    were not reset, they remain artificially high while balance_pgdat() is
    running and potentially kswapd enters a second unnecessary reclaim cycle.
    Reset the pgdat order and classzone index after reading.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
    there was a race pushing a zone below its watermark. If the race
    happened, it stays awake. However, balance_pgdat() can decide to reclaim
    at order-0 if it decides that high-order reclaim is not working as
    expected. This information is not passed back to sleeping_prematurely().
    The impact is that kswapd remains awake reclaiming pages long after it
    should have gone to sleep. This patch passes the adjusted order to
    sleeping_prematurely and uses the same logic as balance_pgdat to decide if
    it's ok to go to sleep.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When reclaiming for high-orders, kswapd is responsible for balancing a
    node but it should not reclaim excessively. It avoids excessive reclaim
    by treating a node as balanced if any zone in it is balanced. In cases
    where zone sizes are imbalanced (e.g. ZONE_DMA with both ZONE_DMA32 and
    ZONE_NORMAL), kswapd can go to sleep prematurely as just one small zone
    was balanced.

    This alters the sleep logic of kswapd slightly. It counts the number of
    pages that make up the balanced zones. If the total number of balanced
    pages is more than a quarter of the zone, kswapd will go back to sleep.
    This should keep a node balanced without reclaiming an excessive number of
    pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
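
The sleep rule described above can be sketched as a small predicate. This is an illustrative model, not the kernel function: it assumes that "a quarter" is measured against the total pages of the zones kswapd is balancing (indices 0 through classzone_idx), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative per-zone summary. */
struct zone_info {
    unsigned long present_pages; /* size of the zone in pages */
    bool balanced;               /* zone currently meets its watermark */
};

/*
 * True if the balanced zones together hold more than a quarter of the pages
 * in zones[0..classzone_idx], in which case kswapd may go back to sleep.
 */
static bool node_balanced_enough(const struct zone_info *zones,
                                 size_t classzone_idx)
{
    unsigned long present = 0, balanced = 0;
    size_t i;

    assert(zones != NULL);
    for (i = 0; i <= classzone_idx; i++) {
        present += zones[i].present_pages;
        if (zones[i].balanced)
            balanced += zones[i].present_pages;
    }
    return balanced > present / 4;
}
```

With a tiny balanced DMA zone next to large unbalanced zones the predicate is false, so one small zone alone no longer lets kswapd sleep; once a large zone is balanced it becomes true.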
     
  • Simon Kirby reported the following problem:

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is reclaiming order-3 pages requested by SLUB
    too aggressively.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say whether his problem is fixed or not, and these figures are
    to show the impact on the ordinary cases. Finally, the "vmscan" figures
    are taken from /proc/vmstat instead of the tracepoints. There is less
    information but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
                                        traceonly          kanyzone
    Transactions per second:             156.00 ( 0.00%)    153.00 (-1.96%)
    Data megabytes read per second:       21.51 ( 0.00%)     21.52 ( 0.05%)
    Data megabytes written per second:    29.28 ( 0.00%)     29.11 (-0.58%)
    Files created alone per second:      250.00 ( 0.00%)    416.00 (39.90%)
    Files create/transact per second:     79.00 ( 0.00%)     76.00 (-3.95%)
    Files deleted alone per second:      520.00 ( 0.00%)    420.00 (-23.81%)
    Files delete/transact per second:     79.00 ( 0.00%)     76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)       16.58       17.4
    Total Elapsed Time (seconds)              218.48     222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                                0          4
    Direct reclaim pages scanned                   0        203
    Direct reclaim pages reclaimed                 0        184
    Kswapd pages scanned                      326631     322018
    Kswapd pages reclaimed                    312632     309784
    Kswapd low wmark quickly                       1          4
    Kswapd high wmark quickly                    122        475
    Kswapd skip congestion_wait                    1          0
    Pages activated                           700040     705317
    Pages deactivated                         212113     203922
    Pages written                               9875       6363

    Total pages scanned                       326631     322221
    Total pages reclaimed                     312632     309968
    %age total pages scanned/reclaimed        95.71%     96.20%
    %age total pages scanned/written           3.02%      1.97%

    proc vmstat: Faults
    Major Faults                                 300        254
    Minor Faults                              645183     660284
    Page ins                                  493588     486704
    Page outs                                4960088    4986704
    Swap ins                                    1230        661
    Swap outs                                   9869       6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
    and overall fewer pages were scanned and reclaimed. Swap in/out is
    particularly reduced again reflecting kswapd throwing out fewer pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
                                           traceonly   kanyzone
    User/Sys Time Running Test (seconds)       29.29      28.87
    Total Elapsed Time (seconds)              492.18     488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims                             2128       1460
    Direct reclaim pages scanned             2284822    1496067
    Direct reclaim pages reclaimed            148919     110937
    Kswapd pages scanned                    15450014   16202876
    Kswapd pages reclaimed                   8503697    8537897
    Kswapd low wmark quickly                    3100       3397
    Kswapd high wmark quickly                   1860       7243
    Kswapd skip congestion_wait                  708        801
    Pages activated                             9635       9573
    Pages deactivated                           1432       1271
    Pages written                                223       1130

    Total pages scanned                     17734836   17698943
    Total pages reclaimed                    8652616    8648834
    %age total pages scanned/reclaimed        48.79%     48.87%
    %age total pages scanned/written           0.00%      0.01%

    proc vmstat: Faults
    Major Faults                                 165        221
    Minor Faults                             9655785    9656506
    Page ins                                    3880       7228
    Page outs                               37692940   37480076
    Swap ins                                       0         69
    Swap outs                                     19         15

    Again fewer pages are scanned and reclaimed as expected and this time the
    test completed faster. Note that kswapd is hitting its watermarks faster
    (low and high wmark quickly) which I expect is due to kswapd reclaiming
    fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

       correct  incorrect   %  Function                     File       Line
       -------  ---------   -  --------                     ----       ----
       1924262   71332401  97  grab_cache_page_write_begin  filemap.c  2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • page_mapping() has an unlikely() annotation on the test for whether the
    mapping has PAGE_MAPPING_ANON set. But running the annotated branch
    profiler on a normal desktop system doing various tasks (xchat,
    evolution, firefox, distcc) shows it is not really that unlikely that
    the mapping here will have the PAGE_MAPPING_ANON flag set:

       correct  incorrect   %  Function      File  Line
       -------  ---------   -  --------      ----  ----
      35935762 1270265395  97  page_mapping  mm.h  659
    1306198001     143659   0  page_mapping  mm.h  657
     203131478     121586   0  page_mapping  mm.h  657
       5415491       1116   0  page_mapping  mm.h  657
      74899487       1116   0  page_mapping  mm.h  657
     203132845        224   0  page_mapping  mm.h  659
       5415464         27   0  page_mapping  mm.h  659
         13552          0   0  page_mapping  mm.h  657
         13552          0   0  page_mapping  mm.h  659
        242630          0   0  page_mapping  mm.h  657
        242630          0   0  page_mapping  mm.h  659
      74899487          0   0  page_mapping  mm.h  659

    The page_mapping() is a static inline, which is why it shows up multiple
    times.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, so it was incorrect 39% of the time. With
    this much of an error, it's best to simply remove the unlikely and let
    the compiler and branch prediction figure this out.

    Signed-off-by: Steven Rostedt
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • The mapping_unevictable() has a likely() around the mapping parameter.
    This mapping parameter comes from page_mapping() which has an unlikely()
    that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
    NULL. One would think that this unlikely() means that the mapping
    returned by page_mapping() would not be NULL, but where page_mapping() is
    used just above mapping_unevictable(), that unlikely() is incorrect most
    of the time. This means that the "likely(mapping)" in
    mapping_unevictable() is incorrect most of the time.

    Running the annotated branch profiler on my main box which runs firefox,
    evolution, xchat and is part of my distcc farm, I had this:

       correct  incorrect   %  Function             File       Line
       -------  ---------   -  --------             ----       ----
      12872836 1269443893  98  mapping_unevictable  pagemap.h  51
      35935762 1270265395  97  page_mapping         mm.h       659
    1306198001     143659   0  page_mapping         mm.h       657
     203131478     121586   0  page_mapping         mm.h       657
       5415491       1116   0  page_mapping         mm.h       657
      74899487       1116   0  page_mapping         mm.h       657
     203132845        224   0  page_mapping         mm.h       659
       5415464         27   0  page_mapping         mm.h       659
         13552          0   0  page_mapping         mm.h       657
         13552          0   0  page_mapping         mm.h       659
        242630          0   0  page_mapping         mm.h       657
        242630          0   0  page_mapping         mm.h       659
      74899487          0   0  page_mapping         mm.h       659

    The page_mapping() is a static inline, which is why it shows up multiple
    times. The mapping_unevictable() is also a static inline but seems to be
    used only once in my setup.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, so it was incorrect 39% of the time. Perhaps
    this is enough to remove the unlikely from page_mapping() as well.

    Signed-off-by: Steven Rostedt
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • IS_ERR() already implies unlikely(), so it can be omitted here.

    Signed-off-by: Tobias Klauser
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
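
To see why wrapping IS_ERR() in another unlikely() is redundant, here is a userspace re-creation of the error-pointer helpers. It is a sketch modelled on the kernel's err.h rather than a copy of it: it assumes a GCC-compatible compiler for __builtin_expect and uses uintptr_t where the kernel uses unsigned long.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Error codes are encoded in the top MAX_ERRNO values of the address space. */
#define IS_ERR_VALUE(x) unlikely((uintptr_t)(x) >= (uintptr_t)-MAX_ERRNO)

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* The branch hint already lives inside IS_ERR_VALUE(). */
static inline int IS_ERR(const void *ptr)
{
    return IS_ERR_VALUE(ptr);
}
```

Because the hint is applied inside the helper, `if (IS_ERR(p))` and `if (unlikely(IS_ERR(p)))` tell the compiler the same thing, so the outer annotation can simply be dropped.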
     
  • Today, tasklist_lock in migrate_pages doesn't protect anything.
    rcu_read_lock() provides enough protection for the pid hash walk.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • __get_user_pages gets a new 'nonblocking' parameter to signal that the
    caller is prepared to re-acquire mmap_sem and retry the operation if
    needed. This is used to split off long operations if they are going to
    block on a disk transfer, or when we detect contention on the mmap_sem.

    [akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Use a single code path for faulting in pages during mlock.

    The reason to have it in this patch series is that I did not want to
    update both code paths in a later change that releases mmap_sem when
    blocking on disk.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Move the code to mlock pages from __mlock_vma_pages_range() to
    follow_page().

    This allows __mlock_vma_pages_range() to not have to break down work into
    16-page batches.

    An additional motivation for doing this within the present patch series is
    that it'll make it easier for a later change to drop mmap_sem when
    blocking on disk (we'd like to be able to resume at the page that was read
    from disk instead of at the start of a 16-page batch).

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Currently mlock() holds mmap_sem in exclusive mode while the pages get
    faulted in. In the case of a large mlock, this can potentially take a
    very long time, during which various commands such as 'ps auxw' will
    block. This makes sysadmins unhappy:

    real 14m36.232s
    user 0m0.003s
    sys 0m0.015s
    (output from 'time ps auxw' while a 20GB file was being mlocked without
    being previously preloaded into page cache)

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages, in small batches, releasing mmap_sem when we
    block on disk access or when we detect some contention.

    This patch:

    Before this change, mlock() holds mmap_sem in exclusive mode while the
    pages get faulted in. In the case of a large mlock, this can potentially
    take a very long time. Various things will block while mmap_sem is held,
    including 'ps auxw'. This can make sysadmins angry.

    I propose that mlock() could release mmap_sem after the VM_LOCKED bits
    have been set in all appropriate VMAs. Then a second pass could be done
    to actually mlock the pages with mmap_sem held for reads only. We need to
    recheck the vma flags after we re-acquire mmap_sem, but this is easy.

    In the case where a vma has been munlocked before mlock completes, pages
    that were already marked as PageMlocked() are handled by the munlock()
    call, and mlock() is careful to not mark new page batches as PageMlocked()
    after the munlock() call has cleared the VM_LOCKED vma flags. So, the end
    result will be identical to what'd happen if munlock() had executed after
    the mlock() call.

    In a later change, I will allow the second pass to release mmap_sem when
    blocking on disk accesses or when it is otherwise contended, so that it
    won't be held for long periods of time even in shared mode.

    Signed-off-by: Michel Lespinasse
    Tested-by: Valdis Kletnieks
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When faulting in pages for mlock(), we want to break COW for anonymous or
    file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
    need to write-fault into VM_SHARED vmas since shared file pages can be
    mlocked first and dirtied later, when/if they actually get written to.
    Skipping the write fault is desirable, as we don't want to unnecessarily
    cause these pages to be dirtied and queued for writeback.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
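
The decision described above reduces to a single test on the vma flags: fault for write only when the mapping is writable and not shared. The sketch below is illustrative; the flag constants are defined locally for the example (the kernel's writable flag is named VM_WRITE), and mlock_needs_write_fault() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values defined locally for illustration. */
#define VM_WRITE  0x2UL
#define VM_SHARED 0x8UL

/*
 * Break COW (write fault) only for writable, private mappings.
 * Shared mappings are read-faulted so mlock does not dirty their pages;
 * read-only mappings cannot be write-faulted at all.
 */
static bool mlock_needs_write_fault(unsigned long vm_flags)
{
    return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
}
```

Masking both bits and comparing against VM_WRITE alone covers all four combinations in one expression: only "writable and not shared" selects the write fault.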
     
  • Reorganize the code so that dirty pages are handled closer to the place
    that makes them dirty (handling write fault into shared, writable VMAs).
    No behavior changes.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • mlocking a shared, writable vma currently causes the corresponding pages
    to be marked as dirty and queued for writeback. This seems rather
    unnecessary given that the pages are not being actually modified during
    mlock. It is understood that for non-shared mappings (file or anon) we
    want to use a write fault in order to break COW, but there is just no such
    need for shared mappings.

    The first two patches in this series do not introduce any behavior change.
    The intent there is to make it obvious that dirtying file pages is only
    done in the (writable, shared) case. I think this clarifies the code, but
    I wouldn't mind dropping these two patches if there is no consensus about
    them.

    The last patch is where we actually avoid dirtying shared mappings during
    mlock. Note that as a side effect of this, we won't call page_mkwrite()
    for the mappings that define it, and won't be pre-allocating data blocks
    at the FS level if the mapped file was sparsely allocated. My
    understanding is that mlock does not need to provide such guarantee, as
    evidenced by the fact that it never did for the filesystems that don't
    define page_mkwrite() - including some common ones like ext3. However, I
    would like to gather feedback on this from filesystem people as a
    precaution. If this turns out to be a showstopper, maybe block
    preallocation can be added back on using a different interface.

    Large shared mlocks are getting significantly (>2x) faster in my tests, as
    the disk can be fully used for reading the file instead of having to share
    between this and writeback.

    This patch:

    Reorganize the code to remove the 'reuse' flag. No behavior changes.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Kosaki Motohiro
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Theodore Tso
    Cc: Michael Rubin
    Cc: Suleiman Souhlal
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Temporary IO failures, e.g. due to loss of both multipath paths, can
    permanently leave the PageError bit set on a page, resulting in msync or
    fsync returning -EIO over and over again, even if IO is now getting to the
    disk correctly.

    We already clear the AS_ENOSPC and AS_EIO bits in mapping->flags in the
    filemap_fdatawait_range function. Also clearing the PageError bit on the
    page allows subsequent msync or fsync calls on this file to return without
    an error, if the subsequent IO succeeds.

    Unfortunately data written out in the msync or fsync call that returned
    -EIO can still get lost, because the page dirty bit appears to not get
    restored on IO error. However, the alternative could be potentially all
    of memory filling up with uncleanable dirty pages, hanging the system, so
    there is no nice choice here...

    Signed-off-by: Rik van Riel
    Acked-by: Valerie Aurora
    Acked-by: Jeff Layton
    Cc: Theodore Ts'o
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternatives considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch was updated from the original patch based on feedback from
    David Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
    module_alloc(). Much of the code is duplicated and can be generalized in a
    globally accessible function, __vmalloc_node_range().

    __vmalloc_node() now calls into __vmalloc_node_range() with a range of
    [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.

    Each architecture may then use __vmalloc_node_range() directly to remove
    the duplication of code.
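
    The shape of this refactoring can be sketched with a toy first-fit
    allocator in userspace. Everything here is hypothetical: the model_*
    names, the address constants and the fixed-size area table only
    illustrate how a generic wrapper and an architecture-specific caller can
    share one range-based helper.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_AREAS     16
#define MODEL_MODULES_VADDR 0x1000UL    /* illustrative addresses only */
#define MODEL_MODULES_END   0x2000UL
#define MODEL_VMALLOC_START 0x4000UL
#define MODEL_VMALLOC_END   0x8000UL

struct model_area { unsigned long start, end; };    /* [start, end) */

struct model_vspace {
    struct model_area areas[MODEL_MAX_AREAS];
    size_t count;
};

/* Shared helper: first-fit allocation of "size" bytes inside [start, end). */
static unsigned long model_alloc_range(struct model_vspace *vs,
                                       unsigned long size,
                                       unsigned long start, unsigned long end)
{
    unsigned long addr = start;
    size_t i = 0;

    assert(vs != NULL && start < end);
    if (size == 0 || size > end - start || vs->count == MODEL_MAX_AREAS)
        return 0;

    while (i < vs->count) {
        const struct model_area *a = &vs->areas[i];

        if (addr > end - size)
            return 0;                   /* no room left in the range */
        if (addr < a->end && a->start < addr + size) {
            addr = a->end;              /* skip past the overlapping area */
            i = 0;                      /* and rescan from the beginning */
        } else {
            i++;
        }
    }
    if (addr > end - size)
        return 0;

    vs->areas[vs->count].start = addr;
    vs->areas[vs->count].end = addr + size;
    vs->count++;
    return addr;
}

/* Generic allocation: the same helper over the whole vmalloc range. */
static unsigned long model_alloc(struct model_vspace *vs, unsigned long size)
{
    return model_alloc_range(vs, size, MODEL_VMALLOC_START, MODEL_VMALLOC_END);
}

/* An architecture's module allocator: the same helper over its own range. */
static unsigned long model_module_alloc(struct model_vspace *vs,
                                        unsigned long size)
{
    return model_alloc_range(vs, size, MODEL_MODULES_VADDR, MODEL_MODULES_END);
}
```

    The point of the sketch is that both callers differ only in the range
    they pass, so the range-walking logic lives in one place.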

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
    formal and use the mask internally.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • get_vm_area_node() is unused in the kernel and can thus be removed.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With compaction being used instead of lumpy reclaim, the name lumpy_mode
    and associated variables are a bit misleading. Rename lumpy_mode to
    reclaim_mode which is a better fit. There is no functional change.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • try_to_compact_pages() is initially called to only migrate pages
    asynchronously and kswapd always compacts asynchronously. Both are being
    optimistic so it is important to complete the work as quickly as possible
    to minimise stalls.

    This patch alters the scanner when asynchronous to only consider
    MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls
    when allocating huge pages while not impairing allocation success rates as
    a full scan will be performed if necessary after direct reclaim.
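
    The filtering rule can be sketched as a predicate over pageblock types.
    This is a simplified userspace model: the enum values and the model_*
    helper names are hypothetical and only mirror the idea that an
    asynchronous scan skips every pageblock that is not movable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of pageblock migrate types. */
enum model_migratetype {
    MODEL_UNMOVABLE,
    MODEL_RECLAIMABLE,
    MODEL_MOVABLE
};

/*
 * Decide whether the migration scanner should look inside a pageblock.
 * A synchronous scan considers every block; an asynchronous scan only
 * considers movable blocks, where migration is most likely to succeed.
 */
static bool model_block_is_candidate(enum model_migratetype type, bool sync)
{
    return sync || type == MODEL_MOVABLE;
}

/* Count how many pageblocks a scan in the given mode would visit. */
static size_t model_count_candidates(const enum model_migratetype *blocks,
                                     size_t nr_blocks, bool sync)
{
    size_t visited = 0;

    assert(blocks != NULL || nr_blocks == 0);

    for (size_t i = 0; i < nr_blocks; i++) {
        if (model_block_is_candidate(blocks[i], sync))
            visited++;
    }
    return visited;
}
```

    The asynchronous scan does less work per attempt, and the later
    synchronous pass still covers the blocks it skipped.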

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial passes fail.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.
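
    The calling sequence can be sketched in userspace as below. This is a
    toy model, not the kernel implementation: the model_* names are
    hypothetical, clearing the under_writeback flag stands in for
    wait_on_page_writeback(), and direct reclaim is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page state used only by this sketch. */
struct model_page {
    bool under_writeback;
    bool migrated;
};

/*
 * Try to migrate every page. Pages under writeback are skipped when
 * sync is false and waited for when sync is true. Returns the number
 * of pages that could not be migrated.
 */
static size_t model_migrate_pages(struct model_page *pages, size_t nr_pages,
                                  bool sync)
{
    size_t failed = 0;

    assert(pages != NULL || nr_pages == 0);

    for (size_t i = 0; i < nr_pages; i++) {
        if (pages[i].migrated)
            continue;
        if (pages[i].under_writeback) {
            if (!sync) {
                failed++;           /* async: do not stall on writeback */
                continue;
            }
            pages[i].under_writeback = false;   /* sync: wait for it */
        }
        pages[i].migrated = true;
    }
    return failed;
}

/* Cheap asynchronous attempt first, synchronous attempt as the fallback. */
static bool model_compact_for_allocation(struct model_page *pages,
                                         size_t nr_pages)
{
    if (model_migrate_pages(pages, nr_pages, false) == 0)
        return true;

    /* Direct reclaim would run here in the real allocator slow path. */

    return model_migrate_pages(pages, nr_pages, true) == 0;
}
```

    A latency-sensitive caller in this model would stop after the first,
    asynchronous call and accept the partial result.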

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, called reclaim/compaction, for use when compaction is available.
    The basic operation is very simple - instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    then compaction is later run by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).
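
    The two-step idea can be illustrated with a toy zone made of page
    frames. This is a simplified sketch under stated assumptions: the
    model_* names are hypothetical, every used frame is treated as movable,
    and the caller picks the frames to reclaim, whereas the kernel selects
    them from the LRU lists.

```c
#include <assert.h>
#include <stddef.h>

enum model_frame { MODEL_FREE, MODEL_USED };

/* Step 1: reclaim individual (order-0) frames chosen by the caller. */
static void model_reclaim_order0(enum model_frame *zone, size_t nr_frames,
                                 const size_t *victims, size_t nr_victims)
{
    assert(zone != NULL && (victims != NULL || nr_victims == 0));

    for (size_t i = 0; i < nr_victims; i++) {
        if (victims[i] < nr_frames)
            zone[victims[i]] = MODEL_FREE;
    }
}

/*
 * Step 2: compact. A migration scanner walks up from the bottom looking
 * for used frames, a free scanner walks down from the top looking for
 * free frames, and used frames are moved upward until the scanners meet.
 */
static void model_compact(enum model_frame *zone, size_t nr_frames)
{
    size_t lo = 0, hi = nr_frames;

    assert(zone != NULL || nr_frames == 0);

    for (;;) {
        while (lo < hi && zone[lo] != MODEL_USED)
            lo++;
        while (hi > lo && zone[hi - 1] != MODEL_FREE)
            hi--;
        if (lo >= hi)
            break;
        zone[hi - 1] = MODEL_USED;      /* "migrate" the frame upward */
        zone[lo] = MODEL_FREE;
        lo++;
        hi--;
    }
}

/* Length of the longest run of contiguous free frames. */
static size_t model_longest_free_run(const enum model_frame *zone,
                                     size_t nr_frames)
{
    size_t best = 0, run = 0;

    for (size_t i = 0; i < nr_frames; i++) {
        run = (zone[i] == MODEL_FREE) ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}
```

    Reclaiming scattered frames leaves only short free runs; it is the
    compaction step that turns the same number of free frames into one
    contiguous run.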

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman