14 Jan, 2011

40 commits

  • This should work for both hugetlbfs and transparent hugepages.

    [akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
    Signed-off-by: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
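
    For reference, PageTransCompound(), mentioned in the fix-up note above, is a
    trivial predicate. Roughly (a sketch, not necessarily the exact upstream
    definition):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline int PageTransCompound(struct page *page)
        {
                /* true for hugetlbfs pages and transparent hugepages alike */
                return PageCompound(page);
        }
        #else
        static inline int PageTransCompound(struct page *page)
        {
                return 0;       /* compiled away when THP is disabled */
        }
        #endif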
     
  • Move the copy/clear_huge_page functions to common code to share between
    hugetlb.c and huge_memory.c.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
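
    A minimal sketch of the shared helper's shape (clear side only; the exact
    signature and checks in the common code may differ):

        static void clear_huge_page(struct page *page, unsigned long addr,
                                    unsigned int pages_per_huge_page)
        {
                int i;

                might_sleep();
                for (i = 0; i < pages_per_huge_page; i++) {
                        cond_resched();
                        clear_user_highpage(page + i, addr + i * PAGE_SIZE);
                }
        }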
     
  • Add paging logic that splits the page before it is unmapped and added to
    swap, to ensure backwards compatibility with the legacy swap code.
    Eventually swap should natively page out hugepages to increase performance
    and decrease seeking and fragmentation of swap space. swapoff can just
    skip over huge pmds as they cannot be part of swap yet. In add_to_swap, be
    careful to split the page only after a valid swap entry has been obtained,
    so we don't split hugepages when swap is full.

    In theory we could split pages before isolating them during the lru scan,
    but for khugepaged to be safe, I'm relying on either mmap_sem write mode
    or PG_lock being taken, so split_huge_page has to run either with mmap_sem
    in read/write mode or with PG_lock taken. Calling it from isolate_lru_page
    would make the locking more complicated; in addition, split_huge_page
    would deadlock if called by __isolate_lru_page because it has to take the
    lru lock to add the tail pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
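
    The ordering described above, sketched against add_to_swap() (simplified;
    the real function also handles the swap cache insertion and its error
    paths):

        /* inside add_to_swap(), after a swap entry was successfully allocated: */
        if (unlikely(PageTransHuge(page)))
                if (unlikely(split_huge_page(page))) {
                        swapcache_free(entry, NULL);    /* give the entry back */
                        return 0;                       /* keep the hugepage intact */
                }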
     
  • Add split_huge_page_pmd() compat code. Each of these call sites would need
    to be expanded into hundreds of lines of complex code without a fully
    reliable split_huge_page_pmd() design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
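
    The compat helper lets callers stay one-liners; a sketch of its shape (the
    upstream version may be a macro rather than an inline function):

        static inline void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
        {
                if (unlikely(pmd_trans_huge(*pmd)))
                        __split_huge_page_pmd(mm, pmd);
                /* a no-op for regular pmds, so pte-mangling callers only need
                   to add "split_huge_page_pmd(mm, pmd);" before proceeding */
        }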
     
  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a failure path. A guarantee of success is a fundamental property
    of split_huge_page: it avoids decreasing swapping reliability and avoids
    adding -ENOMEM failure paths that would otherwise force the
    hugepage-unaware VM code to learn how to roll back in the middle of its
    pte mangling operations (if anything, we want that code to learn to handle
    pmd_trans_huge natively rather than to become capable of rollback). When
    split_huge_page runs, a pte page is needed for the split to succeed, to
    map the newly split regular pages with regular ptes. This way all existing
    VM code remains backwards compatible by just adding a split_huge_page*
    one-liner. The memory waste of those preallocated ptes is negligible, so
    it is worth it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
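
    A sketch of the idea, using hypothetical deposit/withdraw helper names
    rather than the exact upstream ones (haddr and entry stand for the fault
    address and the huge pmd value in the fault path):

        /* hugepage fault path: allocate the spare pte page up front,
           while returning -ENOMEM is still easy */
        pgtable_t pgtable = pte_alloc_one(mm, haddr);
        if (unlikely(!pgtable))
                return VM_FAULT_OOM;

        spin_lock(&mm->page_table_lock);
        set_pmd_at(mm, haddr, pmd, entry);
        pgtable_deposit(mm, pgtable);           /* hypothetical: stash it in the mm */
        spin_unlock(&mm->page_table_lock);

        /* split_huge_page path: the pte page is guaranteed to be there */
        pgtable = pgtable_withdraw(mm);         /* hypothetical */
        pmd_populate(mm, pmd, pgtable);         /* no -ENOMEM path needed */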
     
  • split_huge_page must transform a compound page to a regular page and needs
    ClearPageCompound.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add mmu notifier helpers to handle pmd huge operations.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • pte alloc routines must wait for split_huge_page if the pmd is not present
    and not null (i.e. pmd_trans_splitting). The additional branches are
    optimized away at compile time by pmd_trans_splitting if the config option
    is off. However, we must pass the vma down in order to know which anon_vma
    lock to wait on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
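
    A sketch of the check the pte alloc path gains (simplified; the real wait
    is implemented by taking and releasing the anon_vma lock that
    split_huge_page holds for the duration of the split):

        /* in __pte_alloc() and friends, before populating the pte level: */
        if (unlikely(pmd_trans_splitting(*pmd))) {
                /* split_huge_page is rewriting this pmd: wait for it to
                   finish, then let the caller retry against a regular pmd */
                wait_split_huge_page(vma->anon_vma, pmd);
                return 0;
        }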
     
  • Force gup_fast to take the slow path and block if the pmd is splitting,
    not only if it's none.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
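
    In the x86 gup_fast pmd walk this amounts to treating a splitting pmd like
    a missing one (a sketch):

        /* gup_pmd_range(): bail out for none *and* splitting pmds */
        if (pmd_none(pmd) || pmd_trans_splitting(pmd))
                return 0;       /* caller falls back to the sleeping
                                   get_user_pages() slow path */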
     
  • Add the needed pmd mangling functions, symmetric with their pte
    counterparts. pmdp_splitting_flush() is the only new addition among the
    pmd_ methods and it's needed to serialize the VM against split_huge_page.
    It atomically sets the splitting bit, much like pmdp_clear_flush_young
    atomically clears the accessed bit. pmdp_splitting_flush() also flushes
    the tlb to make it effective against gup_fast; the flush isn't strictly
    required for the page tables themselves, but it is the simplest operation
    we can invoke to serialize pmdp_splitting_flush() against gup_fast.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
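
    A sketch of what the x86 helper does (bit and macro names taken from the
    neighbouring commits in this series; details may differ):

        void pmdp_splitting_flush(struct vm_area_struct *vma,
                                  unsigned long address, pmd_t *pmdp)
        {
                int set;

                VM_BUG_ON(address & ~HPAGE_PMD_MASK);
                set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
                                        (unsigned long *)pmdp);
                if (set) {
                        /* the flush only serializes against gup_fast */
                        flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
                }
        }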
     
  • Some of these are needed to build, but are not actually used, on archs
    that don't support transparent hugepages. Others, like pmdp_clear_flush,
    are used by x86 too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • These return 0 at compile time when the config option is disabled, to
    allow gcc to eliminate the transparent hugepage function calls at compile
    time without additional #ifdefs (only the declarations of those functions
    have to be visible to gcc; they won't be required at link time and
    huge_memory.o need not be built at all).

    _PAGE_BIT_UNUSED1 is never used for the pmd, only for the pte.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
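
    The pattern being described, roughly (a sketch; the real definitions live
    behind the config option, the stubs are what gcc sees when it is off):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        static inline int pmd_trans_splitting(pmd_t pmd)
        {
                return pmd_val(pmd) & _PAGE_SPLITTING;
        }

        static inline int pmd_trans_huge(pmd_t pmd)
        {
                return pmd_val(pmd) & _PAGE_PSE;
        }
        #else
        /* constant 0 lets gcc discard every THP-only branch and call, so
           huge_memory.o never has to be built or linked */
        static inline int pmd_trans_splitting(pmd_t pmd) { return 0; }
        static inline int pmd_trans_huge(pmd_t pmd)      { return 0; }
        #endif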
     
  • Add config option.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a reminder in destroy_compound_page() that __split_huge_page_refcount
    is heavily dependent on its internal behavior.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • huge_memory.c needs it too when it falls back to copying hugepages into
    regular fragmented pages if hugepage allocation fails during COW.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add paravirt ops for pmd_update/pmd_update_defer/pmd_set_at. Not all might
    be necessary (vmware needs pmd_update, Xen needs set_pmd_at, nobody needs
    pmd_update_defer), but this keeps full symmetry with the pte paravirt ops,
    which looks cleaner and simpler from a common code POV.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Used by both the paravirt and non-paravirt set_pmd_at.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
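
    The non-paravirt helper is a one-liner; a sketch:

        static inline void native_set_pmd_at(struct mm_struct *mm,
                                             unsigned long addr,
                                             pmd_t *pmdp, pmd_t pmd)
        {
                native_set_pmd(pmdp, pmd);      /* mm and addr only matter
                                                   for the paravirt hooks */
        }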
     
  • Clear the compound mapping for anonymous compound pages, as already
    happens for regular anonymous pages. But crash if a mapping is set for any
    tail page; the PageAnon check is also meaningless for tail pages. That
    check only makes sense for the head page; for tail pages it can only hide
    bugs, and we definitely don't want to hide bugs.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The futex code is smarter than most other gup_fast O_DIRECT code and knows
    about the compound internals. However, now doing a put_page(head_page)
    will not release the pin on the tail page taken by gup-fast, leading to
    all sorts of refcounting bugchecks. Getting a stable head_page is a little
    tricky.

    The assignment page_head = page is there because, if this is not a tail
    page, the page is also its own head. Only if this is a tail page is
    compound_head called; otherwise it's guaranteed unnecessary. And if it is
    a tail page, compound_head has to run atomically inside the irq-disabled
    section of __get_user_pages_fast before returning; otherwise ->first_page
    won't be a stable pointer.

    Disabling irqs before __get_user_pages_fast and re-enabling them after
    running compound_head is needed because, if __get_user_pages_fast returns
    1, it means the huge pmd is established and cannot go away from under us.
    pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
    wait for local_irq_enable before the IPI delivery can return. This means
    __split_huge_page_refcount can't be running from under us, and in turn,
    when we run compound_head(page) we're not reading a dangling pointer from
    tailpage->first_page. Then, once we have a stable head page, we are always
    safe to call compound_lock, and after taking the compound lock on the head
    page we can finally re-check whether the page returned by gup-fast is
    still a tail page; if so, we're set and we didn't need to split the
    hugepage in order to take a futex on it.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
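
    A condensed sketch of the pattern described above (not the exact futex
    code; the retry loop and error handling are omitted):

        page_head = page;
        if (unlikely(PageTail(page))) {
                put_page(page);
                /* irqs off: __split_huge_page_splitting() cannot complete its
                   IPI-based pmdp_splitting_flush while we run, so
                   page->first_page stays a stable pointer */
                local_irq_disable();
                if (__get_user_pages_fast(address, 1, 1, &page) == 1) {
                        page_head = compound_head(page);
                        local_irq_enable();
                        compound_lock(page_head);
                        /* re-check PageTail(page) here: if it is still a tail
                           page we can take the futex without splitting */
                } else {
                        local_irq_enable();
                        /* fall back to the regular gup slow path */
                }
        }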
     
  • After releasing the compound_lock, split_huge_page can still run and
    release the page before put_page_testzero runs.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Alter compound get_page/put_page to keep references on subpages too, in
    order to allow __split_huge_page_refcount to split a hugepage even while
    subpages have been pinned by one of the get_user_pages() variants.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
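
    Roughly, the get_page() side becomes (a sketch; the put_page()/
    put_compound_page() side is the more involved half and is omitted):

        static inline void get_page(struct page *page)
        {
                atomic_inc(&page->_count);
                if (unlikely(PageTail(page))) {
                        /* pinning a tail page also pins the head, so
                           __split_huge_page_refcount can hand the gup
                           reference over to the tail page when it splits */
                        VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
                        atomic_inc(&page->first_page->_count);
                }
        }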
     
  • Add a new compound_lock() needed to serialize put_page against
    __split_huge_page_refcount().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
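
    The lock is a bit spinlock in page->flags; a sketch:

        static inline void compound_lock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_lock(PG_compound_lock, &page->flags);
        #endif
        }

        static inline void compound_unlock(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                bit_spin_unlock(PG_compound_lock, &page->flags);
        #endif
        }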
     
  • Define MADV_HUGEPAGE.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
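
    From userspace the new hint is used like any other madvise flag; a minimal
    example (assumes kernel and libc headers that already know MADV_HUGEPAGE):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t length = 2 * 1024 * 1024;        /* one 2 MiB hugepage */
                void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (addr == MAP_FAILED)
                        return 1;

                /* hint that this range should be backed by transparent
                   hugepages when the kernel can manage it */
                if (madvise(addr, length, MADV_HUGEPAGE) != 0)
                        perror("madvise(MADV_HUGEPAGE)");

                munmap(addr, length);
                return 0;
        }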
     
  • Add Documentation/vm/transhuge.txt.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • page_count shows the count of the head page, but the actual check is done
    on the tail page, so show what is really being checked.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When a swapcache page is replaced by a ksm page, it's best to free that
    swap immediately.

    Reported-by: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I think determine_dirtyable_memory() is a rather costly function since it
    needs many atomic reads to gather zone/global page state. But when we use
    vm_dirty_bytes and dirty_background_bytes, we don't need that costly
    calculation.

    This patch eliminates such unnecessary overhead.

    NOTE: the newly added if condition might add overhead in the normal path,
    but it should be _really_ small because we need to access both
    vm_dirty_bytes and dirty_background_bytes anyway, so they are likely to be
    cache-hot.

    [akpm@linux-foundation.org: fix used-uninitialised warning]
    Signed-off-by: Minchan Kim
    Cc: Wu Fengguang
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
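
    The shape of the optimization in the dirty-limit calculation (a sketch;
    variable names may differ from the actual patch):

        unsigned long dirty, background;
        unsigned long available_memory = 0;

        /* only pay for the expensive global/zone page-state reads when at
           least one limit is ratio-based rather than byte-based */
        if (!vm_dirty_bytes || !dirty_background_bytes)
                available_memory = determine_dirtyable_memory();

        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
        else
                dirty = (vm_dirty_ratio * available_memory) / 100;

        if (dirty_background_bytes)
                background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
        else
                background = (dirty_background_ratio * available_memory) / 100;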
     
  • When the numa_zonelist_order parameter is set to "node" or "zone" on the
    command line, it still shows up as "default" in sysctl. That's because the
    early_param parsing function changes only the user_zonelist_order
    variable. Fix this by copying the user-provided string to
    numa_zonelist_order if it was successfully parsed.

    Signed-off-by: Volodymyr G Lukiianyk
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Volodymyr G. Lukiianyk
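
    A sketch of the fix in the early_param handler (simplified):

        static __init int setup_numa_zonelist_order(char *s)
        {
                int ret;

                if (!s)
                        return 0;

                ret = __parse_numa_zonelist_order(s);
                if (ret == 0)
                        /* keep the sysctl string in sync with what was parsed */
                        strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);

                return ret;
        }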
     
  • When kswapd is woken up for a high-order allocation, it takes into account
    the highest zone usable by the caller (the classzone idx). During
    allocation, this index is used to select the lowmem_reserve[] that should
    be applied to the watermark calculation in zone_watermark_ok().

    When balancing a node, kswapd considers the highest unbalanced zone to be
    the classzone index. This will always be at least the caller's
    classzone_idx and can be higher. However, sleeping_prematurely() always
    considers the lowest zone (e.g. ZONE_DMA) to be the classzone index. This
    means that sleeping_prematurely() can consider a zone to be balanced that
    is unusable by the allocation request that originally woke kswapd. This
    patch changes sleeping_prematurely() to use a classzone_idx matching the
    value it used in balance_pgdat().

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Eric B Munson
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
    be balanced but sleeping_prematurely does not. This can force kswapd to
    stay awake longer than it should. This patch fixes it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Eric B Munson
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
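
    The change amounts to skipping such zones in the sleep check, mirroring
    what balance_pgdat() already does (a sketch):

        /* sleeping_prematurely(): */
        for (i = 0; i <= classzone_idx; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                /* balance_pgdat() treats these zones as balanced, so the
                   sleep check should not keep kswapd awake for them */
                if (zone->all_unreclaimable)
                        continue;

                if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
                                       classzone_idx, 0))
                        return true;    /* still unbalanced: stay awake */
        }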
     
  • When kswapd wakes up, it reads its order and classzone from pgdat and
    calls balance_pgdat. While it is awake, it potentially reclaims at a high
    order and a low classzone index. This might have been a one-off that is
    not required by subsequent callers. However, because the pgdat values were
    not reset, they remain artificially high while balance_pgdat() is running,
    and kswapd may enter a second unnecessary reclaim cycle. Reset the pgdat
    order and classzone index after reading them.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
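
    A sketch of the change in the kswapd() wakeup loop:

        order = pgdat->kswapd_max_order;
        classzone_idx = pgdat->classzone_idx;

        /* reset immediately, so a later wakeup does not re-use stale,
           artificially high values for a reclaim pass nobody asked for */
        pgdat->kswapd_max_order = 0;
        pgdat->classzone_idx = MAX_NR_ZONES - 1;

        /* then balance_pgdat() runs with the snapshotted values */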
     
  • Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
    there was a race pushing a zone below its watermark. If the race
    happened, it stays awake. However, balance_pgdat() can decide to reclaim
    at order-0 if it decides that high-order reclaim is not working as
    expected. This information is not passed back to sleeping_prematurely().
    The impact is that kswapd remains awake reclaiming pages long after it
    should have gone to sleep. This patch passes the adjusted order to
    sleeping_prematurely and uses the same logic as balance_pgdat to decide if
    it's ok to go to sleep.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When reclaiming for high orders, kswapd is responsible for balancing a
    node, but it should not reclaim excessively. It avoids excessive reclaim
    by considering the node balanced once any zone in it is balanced. In cases
    where zone sizes are imbalanced (e.g. ZONE_DMA alongside ZONE_DMA32 and
    ZONE_NORMAL), kswapd can go to sleep prematurely because just one small
    zone was balanced.

    This alters the sleep logic of kswapd slightly. It counts the number of
    pages that make up the balanced zones. If the total number of balanced
    pages is more than a quarter of the node, kswapd will go back to sleep.
    This should keep a node balanced without reclaiming an excessive number of
    pages.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
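
    The sleep test boils down to a helper along these lines (a sketch):

        static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
                                   int classzone_idx)
        {
                unsigned long present_pages = 0;
                int i;

                for (i = 0; i <= classzone_idx; i++)
                        present_pages += pgdat->node_zones[i].present_pages;

                /* sleep once at least 25% of the node's pages sit in zones
                   whose high watermark is met for the requested order */
                return balanced_pages > (present_pages >> 2);
        }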
     
  • Simon Kirby reported the following problem:

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is too aggressive about reclaiming order-3 pages
    requested by SLUB.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run, and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say whether his problem is fixed or not, and these figures are to
    show the impact to the ordinary cases. Finally, the "vmscan" figures are
    taken from /proc/vmstat instead of the tracepoints. There is less
    information, but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
    traceonly kanyzone
    Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
    Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
    Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
    Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
    Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
    Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
    Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 16.58 17.4
    Total Elapsed Time (seconds) 218.48 222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 0 4
    Direct reclaim pages scanned 0 203
    Direct reclaim pages reclaimed 0 184
    Kswapd pages scanned 326631 322018
    Kswapd pages reclaimed 312632 309784
    Kswapd low wmark quickly 1 4
    Kswapd high wmark quickly 122 475
    Kswapd skip congestion_wait 1 0
    Pages activated 700040 705317
    Pages deactivated 212113 203922
    Pages written 9875 6363

    Total pages scanned 326631 322221
    Total pages reclaimed 312632 309968
    %age total pages scanned/reclaimed 95.71% 96.20%
    %age total pages scanned/written 3.02% 1.97%

    proc vmstat: Faults
    Major Faults 300 254
    Minor Faults 645183 660284
    Page ins 493588 486704
    Page outs 4960088 4986704
    Swap ins 1230 661
    Swap outs 9869 6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages as it's less
    aggressive, and overall fewer pages were scanned and reclaimed. Swap
    in/out is particularly reduced, again reflecting kswapd throwing out fewer
    pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
    traceonly kanyzone
    User/Sys Time Running Test (seconds) 29.29 28.87
    Total Elapsed Time (seconds) 492.18 488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 2128 1460
    Direct reclaim pages scanned 2284822 1496067
    Direct reclaim pages reclaimed 148919 110937
    Kswapd pages scanned 15450014 16202876
    Kswapd pages reclaimed 8503697 8537897
    Kswapd low wmark quickly 3100 3397
    Kswapd high wmark quickly 1860 7243
    Kswapd skip congestion_wait 708 801
    Pages activated 9635 9573
    Pages deactivated 1432 1271
    Pages written 223 1130

    Total pages scanned 17734836 17698943
    Total pages reclaimed 8652616 8648834
    %age total pages scanned/reclaimed 48.79% 48.87%
    %age total pages scanned/written 0.00% 0.01%

    proc vmstat: Faults
    Major Faults 165 221
    Minor Faults 9655785 9656506
    Page ins 3880 7228
    Page outs 37692940 37480076
    Swap ins 0 69
    Swap outs 19 15

    Again, fewer pages are scanned and reclaimed, as expected, and this time
    the test completed faster. Note that kswapd is hitting its watermarks
    faster (low and high wmark quickly), which I expect is due to kswapd
    reclaiming fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
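
    Conceptually, the end of balance_pgdat() gains a check along these lines
    (any_suitable_zone_balanced is a hypothetical name; the real code tracks
    the balanced zones directly):

        /* after a full reclaim pass over the node's zones: */
        if (order && any_suitable_zone_balanced(pgdat, order, classzone_idx)) {
                /* the high-order wakeup can already be satisfied; drop back
                   to order-0 so the remaining work only ensures the order-0
                   watermarks on every zone before kswapd sleeps */
                order = sc.order = 0;
        }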
     
  • Running the annotated branch profiler on a box doing average work
    (firefox, evolution, xchat, distcc farm), the likely() used in
    grab_cache_page_write_begin() was incorrect most of the time:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206

    Adding a trace_printk() and running the function tracer limited to
    just this function I can see:

    gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
    gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • page_mapping() has an unlikely() that the mapping has PAGE_MAPPING_ANON
    set. But running the annotated branch profiler on a normal desktop system
    doing various tasks (xchat, evolution, firefox, distcc), it is not really
    that unlikely that the mapping here will have the PAGE_MAPPING_ANON flag
    set:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    35935762 1270265395 97 page_mapping mm.h 659
    1306198001 143659 0 page_mapping mm.h 657
    203131478 121586 0 page_mapping mm.h 657
    5415491 1116 0 page_mapping mm.h 657
    74899487 1116 0 page_mapping mm.h 657
    203132845 224 0 page_mapping mm.h 659
    5415464 27 0 page_mapping mm.h 659
    13552 0 0 page_mapping mm.h 657
    13552 0 0 page_mapping mm.h 659
    242630 0 0 page_mapping mm.h 657
    242630 0 0 page_mapping mm.h 659
    74899487 0 0 page_mapping mm.h 659

    The page_mapping() is a static inline, which is why it shows up multiple
    times.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, i.e. about 39% of the hits were incorrect.
    With this much error, it's best to simply remove the unlikely and let the
    compiler and branch prediction figure this out.

    Signed-off-by: Steven Rostedt
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
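
    After the change, page_mapping() keeps only the hints that the profile
    supports; roughly (a sketch of its shape):

        static inline struct address_space *page_mapping(struct page *page)
        {
                struct address_space *mapping = page->mapping;

                VM_BUG_ON(PageSlab(page));
                if (unlikely(PageSwapCache(page)))
                        mapping = &swapper_space;
                /* no unlikely() on the anon check: the profile shows the hint
                   is wrong for roughly 39% of the calls on a desktop load */
                else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
                        mapping = NULL;
                return mapping;
        }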
     
  • mapping_unevictable() has a likely() around its mapping parameter. This
    mapping parameter comes from page_mapping(), which has an unlikely() that
    the page will be set as PAGE_MAPPING_ANON, and if so it will return NULL.
    One would think that this unlikely() means that the mapping returned by
    page_mapping() would not be NULL, but where page_mapping() is used just
    above mapping_unevictable(), that unlikely() is incorrect most of the
    time. This means that the "likely(mapping)" in mapping_unevictable() is
    incorrect most of the time.

    Running the annotated branch profiler on my main box which runs firefox,
    evolution, xchat and is part of my distcc farm, I had this:

    correct incorrect % Function File Line
    ------- --------- - -------- ---- ----
    12872836 1269443893 98 mapping_unevictable pagemap.h 51
    35935762 1270265395 97 page_mapping mm.h 659
    1306198001 143659 0 page_mapping mm.h 657
    203131478 121586 0 page_mapping mm.h 657
    5415491 1116 0 page_mapping mm.h 657
    74899487 1116 0 page_mapping mm.h 657
    203132845 224 0 page_mapping mm.h 659
    5415464 27 0 page_mapping mm.h 659
    13552 0 0 page_mapping mm.h 657
    13552 0 0 page_mapping mm.h 659
    242630 0 0 page_mapping mm.h 657
    242630 0 0 page_mapping mm.h 659
    74899487 0 0 page_mapping mm.h 659

    The page_mapping() is a static inline, which is why it shows up multiple
    times. The mapping_unevictable() is also a static inline but seems to be
    used only once in my setup.

    The unlikely in page_mapping() was correct a total of 1909540379 times and
    incorrect 1270533123 times, i.e. about 39% of the hits were incorrect.
    Perhaps this is enough to remove the unlikely from page_mapping() as well.

    Signed-off-by: Steven Rostedt
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • IS_ERR() already implies unlikely(), so it can be omitted here.

    Signed-off-by: Tobias Klauser
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • Today, the tasklist_lock in migrate_pages doesn't protect anything;
    rcu_read_lock() provides enough protection for the pid hash walk.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
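
    The locking change amounts to this in the task lookup (a sketch of the
    sys_migrate_pages path, error handling simplified):

        /* the pid hash walk is RCU-safe, tasklist_lock buys nothing here */
        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        mm = get_task_mm(task);
        rcu_read_unlock();

        if (!mm)
                return -EINVAL;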