20 May, 2016

1 commit

  • The page allocator iterates through a zonelist for zones that match the
    addressing limitations and nodemask of the caller, but many allocations
    will not be restricted. Despite this, there is always function call
    overhead, which builds up.

    This patch inlines the optimistic basic case and only calls the iterator
    function for the complex case. A hindrance was the fact that
    cpuset_current_mems_allowed is used in the fastpath as the allowed
    nodemask even though all nodes are allowed on most systems. The patch
    handles this by only considering cpuset_current_mems_allowed if a cpuset
    exists. As well as being faster in the fast-path, this removes some
    junk in the slowpath.

    The performance difference on a page allocator microbenchmark is:

                                   4.6.0-rc2          4.6.0-rc2
                            statinline-v1r20      optiter-v1r20
    Min alloc-odr0-1        412.00 (  0.00%)   382.00 (  7.28%)
    Min alloc-odr0-2        301.00 (  0.00%)   282.00 (  6.31%)
    Min alloc-odr0-4        247.00 (  0.00%)   233.00 (  5.67%)
    Min alloc-odr0-8        215.00 (  0.00%)   203.00 (  5.58%)
    Min alloc-odr0-16       199.00 (  0.00%)   188.00 (  5.53%)
    Min alloc-odr0-32       191.00 (  0.00%)   182.00 (  4.71%)
    Min alloc-odr0-64       187.00 (  0.00%)   177.00 (  5.35%)
    Min alloc-odr0-128      185.00 (  0.00%)   175.00 (  5.41%)
    Min alloc-odr0-256      193.00 (  0.00%)   184.00 (  4.66%)
    Min alloc-odr0-512      207.00 (  0.00%)   197.00 (  4.83%)
    Min alloc-odr0-1024     213.00 (  0.00%)   203.00 (  4.69%)
    Min alloc-odr0-2048     220.00 (  0.00%)   209.00 (  5.00%)
    Min alloc-odr0-4096     226.00 (  0.00%)   214.00 (  5.31%)
    Min alloc-odr0-8192     229.00 (  0.00%)   218.00 (  4.80%)
    Min alloc-odr0-16384    229.00 (  0.00%)   219.00 (  4.37%)

    perf indicated that next_zones_zonelist disappeared from the profile and
    __next_zones_zonelist did not appear. This is expected as the
    micro-benchmark would hit the inlined fast-path every time.
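
    Below is a minimal, self-contained C sketch of the shape of the change
    (simplified standalone types; not the upstream code). The wrapper stays
    inline, only filtered lookups drop into the out-of-line helper, and a
    caller with no cpuset would simply pass a NULL nodemask:

    struct zone { int node; };
    struct zoneref { struct zone *zone; int zone_idx; };
    typedef unsigned long nodemask_t;

    /* Out-of-line slow path: skips zones that fail the zone-index limit
     * or the caller's nodemask (body elided in this sketch). */
    struct zoneref *__next_zones_zonelist(struct zoneref *z,
                                          int highest_zoneidx,
                                          nodemask_t *nodes);

    /* Inline fast path: with no nodemask and a cursor that already
     * satisfies the zone-index limit, no function call happens at all. */
    static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
                                                      int highest_zoneidx,
                                                      nodemask_t *nodes)
    {
            if (!nodes && z->zone_idx <= highest_zoneidx)
                    return z;               /* optimistic common case */
            return __next_zones_zonelist(z, highest_zoneidx, nodes);
    }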

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Jan, 2016

1 commit


12 Feb, 2015

1 commit

  • next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
    via an extra parameter. Since the latter can be trivially obtained by
    dereferencing the former, the overhead of the extra parameter is
    unjustified.

    This patch thus removes the zone parameter from next_zones_zonelist().
    Both callers happen to be in the same header file, so it's simple to add
    the zoneref dereference inline. We save some bytes of code size.

    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
    function                       old     new   delta
    nr_free_zone_pages             129     115     -14
    __alloc_pages_nodemask        2300    2285     -15
    get_page_from_freelist        2652    2576     -76

    add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
    function                       old     new   delta
    try_to_compact_pages           569     579     +10
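
    A simplified, standalone sketch of the caller-side pattern (types
    reduced; the real helpers live in include/linux/mmzone.h):

    struct zone { int node; };
    struct zoneref { struct zone *zone; int zone_idx; };

    /* The zone is trivially recoverable from the returned zoneref ... */
    static inline struct zone *zonelist_zone(struct zoneref *zref)
    {
            return zref->zone;
    }

    /* ... so the extra output parameter can go away:
     *
     *   old:  zref = next_zones_zonelist(zref, highidx, nodes, &zone);
     *   new:  zref = next_zones_zonelist(zref, highidx, nodes);
     *         zone = zonelist_zone(zref);
     */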

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Cc: Minchan Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

09 Oct, 2013

2 commits

  • Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.
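
    A purely hypothetical packing sketch of cpu,pid tracking (field widths
    and helper names here are invented for illustration); the node needed
    by the mpol_misplaced decision can still be derived from the stored
    cpu, e.g. via cpu_to_node():

    #define LAST_CPU_BITS   8
    #define LAST_PID_BITS   8
    #define LAST_CPU_MASK   ((1u << LAST_CPU_BITS) - 1)
    #define LAST_PID_MASK   ((1u << LAST_PID_BITS) - 1)

    /* Pack the last faulting cpu and the low bits of its pid into one
     * value small enough for the spare page-flag bits. */
    static inline unsigned int make_cpupid(unsigned int cpu, unsigned int pid)
    {
            return ((cpu & LAST_CPU_MASK) << LAST_PID_BITS) |
                    (pid & LAST_PID_MASK);
    }

    static inline unsigned int cpupid_to_cpu(unsigned int cpupid)
    {
            return (cpupid >> LAST_PID_BITS) & LAST_CPU_MASK;
    }

    static inline unsigned int cpupid_to_pid(unsigned int cpupid)
    {
            return cpupid & LAST_PID_MASK;
    }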

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.
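
    A simplified, standalone sketch of the classification and
    preferred-node selection described above (names and structure are
    illustrative, not the scheduler's):

    #define NR_NODES        4

    struct task_faults {
            unsigned long faults_private[NR_NODES];
            unsigned long faults_shared[NR_NODES];  /* tracked, not used for placement */
    };

    static void account_numa_fault(struct task_faults *tf, int node,
                                   unsigned int last_pid_bits,
                                   unsigned int this_pid_bits)
    {
            /* Same pid bits => approximately private (collisions possible). */
            if (last_pid_bits == this_pid_bits)
                    tf->faults_private[node]++;
            else
                    tf->faults_shared[node]++;
    }

    static int task_preferred_node(const struct task_faults *tf)
    {
            int node, best = 0;

            /* Preferred node: where most approximately-private faults landed. */
            for (node = 1; node < NR_NODES; node++)
                    if (tf->faults_private[node] > tf->faults_private[best])
                            best = node;
            return best;
    }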

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

24 Feb, 2013

2 commits

  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Andrew Morton pointed out that page_xchg_last_nid() and
    reset_page_last_nid() were "getting nuttily large" and asked that it be
    investigated.

    reset_page_last_nid() is on the page free path and it would be
    unfortunate to make that path more expensive than it needs to be. Due
    to the internal use of page_xchg_last_nid() it is already too expensive
    but fortunately, it should also be impossible for the page->flags to be
    updated in parallel when we call reset_page_last_nid(). Instead of
    uninlining the function, it uses a simpler implementation that assumes no
    parallel updates and should now be sufficiently short for inlining.

    page_xchg_last_nid() is called in paths that are already quite expensive
    (splitting huge page, fault handling, migration) and it is reasonable to
    uninline. There was not really a good place to place the function but
    mm/mmzone.c was the closest fit IMO.

    This patch saved 128 bytes of text in the vmlinux file for the kernel
    configuration I used for testing automatic NUMA balancing.
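
    A simplified sketch of the free-path reset under that
    no-parallel-updates assumption (field layout invented for
    illustration); a plain read-modify-write replaces the cmpxchg-style
    update that page_xchg_last_nid() has to use:

    #define LAST_NID_SHIFT  8
    #define LAST_NID_MASK   0xffUL

    struct page { unsigned long flags; };

    static inline void reset_page_last_nid(struct page *page)
    {
            /* Safe only because the page is being freed and nothing else
             * can update page->flags concurrently on this path. */
            page->flags &= ~(LAST_NID_MASK << LAST_NID_SHIFT);
            page->flags |= LAST_NID_MASK << LAST_NID_SHIFT;  /* "no nid" */
    }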

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Nov, 2012

1 commit

  • When MEMCG is configured on (even when it's disabled by boot option),
    when adding or removing a page to/from its lru list, the zone pointer
    used for stats updates is nowadays taken from the struct lruvec. (On
    many configurations, calculating zone from page is slower.)

    But we have no code to update all the lruvecs (per zone, per memcg) when
    a memory node is hotadded. Here's an extract from the oops which
    results when running numactl to bind a program to a newly onlined node:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
    IP: __mod_zone_page_state+0x9/0x60
    Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
    Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
    Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
    ...

    The natural solution might be to use a memcg callback whenever memory is
    hotadded; but that solution has not been scoped out, and it happens that
    we do have an easy location at which to update lruvec->zone. The lruvec
    pointer is discovered either by mem_cgroup_zone_lruvec() or by
    mem_cgroup_page_lruvec(), and both of those do know the right zone.

    So check and set lruvec->zone in those; and remove the inadequate
    attempt to set lruvec->zone from lruvec_init(), which is called before
    NODE_DATA(node) has been allocated in such cases.
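
    A minimal sketch of the check-and-set idea (the helper name below is
    hypothetical; per the text above, the real check goes into
    mem_cgroup_zone_lruvec() and mem_cgroup_page_lruvec()):

    struct zone;

    struct lruvec {
            struct zone *zone;
            /* lru lists, stats ... */
    };

    static inline struct lruvec *lruvec_check_zone(struct lruvec *lruvec,
                                                   struct zone *zone)
    {
            /* The lookup path knows the right zone, so lazily repair a
             * lruvec that was initialised before the hot-added node's
             * pgdat existed. */
            if (lruvec->zone != zone)
                    lruvec->zone = zone;
            return lruvec;
    }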

    Ah, there was one exception. For no particularly good reason,
    mem_cgroup_force_empty_list() has its own code for deciding lruvec.
    Change it to use the standard mem_cgroup_zone_lruvec() and
    mem_cgroup_get_lru_size() too. In fact it was already safe against such
    an oops (the lru lists in danger could only be empty), but we're better
    proofed against future changes this way.

    I've marked this for stable (3.6) since we introduced the problem in 3.5
    (now closed to stable); but I have no idea if this is the only fix
    needed to get memory hotadd working with memcg in 3.6, and received no
    answer when I enquired twice before.

    Reported-by: Tang Chen
    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Wen Congyang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Aug, 2012

1 commit

  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

30 May, 2012

1 commit

  • This is the first stage of struct mem_cgroup_zone removal. Further
    patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

    If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().
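
    A standalone sketch of what "just container_of()" means here
    (simplified structs, assuming the lruvec is embedded directly in the
    zone when memcg is compiled out):

    #include <stddef.h>

    struct lruvec {
            /* lru lists ... */
            int dummy;
    };

    struct zone {
            /* ... */
            struct lruvec lruvec;   /* embedded when memcg is compiled out */
            /* ... */
    };

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    static inline struct zone *lruvec_zone(struct lruvec *lruvec)
    {
            return container_of(lruvec, struct zone, lruvec);
    }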

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

31 Oct, 2011

1 commit


14 Jan, 2011

1 commit

  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta is above a
    threshold. On large CPU systems, the difference between the estimate and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    case where pages are allocated far below the min watermark potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
    large amount of CPU time on systems with many sockets due to cache line
    bouncing. This patch takes a different approach. For large machines
    where counter drift might be unsafe and while kswapd is awake, the per-cpu
    thresholds for the target pgdat are reduced to limit the level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was at least one recorded sleep and wake event.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.
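
    A rough sketch of the threshold clamp while kswapd is awake (function
    names and numbers below are illustrative, not the kernel's): the
    reduced threshold bounds the worst-case drift, threshold times the
    number of online CPUs, to a fraction of the watermark gap.

    /* Normal operation: a large batching threshold keeps counter updates
     * cheap. Under memory pressure the drift it allows is no longer safe. */
    static int calculate_normal_threshold(long zone_pages)
    {
            return zone_pages > (1L << 20) ? 125 : 32;      /* illustrative */
    }

    /* While kswapd is awake: clamp the threshold so that the worst-case
     * drift (threshold * nr_online_cpus) stays within the watermark gap. */
    static int calculate_pressure_threshold(long watermark_gap, int nr_online_cpus)
    {
            long threshold = watermark_gap / nr_online_cpus;

            if (threshold < 1)
                    threshold = 1;
            if (threshold > 125)
                    threshold = 125;
            return (int)threshold;
    }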

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report is on the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot(), zone_page_state().

    vanilla              11.6615%
    disable-threshold     0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Sep, 2010

1 commit

  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
    the number of pages that are really free in the buddy lists, the VM can
    allocate pages below the min watermark, at worst reducing the real number
    of free pages to zero. Even if
    the OOM killer kills some victim for freeing memory, it may not free
    memory if the exit path requires a new page resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
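
    A self-contained sketch of the snapshot idea (simplified types; the
    real helper walks the zone's per-cpu pagesets): fold the undrained
    per-cpu deltas into the global counter and clamp at zero, since the
    summation can race with concurrent updates.

    #include <stdatomic.h>

    #define NR_CPUS 8

    struct counter {
            atomic_long global;             /* periodically drained value */
            long percpu_diff[NR_CPUS];      /* undrained per-cpu deltas */
    };

    static long counter_snapshot(struct counter *c)
    {
            long x = atomic_load(&c->global);
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                    x += c->percpu_diff[cpu];

            return x < 0 ? 0 : x;
    }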

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     

18 May, 2009

1 commit

  • pfn_valid() is meant to be able to tell if a given PFN has valid memmap
    associated with it or not. In FLATMEM, it is expected that holes always
    have valid memmap as long as there are valid PFNs on either side of the hole.
    In SPARSEMEM, it is assumed that a valid section has a memmap for the
    entire section.

    However, ARM and maybe other embedded architectures in the future free
    memmap backing holes to save memory on the assumption the memmap is never
    used. The page_zone linkages are then broken even though pfn_valid()
    returns true. A walker of the full memmap must then do this additional
    check to ensure the memmap they are looking at is sane by making sure the
    zone and PFN linkages are still valid. This is expensive, but walkers of
    the full memmap are extremely rare.

    This was caught before for FLATMEM and hacked around but it hits again for
    SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
    are totally screwed. This looks like a hatchet job but the reality is that
    any clean solution would end up consuming all the memory saved by punching
    these unexpected holes in the memmap. For example, we tried marking the
    memmap within the section invalid but the section size exceeds the size of
    the hole in most cases so pfn_valid() starts returning false where valid
    memmap exists. Shrinking the size of the section would increase memory
    consumption offsetting the gains.

    This patch identifies when an architecture is punching unexpected holes
    in the memmap that the memory model cannot automatically detect and sets
    ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
    which is the sub-architecture this has been reported on, but it may expand
    later. When set, walkers of the full memmap must call memmap_valid_within()
    for each PFN, passing in what the walker expects the page and zone to be for
    that PFN. If it finds the linkages to be broken, it assumes the memmap is
    invalid for that PFN.
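
    A sketch of the per-PFN check a walker performs (simplified
    declarations; only the two linkage checks described above are shown):

    struct page;
    struct zone;

    unsigned long page_to_pfn(struct page *page);   /* linkage accessors,  */
    struct zone *page_zone(struct page *page);      /* declarations only   */

    static int memmap_valid_within(unsigned long pfn,
                                   struct page *page, struct zone *zone)
    {
            if (page_to_pfn(page) != pfn)
                    return 0;               /* PFN linkage broken */
            if (page_zone(page) != zone)
                    return 0;               /* zone linkage broken */
            return 1;                       /* memmap for this PFN looks sane */
    }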

    Signed-off-by: Mel Gorman
    Signed-off-by: Russell King

    Mel Gorman
     

14 Sep, 2008

1 commit

  • The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
    scanning zonelists to keep track of where in the zonelist it is. The
    zoneref that is returned corresponds to the next zone that is to be
    scanned, not the current one. It was intended to be treated as an opaque
    list.

    When the page allocator is scanning a zonelist, it marks elements in the
    zonelist corresponding to zones that are temporarily full. As the
    zonelist is being updated, it uses the cursor here:

        if (NUMA_BUILD)
                zlc_mark_zone_full(zonelist, z);

    This is intended to prevent rescanning in the near future but the zoneref
    cursor does not correspond to the zone that has been found to be full.
    This is an easy misunderstanding to make, so this patch corrects the
    problem by changing the zoneref cursor to be the current zone being
    scanned instead of the next one.
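
    A simplified sketch of the corrected cursor semantics (standalone
    types, body heavily reduced): the iterator returns the entry it
    actually found and the caller advances the cursor on the next
    iteration, so z always names the zone currently being scanned and can
    safely be handed to zlc_mark_zone_full().

    struct zone { int node; };
    struct zoneref { struct zone *zone; int zone_idx; };

    static struct zoneref *next_zones_zonelist(struct zoneref *z,
                                               int highest_zoneidx)
    {
            /* Return the matching entry itself rather than a cursor that
             * already points one past it; the iteration macro passes ++z
             * on the following call. */
            while (z->zone && z->zone_idx > highest_zoneidx)
                    z++;
            return z;
    }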

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: [2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Apr, 2008

1 commit

  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
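
    A simplified sketch of the allocator entry points after the change
    (types reduced; __alloc_pages_internal() below stands in for the
    shared implementation):

    #include <stddef.h>

    typedef unsigned long nodemask_t;
    typedef unsigned int gfp_t;
    struct page;
    struct zonelist;

    /* Shared implementation: the zonelist walk additionally skips zones
     * whose node is not set in the (optional) nodemask. */
    struct page *__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
                                        struct zonelist *zonelist,
                                        nodemask_t *nodemask);

    static inline struct page *__alloc_pages(gfp_t gfp_mask, unsigned int order,
                                             struct zonelist *zonelist)
    {
            /* No extra filtering: behaves as before. */
            return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
    }

    /* MPOL_BIND passes its nodemask here instead of building a custom
     * zonelist, so the walk follows the local node's distance ordering. */
    static inline struct page *__alloc_pages_nodemask(gfp_t gfp_mask,
                                                      unsigned int order,
                                                      struct zonelist *zonelist,
                                                      nodemask_t *nodemask)
    {
            return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
    }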

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Dec, 2006

1 commit


11 Jul, 2006

1 commit


01 Jul, 2006

1 commit


28 Mar, 2006

1 commit

  • Helper functions for for_each_online_pgdat/for_each_zone look too big to be
    inlined. The speed of these helpers is not very important; the inner loops
    tend to do more work than the iteration itself.

    This patch makes the helper functions out of line.

                   inline   out-of-line
    .text        005c0680      005bf6a0

    005c0680 - 005bf6a0 = FE0, i.e. about 4 Kbytes saved.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki