23 Mar, 2012

3 commits

  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     
  • Pull MCE changes from Ingo Molnar.

    * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
    x86/mce: Convert static array of pointers to per-cpu variables
    x86/mce: Replace hard coded hex constants with symbolic defines
    x86/mce: Recognise machine check bank signature for data path error
    x86/mce: Handle "action required" errors
    x86/mce: Add mechanism to safely save information in MCE handler
    x86/mce: Create helper function to save addr/misc when needed
    HWPOISON: Add code to handle "action required" errors.
    HWPOISON: Clean up memory_failure() vs. __memory_failure()

    Linus Torvalds
     
  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

37 commits

  • Currently we can't do task migration among memory cgroups without THP
    split, which means processes heavily using THP experience large overhead
    in task migration. This patch introduces the code for moving charge of
    THP and makes THP more valuable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • These macros will be used in a later patch, where all usages are expected
    to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
    detect unexpected usages, we convert the existing BUG() to BUILD_BUG().

    [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
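
    A minimal sketch of the build-time guard this describes, following the
    huge_mm.h convention (treat the exact definitions as illustrative rather
    than a quote of the patch):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        #define HPAGE_PMD_SHIFT PMD_SHIFT
        #define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)
        #define HPAGE_PMD_MASK  (~(HPAGE_PMD_SIZE - 1))
        #else
        /* any use that survives to compilation is a build error, not a runtime BUG() */
        #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
        #define HPAGE_PMD_SIZE  ({ BUILD_BUG(); 0; })
        #define HPAGE_PMD_MASK  ({ BUILD_BUG(); 0; })
        #endif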
     
  • - Replace lengthy function name is_target_pte_for_mc() with a shorter
    one in order to avoid ugly line breaks.

    - explicitly use MC_TARGET_* instead of simply using integers.

    Signed-off-by: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
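
    A sketch of the resulting shape (the shorter name shown here,
    get_mctgt_type(), is how it ended up in memcontrol.c; take the exact
    signature as illustrative):

        enum mc_target_type {
                MC_TARGET_NONE = 0,     /* nothing to move at this pte */
                MC_TARGET_PAGE,         /* move the charge of a page */
                MC_TARGET_SWAP,         /* move the charge of a swap entry */
        };

        /* formerly is_target_pte_for_mc() */
        static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
                                                  unsigned long addr,
                                                  pte_t ptent,
                                                  union mc_target *target);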
     
  • Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • In the following code:

        if (type == _MEM)
                thresholds = &memcg->thresholds;
        else if (type == _MEMSWAP)
                thresholds = &memcg->memsw_thresholds;
        else
                BUG();

        BUG_ON(!thresholds);

    the BUG_ON() is redundant: if neither branch assigned thresholds, the
    BUG() has already fired.

    Signed-off-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • A grammatical fix.

    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mem_cgroup_begin_update_page_stat() should be very fast because it's
    called very frequently. Right now it needs to look up the page_cgroup and
    its memcg, and that is slow.

    This patch adds a global variable that records "any memcg is moving or
    not". With it, the caller doesn't need to visit the page_cgroup and memcg.

    Here is a test result. A test program page-faults a MAP_SHARED file
    mapping so that each page's page_mapcount(page) > 1, frees the range with
    madvise(), and faults it in again. The program causes 26214400 page
    faults on a 1G file and shows the cost of
    mem_cgroup_begin_update_page_stat().

    Before this patch (profile around mem_cgroup_begin_update_page_stat()):

    [kamezawa@bluextal test]$ time ./mmap 1G

    real 0m21.765s
    user 0m5.999s
    sys 0m15.434s

    27.46% mmap mmap [.] reader
    21.15% mmap [kernel.kallsyms] [k] page_fault
    9.17% mmap [kernel.kallsyms] [k] filemap_fault
    2.96% mmap [kernel.kallsyms] [k] __do_fault
    2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat

    After this patch

    [root@bluextal test]# time ./mmap 1G

    real 0m21.373s
    user 0m6.113s
    sys 0m15.016s

    In the usual path, the calls to __mem_cgroup_begin_update_page_stat() go
    away.

    Note: we may be able to remove this optimization in the future if we can
    get a pointer to the memcg directly from struct page.

    [akpm@linux-foundation.org: don't return a void]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
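
    A sketch of the fast path this adds, assuming a global counter along the
    lines of memcg_moving (the identifier and wrapper shown here are
    illustrative, not a quote of the patch):

        static atomic_t memcg_moving __read_mostly;   /* "any memcg is moving" */

        static inline void mem_cgroup_begin_update_page_stat(struct page *page,
                                                             bool *locked,
                                                             unsigned long *flags)
        {
                if (mem_cgroup_disabled())
                        return;
                rcu_read_lock();
                *locked = false;
                /* only pay for the page_cgroup/memcg lookup while a move is in flight */
                if (unlikely(atomic_read(&memcg_moving)))
                        __mem_cgroup_begin_update_page_stat(page, locked, flags);
        }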
     
  • With the new lock scheme for updating memcg's page stat, we don't need a
    flag PCG_FILE_MAPPED which was duplicated information of page_mapped().

    [hughd@google.com: cosmetic fix]
    [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, per-memcg page stats are recorded in per-page_cgroup flags by
    duplicating the page's status into the flag. The reason is that memcg has
    a feature to move a page from one group to another, so there is a race
    between "move" and "page stat accounting".

    Under the current logic, assume CPU-A and CPU-B. CPU-A does "move" and
    CPU-B does "page stat accounting".

    When CPU-A goes first:

        CPU-A                             CPU-B
                                          update "struct page" info.
        move_lock_mem_cgroup(memcg)
        see pc->flags
        copy page stat to new group
        overwrite pc->mem_cgroup.
        move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)

    stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
    (CPU-A) doesn't see changes in "struct page" information.

    But it's costly to have the same information both in 'struct page' and
    'struct page_cgroup'. And, there is a potential problem.

    For example, assume we have PG_dirty accounting in memcg.
    PG_... is a flag for struct page.
    PCG_... is a flag for struct page_cgroup.
    (This is just an example. The same problem can be found in any
    kind of page stat accounting.)

        CPU-A                             CPU-B
        TestSet PG_dirty
        (delay)                           TestClear PG_dirty
                                          if (TestClear(PCG_dirty))
                                                  memcg->nr_dirty--
        if (TestSet(PCG_dirty))
                memcg->nr_dirty++

    Here, memcg->nr_dirty = +1, which is wrong. This race was reported by
    Greg Thelen. Right now only FILE_MAPPED is supported and, fortunately, it
    is serialized by the page table lock, so this is not a real bug _now_.

    Since this potential problem is caused by having duplicated information
    in struct page and struct page_cgroup, we might be able to fix it by
    using only the original 'struct page' information. But then we'd have a
    problem in "move account".

    Assume we use only PG_dirty:

        CPU-A                             CPU-B
        TestSet PG_dirty
        (delay)                           move_lock_mem_cgroup()
                                          if (PageDirty(page))
                                                  new_memcg->nr_dirty++
                                          pc->mem_cgroup = new_memcg;
                                          move_unlock_mem_cgroup()
        move_lock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty++

    The accounting information may be double-counted. This was the original
    reason to have the PCG_xxx flags, but it seems PCG_xxx has a problem of
    its own.

    I think we need a bigger lock, as in:

        move_lock_mem_cgroup(page)
        TestSetPageDirty(page)
        update page stats (without any checks)
        move_unlock_mem_cgroup(page)

    This fixes both problems and we don't have to duplicate the page flag
    into page_cgroup. Please note: move_lock_mem_cgroup() is held only when
    there is a possibility of an "account move" in the system, so in most
    paths the status update runs without atomic locks.

    This patch introduces mem_cgroup_begin_update_page_stat() and
    mem_cgroup_end_update_page_stat(); both should be called when modifying
    'struct page' information that memcg accounts, as:

        mem_cgroup_begin_update_page_stat()
        modify page information
        mem_cgroup_update_page_stat()
            => never checks any 'struct page' info, just updates counters
        mem_cgroup_end_update_page_stat()

    This patch adds overhead because begin_update_page_stat()/
    end_update_page_stat() must be called regardless of whether the accounted
    state actually changes. A following patch adds an easy optimization and
    reduces the cost.

    [akpm@linux-foundation.org: s/lock/locked/]
    [hughd@google.com: fix deadlock by avoiding stat lock when anon]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
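
    A hedged sketch of the caller pattern, roughly in the shape of the
    FILE_MAPPED accounting in rmap (argument names and the stat index are
    illustrative):

        void page_add_file_rmap(struct page *page)
        {
                bool locked;
                unsigned long flags;

                mem_cgroup_begin_update_page_stat(page, &locked, &flags);
                if (atomic_inc_and_test(&page->_mapcount)) {
                        __inc_zone_page_state(page, NR_FILE_MAPPED);
                        /* counter-only update; never re-reads 'struct page' state */
                        mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
                }
                mem_cgroup_end_update_page_stat(page, &locked, &flags);
        }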
     
  • PCG_MOVE_LOCK is used as a bit spinlock to avoid the race between
    overwriting pc->mem_cgroup and per-memcg page statistics accounting. The
    lock avoids the race, but the race is very rare because moving tasks
    between cgroups is not a usual job, so spending one bit per page on it is
    too costly.

    This patch changes the lock to a per-memcg spinlock and removes
    PCG_MOVE_LOCK.

    If a finer-grained lock is required, we'll be able to add some hashing,
    but I'd like to start from this.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
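
    A sketch of the replacement lock, assuming a spinlock embedded in struct
    mem_cgroup (field and helper names follow the wording above but are
    illustrative):

        struct mem_cgroup {
                /* ... */
                spinlock_t      move_lock;  /* replaces the per-page PCG_MOVE_LOCK bit */
        };

        static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
                                         unsigned long *flags)
        {
                spin_lock_irqsave(&memcg->move_lock, *flags);
        }

        static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
                                           unsigned long *flags)
        {
                spin_unlock_irqrestore(&memcg->move_lock, *flags);
        }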
     
  • In memcg, to avoid taking a lock with interrupts off on every page_cgroup
    access, a "flag + rcu_read_lock()" scheme is used. It works as follows:

        CPU-A                             CPU-B
                                          rcu_read_lock()
        set flag
                                          if (flag is set)
                                                  take heavy lock
                                          do job.
        synchronize_rcu()                 rcu_read_unlock()
        take heavy lock.

    In a recent discussion it was argued that using a per-cpu value for this
    flag just complicates the code, because 'set flag' is very rare.

    This patch changes the 'flag' implementation from per-cpu to atomic_t,
    which is much simpler.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Cc: "Paul E. McKenney"
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
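
    A sketch of the simplified flag, assuming it becomes an atomic counter in
    struct mem_cgroup (names are illustrative; the point is that the reader
    side shrinks to a single atomic_read() under rcu_read_lock()):

        struct mem_cgroup {
                /* ... */
                atomic_t        moving_account; /* was a per-cpu counter */
        };

        static void mem_cgroup_start_move(struct mem_cgroup *memcg)
        {
                atomic_inc(&memcg->moving_account);     /* "set flag" */
                synchronize_rcu();                      /* wait out readers that missed it */
        }

        static void mem_cgroup_end_move(struct mem_cgroup *memcg)
        {
                atomic_dec(&memcg->moving_account);
        }

        /* reader side, called under rcu_read_lock(): "is the flag set?" */
        static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
        {
                return atomic_read(&memcg->moving_account) > 0;
        }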
     
  • As described in the log, I guess the EXPORT was in preparation for dirty
    accounting. But _now_ we don't need to export this. Remove it for now.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • We record 'the page is cache' with the PCG_CACHE bit in page_cgroup.
    Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't
    include shmem.

    Considering callers, at charge/uncharge, the caller should know what the
    page is and we don't need to record it by using one bit per page.

    This patch removes the PCG_CACHE bit and makes the callers of
    mem_cgroup_charge_statistics() specify what the page is.

    About page migration: the mapping of the used page is not touched during
    migration (see page_remove_rmap), so we can rely on it and push the
    correct charge type down to __mem_cgroup_uncharge_common() from
    end_migration() for the unused page. The force flag was misleading; it
    was abused for skipping the needless page_mapped() / PageCgroupMigration()
    check, as we know the unused page is no longer mapped and cleared the
    migration flag just a few lines up. But doing the checks is no biggie and
    it's not worth adding another flag just to skip them.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hughd@google.com: fix PageAnon uncharging]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
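
    A sketch of the resulting interface: the caller says whether the page is
    anonymous instead of the statistics code reading a per-page flag (names
    and counters are illustrative):

        static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
                                                 bool anon, int nr_pages)
        {
                if (anon)
                        __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
                                       nr_pages);
                else
                        __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
                                       nr_pages);
                /* ... pgpgin/pgpgout events etc. ... */
        }

        /* charge/uncharge sites already know the answer: */
        /* mem_cgroup_charge_statistics(memcg, PageAnon(page), nr_pages); */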
     
  • Commit e94c8a9cbce1 ("memcg: make mem_cgroup_split_huge_fixup() more
    efficient") removed move_lock_page_cgroup(). So we do not have to check
    PageTransHuge in mem_cgroup_update_page_stat() and fallback into the
    locked accounting because both move_account() and thp split are done
    with compound_lock so they cannot race.

    The race between update vs. move is protected by mem_cgroup_stealed.

    PageTransHuge pages shouldn't appear in this code path currently because
    we are tracking only file pages at the moment but later we are planning
    to track also other pages (e.g. mlocked ones).

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remove redundant returns from ends of functions, and one blank line.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I never understood why we need a MEM_CGROUP_ZSTAT(mz, idx) macro to
    obscure the LRU counts. For easier searching? So call it lru_size
    rather than bare count (lru_length sounds better, but would be wrong,
    since each huge page raises lru_size hugely).

    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace mem and mem_cont stragglers in memcontrol.c by memcg.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Even when swapon() is not passed the SWAP_FLAG_DISCARD option,
    sys_swapon() will still perform a discard operation. This can cause
    problems if discard is slow or buggy.

    Reverse the order of the check so that a discard operation is performed
    only if the sys_swapon() caller is attempting to enable discard.

    Signed-off-by: Shaohua Li
    Reported-by: Holger Kiehl
    Tested-by: Holger Kiehl
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
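
    In rough terms, the reordered setup in sys_swapon() looks like this (flag
    and helper names are from swapfile.c of that era; treat it as a sketch):

        /* before: discard_swap() ran unconditionally, and its result only
         * decided whether SWP_DISCARDABLE was set */
        if ((swap_flags & SWAP_FLAG_DISCARD) && discard_swap(p) == 0)
                p->flags |= SWP_DISCARDABLE;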
     
  • Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
    RECLAIM_MODE_ASYNC. (The sync/async flag is used only in
    shrink_page_list() and does not affect shrink_active_list().)

    Currently shrink_active_list() sometimes works in lumpy-reclaim mode if
    RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
    shrink_inactive_list(). Meanwhile, in age_active_anon()
    sc->reclaim_mode is totally zero. So the current behavior is too
    complex and confusing, and this looks like a bug.

    In general, shrink_active_list() populates the inactive list for the
    next shrink_inactive_list(). A lumpy shrink_inactive_list() isolates
    pages around the chosen one from both the active and inactive lists.
    So there is no reason for lumpy isolation in shrink_active_list().

    See also: https://lkml.org/lkml/2012/3/15/583

    Signed-off-by: Konstantin Khlebnikov
    Proposed-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The comment above __insert_vm_struct seems to suggest that this function
    is also going to link the VMA with the anon_vma, but this is not true.
    This function only links the VMA to the mm->mm_rb tree and the mm->mmap
    linked list.

    [akpm@linux-foundation.org: improve comment layout and text]
    Signed-off-by: Kautuk Consul
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • find_zone_movable_pfns_for_nodes() does not use its argument.

    Signed-off-by: Kautuk Consul
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • add_from_early_node_map() is unused.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.

    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers. Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Truelove
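
    In rough terms, the alignment now done inside hugetlb_file_setup()
    amounts to the following, where hstate stands for the huge page size in
    use (a paraphrase of the description above, not the patch itself):

        /* a fixed mapping address must itself be huge-page aligned */
        if (addr & ~huge_page_mask(hstate))
                return ERR_PTR(-EINVAL);

        /* round the byte size up to a whole number of huge pages */
        size = ALIGN(size, huge_page_size(hstate));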
     
  • There's no difference between sync_mm_rss() and __sync_task_rss_stat(),
    so fold the latter into the former.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • sync_mm_rss() can only be used for current to avoid race conditions in
    iterating and clearing its per-task counters. Remove the task argument
    for it and its helper function, __sync_task_rss_stat(), to avoid thinking
    it can be used safely for anything other than current.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
    general quota handling code, and they don't much resemble its behaviour.
    Rather than being about maintaining limits on on-disk block usage by
    particular users, they are instead about maintaining limits on in-memory
    page usage (including anonymous MAP_PRIVATE copied-on-write pages)
    associated with a particular hugetlbfs filesystem instance.

    Worse, they work by having callbacks to the hugetlbfs filesystem code from
    the low-level page handling code, in particular from free_huge_page().
    This is a layering violation of itself, but more importantly, if the
    kernel does a get_user_pages() on hugepages (which can happen from KVM
    amongst others), then the free_huge_page() can be delayed until after the
    associated inode has already been freed. If an unmount occurs at the
    wrong time, even the hugetlbfs superblock where the "quota" limits are
    stored may have been freed.

    Andrew Barry proposed a patch to fix this by having hugepages store
    pointers directly to the superblock, instead of storing a pointer to
    their address_space and reaching the superblock from there, bumping the
    superblock's reference count as appropriate to avoid it being freed.
    Andrew Morton rejected that version, however, on the grounds that it made
    the existing layering violation worse.

    This is a reworked version of Andrew's patch, which removes the extra, and
    some of the existing, layering violation. It works by introducing the
    concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
    finite logical pool of hugepages to allocate from. hugetlbfs now creates
    a subpool for each filesystem instance with a page limit set, and a
    pointer to the subpool gets added to each allocated hugepage, instead of
    the address_space pointer used now. The subpool has its own lifetime and
    is only freed once all pages in it _and_ all other references to it (i.e.
    superblocks) are gone.

    subpools are optional - a NULL subpool pointer is taken by the code to
    mean that no subpool limits are in effect.

    Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
    quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
    http://marc.info/?l=linux-mm&m=126928970510627&w=1

    v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
    alloc_huge_page() - since it already takes the vma, it is not necessary.

    Signed-off-by: Andrew Barry
    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
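
    A sketch of the subpool object described above, with the lifetime rule
    made explicit (field names closely follow what landed in hugetlb.h, but
    treat them as illustrative):

        struct hugepage_subpool {
                spinlock_t lock;
                long count;             /* refs: superblock plus pages still pointing here */
                long max_hpages;        /* page limit for this hugetlbfs instance, or -1 */
                long used_hpages;       /* pages currently charged against max_hpages */
        };

        struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
        void hugepage_put_subpool(struct hugepage_subpool *spool);
        /* the subpool is only freed once count drops to zero, i.e. after the
         * superblock is gone AND the last huge page pointing at it is freed */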
     
  • There are multiple places which perform the same check. Add a new
    find_mergeable_vma() to handle this.

    Signed-off-by: Bob Liu
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
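
    The consolidated check ends up roughly like this (a sketch of the ksm.c
    helper; the exit-check detail may differ):

        static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
                                                         unsigned long addr)
        {
                struct vm_area_struct *vma;

                if (ksm_test_exit(mm))          /* mm is going away */
                        return NULL;
                vma = find_vma(mm, addr);
                if (!vma || vma->vm_start > addr)
                        return NULL;
                if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
                        return NULL;
                return vma;
        }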
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

    3.3.0-rc3 3.3.0-rc3
    rc3-vanilla nobarrier-v2r1
    Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
    Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
    Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
    Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
    Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
    Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
    Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
    Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
    Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
    Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
    Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
    Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
    Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
    Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
    Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds) 135.68 132.17
    User+Sys Time Running Test (seconds) 164.2 160.13
    Total Elapsed Time (seconds) 123.46 120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
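
    A sketch of the read-side retry this introduces, assuming the
    get/put_mems_allowed() helpers wrap a per-task seqcount and return a
    cookie (names follow later mainline; try_to_allocate() is a hypothetical
    stand-in for the allocation step):

        /* cpuset side (read path) */
        static inline unsigned int get_mems_allowed(void)
        {
                return read_seqcount_begin(&current->mems_allowed_seq);
        }

        static inline bool put_mems_allowed(unsigned int seq)
        {
                /* true == the nodemask did not change while we used it */
                return !read_seqcount_retry(&current->mems_allowed_seq, seq);
        }

        /* allocation path: retry only on a possible false failure */
        static struct page *alloc_with_mems_allowed(unsigned int order)
        {
                struct page *page;
                unsigned int cookie;

                do {
                        cookie = get_mems_allowed();
                        page = try_to_allocate(order); /* hypothetical allocation step */
                } while (!put_mems_allowed(cookie) && !page);
                return page;
        }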
     
  • The oom killer typically displays the allocation order at the time of oom
    as a part of its diagnostic messages (for global, cpuset, and mempolicy
    ooms).

    The memory controller may also pass the charge order to the oom killer so
    it can emit the same information. This is useful in determining how large
    the memory allocation is that triggered the oom killer.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • s/noticable/noticeable/

    Signed-off-by: Copot Alexandru
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Copot Alexandru
     
  • When i_mmap_lock changed to a mutex the locking order in memory failure
    was changed to take the sleeping lock first. But the big fat mm lock
    ordering comment (BFMLO) wasn't updated. Do this here.

    Pointed out by Andrew.

    Signed-off-by: Andi Kleen
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When starting a memory hog task, a desktop box w/o swap is found to go
    unresponsive for a long time. It's solely caused by lots of congestion
    waits in throttle_vm_writeout():

    gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000

    The root cause is that the dirty threshold is knocked down a lot by the
    memory hog task. Fix this by using global_dirty_limit, which decreases
    gradually on such events and guarantees that we stay above the (also
    decreasing) nr_dirty while it tracks down to the new dirty threshold.

    Signed-off-by: Fengguang Wu
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
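
    A hedged sketch of the kind of change described, using the
    hard_dirty_limit() helper that tracks global_dirty_limit (whether the
    actual patch used exactly this helper is an assumption here):

        /* in throttle_vm_writeout() */
        global_dirty_limits(&background_thresh, &dirty_thresh);
        /* don't trust an instantaneously collapsed threshold; follow the
         * slowly-decreasing global_dirty_limit instead */
        dirty_thresh = hard_dirty_limit(dirty_thresh);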
     
  • There is not much point in skipping zones during allocation based on the
    dirty usage which they'll never contribute to. And we'd like to avoid
    page reclaim waits when writing to ramfs/sysfs etc.

    Signed-off-by: Fengguang Wu
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: [3.3.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • This cpu hotplug hook was accidentally removed in commit 00a62ce91e55
    ("mm: fix Committed_AS underflow on large NR_CPUS environment")

    The visible effect of this accident: some pages are borrowed in per-cpu
    page-vectors. Truncate can deal with it, but these pages cannot be
    reused while this cpu is offline. So this is like a temporary memory
    leak.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Hansen
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extends the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
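
    A sketch of the resulting pattern in the migration syscalls (structure
    only; error handling and the permission checks themselves are elided):

        struct task_struct *task;
        struct mm_struct *mm;

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);          /* pin the task beyond the RCU section */

        /* ... credential checks, still under rcu_read_lock() ... */

        rcu_read_unlock();

        /* ... security / cpuset-node checks against the pinned task
           (these may take a mutex, hence the reference) ... */

        mm = get_task_mm(task);         /* switch from the task to its address space */
        put_task_struct(task);          /* drop the temporary reference */
        if (!mm)
                return -EINVAL;
        /* page migration proper runs against 'mm' from here on */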