28 Jun, 2011

7 commits

  • Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
    reclaim") adds a soft limit hook to shrink_zones(), so soft limit reclaim
    is now invoked along this path:

        try_to_free_pages()
            do_try_to_free_pages()
                shrink_zones()
                    mem_cgroup_soft_limit_reclaim()

    With this, direct reclaim is aware of the memcg soft limit hints.

    But the memory cgroup's hard-limit path can also reach the soft limit
    shrinker:

        try_to_free_mem_cgroup_pages()
            do_try_to_free_pages()
                shrink_zones()
                    mem_cgroup_soft_limit_reclaim()

    This causes global reclaim whenever a memcg hits its limit.

    This is a bug: mem_cgroup_soft_limit_reclaim() should only be called when
    scanning_global_lru(sc) == true.

    The same commit also adds a variable "total_scanned" for counting the
    pages scanned by soft limit reclaim, but it is not really a "total". This
    patch removes the variable and updates sc->nr_scanned instead. That
    affects shrink_slab()'s scan condition, but since soft limit reclaim
    scans the global LRU, the change makes sense.

    TODO: avoid too much scanning of a zone when softlimit did enough work.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Ying Han
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Under heavy memory and filesystem load, users observe the assertion
    mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
    page reclaim reclaiming the last page from a mapping in the following
    race:

    CPU0                                  CPU1
    ...
    shrink_page_list()
      __remove_mapping()
        __delete_from_page_cache()
          radix_tree_delete()
                                          evict_inode()
                                            truncate_inode_pages()
                                              truncate_inode_pages_range()
                                                pagevec_lookup() - finds nothing
                                            end_writeback()
                                              mapping->nrpages != 0 -> BUG
          page->mapping = NULL
          mapping->nrpages--

    Fix the problem by doing a reliable check of mapping->nrpages under
    mapping->tree_lock in end_writeback().

    Analyzed by Jay, lost in LKML, and dug out by Miklos Szeredi.

    Cc: Jay
    Cc: Miklos Szeredi
    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We cannot take a mutex while holding a spinlock, so flip the order and
    fix the locking documentation.
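
    The rule generalizes beyond this file; a userspace sketch of the correct
    ordering (pthread primitives standing in for the kernel locks, purely
    illustrative, not the patched code):

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_spinlock_t spin;
        static int shared;

        static void update(void)
        {
                /* Correct order: the sleeping lock (mutex) is taken first,
                 * the busy-waiting lock (spinlock) only inside it. */
                pthread_mutex_lock(&mutex);
                pthread_spin_lock(&spin);
                shared++;
                pthread_spin_unlock(&spin);
                pthread_mutex_unlock(&mutex);
        }

        int main(void)
        {
                pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
                update();
                printf("shared = %d\n", shared);
                pthread_spin_destroy(&spin);
                return 0;
        }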

    Signed-off-by: Peter Zijlstra
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
    is unsuited to tmpfs, because it inserts a page into pagecache before
    calling the filesystem's ->readpage: tmpfs may have pages in swapcache
    which only it knows how to locate and switch to filecache.

    At present tmpfs provides a ->readpage method, and copes with this by
    copying pages; but soon we can simplify it by removing its ->readpage.
    Provide shmem_read_mapping_page_gfp() now, ready for that transition.

    Export shmem_read_mapping_page_gfp() and add it to the list in
    shmem_fs.h, with shmem_read_mapping_page() inline for the common
    mapping_gfp case.

    (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
    read_mapping_page functions use the mapping's ->readpage, and the
    read_cache_page functions use the supplied filler, so I think
    read_cache_page_gfp was slightly misnamed.)

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • 2.6.35's new truncate convention gave tmpfs the opportunity to control
    its file truncation, no longer enforced from outside by vmtruncate().
    We shall want to build upon that, to handle pagecache and swap together.

    Slightly redefine the ->truncate_range interface: let it now be called
    between the unmap_mapping_range()s, with the filesystem responsible for
    doing the truncate_inode_pages_range() from it - just as the filesystem
    is nowadays responsible for doing that from its ->setattr.

    Let's rename shmem_notify_change() to shmem_setattr(). Instead of
    calling the generic truncate_setsize(), bring that code in so we can
    call shmem_truncate_range() - which will later be updated to perform its
    own variant of truncate_inode_pages_range().

    Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
    now that the COW's unmap_mapping_range() comes after ->truncate_range,
    there is no need to call it a third time.

    Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
    that i915_gem_object_truncate() can call it explicitly in future; get
    this patch in first, then update drm/i915 once this is available (until
    then, i915 will just be doing the truncate_inode_pages() twice).

    Though introduced five years ago, no other filesystem is implementing
    ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
    expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
    whereupon ->truncate_range can be removed from inode_operations -
    shmem_truncate_range() will help i915 across that transition too.
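
    For reference, a minimal userspace sketch of the hole-punching operation
    that madvise(,,MADV_REMOVE) and fallocate(,FALLOC_FL_PUNCH_HOLE,,) expose
    (flags as documented in fallocate(2); filesystem support varies, and this
    program is not part of the patch):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <linux/falloc.h>

        int main(int argc, char **argv)
        {
                int fd;

                if (argc != 4) {
                        fprintf(stderr, "usage: %s <file> <offset> <len>\n",
                                argv[0]);
                        return 1;
                }
                fd = open(argv[1], O_RDWR);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* Deallocate the byte range, leaving a hole; KEEP_SIZE keeps
                 * the file length unchanged. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              atoll(argv[2]), atoll(argv[3])) < 0)
                        perror("fallocate");
                close(fd);
                return 0;
        }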

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • You would expect to find vmtruncate_range() next to vmtruncate() in
    mm/truncate.c: move it there.

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Jun, 2011

2 commits

  • Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
    zonelist") does not protect the build_all_zonelists() call with
    zonelists_mutex as needed. This can lead to races in constructing
    zonelist ordering if a concurrent build is underway. Protecting this
    with lock_memory_hotplug() is insufficient since zonelists can be
    rebuilt through sysfs as well.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The error handling in mem_online_node() is incorrect: hotadd_new_pgdat()
    returns NULL if the new pgdat could not be allocated and a pointer to it
    otherwise.

    mem_online_node() should fail if hotadd_new_pgdat() fails, not the
    inverse. This fixes an issue where memoryless nodes were not onlined
    and their sysfs interface was not registered when their first cpu was
    brought up.

    The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable
    CPUs online before local memory online"), i.e. in v2.6.35.
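
    A toy illustration of the corrected check (stand-in types and functions,
    not the kernel's): a NULL return is the failure case and must be treated
    as such by the caller.

        #include <stdio.h>
        #include <stdlib.h>

        /* Stand-in for an allocator that, like hotadd_new_pgdat(), returns
         * NULL on failure and a pointer to the new object otherwise. */
        struct pgdat_stub { int nid; };

        static struct pgdat_stub *alloc_pgdat_stub(int nid, int fail)
        {
                struct pgdat_stub *p;

                if (fail)
                        return NULL;
                p = malloc(sizeof(*p));
                if (p)
                        p->nid = nid;
                return p;
        }

        static int online_node_stub(int nid, int fail)
        {
                struct pgdat_stub *pgdat = alloc_pgdat_stub(nid, fail);

                /* Correct: bail out when the allocation failed (NULL),
                 * not when it succeeded. */
                if (!pgdat)
                        return -1;
                printf("node %d onlined\n", pgdat->nid);
                free(pgdat);
                return 0;
        }

        int main(void)
        {
                online_node_stub(0, 0);
                if (online_node_stub(1, 1) < 0)
                        printf("node 1: allocation failed, not onlined\n");
                return 0;
        }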

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    David Rientjes
     

18 Jun, 2011

3 commits

  • Hugh Dickins points out that lockdep (correctly) spots a potential
    deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
    of anon_vma_chain while doing anon_vma_clone(). The problem is that
    page reclaim will want to take the anon_vma lock of any anonymous pages
    that it will try to reclaim.

    So re-organize the code in anon_vma_clone() slightly: first do just a
    GFP_NOWAIT allocation, which will usually work fine. But if that fails,
    let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

    End result: not only do we avoid the locking problem, this also ends up
    getting better concurrency in case the allocation does need to block.
    Tim Chen reports that with all these anon_vma locking tweaks, we're now
    almost back up to the spinlock performance.
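
    The shape of the fix, sketched with userspace stand-ins (a pthread mutex
    for the anon_vma lock, a fallible fast allocator for GFP_NOWAIT; all
    names here are illustrative):

        #include <pthread.h>
        #include <stdlib.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        /* Stand-in for a GFP_NOWAIT attempt: may fail, never blocks. */
        static void *alloc_nowait(size_t size)
        {
                return rand() % 2 ? malloc(size) : NULL;
        }

        /* Allocate while holding 'lock', falling back to a blocking
         * allocation with the lock dropped and then re-taken: the shape of
         * the anon_vma_chain change described above. */
        static void *alloc_chain_locked(size_t size)
        {
                void *obj;

                pthread_mutex_lock(&lock);
                obj = alloc_nowait(size);            /* usually succeeds */
                if (!obj) {
                        pthread_mutex_unlock(&lock); /* never block under it */
                        obj = malloc(size);          /* GFP_KERNEL stand-in */
                        pthread_mutex_lock(&lock);   /* re-take and continue */
                }
                /* ... link obj into the locked structure here ... */
                pthread_mutex_unlock(&lock);
                return obj;
        }

        int main(void)
        {
                free(alloc_chain_locked(64));
                return 0;
        }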

    Reported-and-tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This matches the anon_vma_clone() case, and uses the same lock helper
    functions. Because of the need to potentially release the anon_vma's,
    it's a bit more complex, though.

    We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
    the anon_vma lock (with the helper function that only takes the lock
    once for the whole loop), and removes any entries that don't need any
    more processing.

    The second phase just traverses the remaining list entries (without
    holding the anon_vma lock), and does any actual freeing of the
    anon_vma's that is required.

    Signed-off-by: Peter Zijlstra
    Tested-by: Hugh Dickins
    Tested-by: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
    vma, locking the anon_vma for each entry.

    But they are all going to have the same root entry, which means that
    we're locking and unlocking the same lock over and over again. Which is
    expensive in locked operations, but can get _really_ expensive when that
    root entry sees any kind of lock contention.

    In fact, Tim Chen reports a big performance regression due to this: when
    we switched to use a mutex instead of a spinlock, the contention case
    gets much worse.

    So to alleviate this all, this commit creates a small helper function
    (lock_anon_vma_root()) that can be used to take the lock just once
    rather than taking and releasing it over and over again.

    We still have the same "take the lock and release" it behavior in the
    exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
    since we're actually freeing the anon_vma entries as we go, and that
    will touch the lock too.

    Reported-and-tested-by: Tim Chen
    Tested-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jun, 2011

1 commit

  • Swapcache reaches a code path in migrate_page_move_mapping() where it
    would be accounted as NR_SHMEM, but swapcache is accounted as
    NR_FILE_PAGES and not as NR_SHMEM.

    Hugh pointed out we must use PageSwapCache instead of comparing
    mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Jun, 2011

21 commits

  • We have some users of this function that date back to before the vma
    list was doubly linked, and just are silly. These days, you can find
    the previous vma by just following the vma->vm_prev pointer.

    In some cases you don't need any find_vma() lookup at all, and in other
    cases you're better off with the regular "find_vma()" that uses the vma
    cache front-end lookup.

    Some "find_vma_prev()" users are still valid, though. For example, in
    the case of a stack that grows up, it can be the case that we don't find
    any 'vma' at all (because we're looking up an address that is past the
    last vma), and that the stack that we want to grow is the 'prev' vma.

    But that kind of special case aside, we generally should prefer to use
    'find_vma()'.

    Noticed due to a totally unrelated POWER memory corruption bug that just
    happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
    using that function here?".

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Andrea Righi reported a case where an exiting task can race against
    ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
    triggering a NULL pointer dereference in ksmd.

    ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

    CPU 1 (__ksm_exit)                  CPU 2 (scan_get_next_rmap_item)
                                        list_empty() is false
    lock                                slot == &ksm_mm_head
    list_del(slot->mm_list)
    (list now empty)
    unlock
                                        lock
                                        slot = list_entry(slot->mm_list.next)
                                        (list is empty, so slot is still ksm_mm_head)
                                        unlock
                                        slot->mm == NULL ... Oops

    Close this race by revalidating that the new slot is not simply the list
    head again.

    Andrea's test case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define BUFSIZE getpagesize()

    int main(int argc, char **argv)
    {
            void *ptr;

            if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
                    perror("posix_memalign");
                    exit(1);
            }
            if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
                    perror("madvise");
                    exit(1);
            }
            *(char *)NULL = 0;  /* deliberate crash while registered with KSM */

            return 0;
    }

    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Asynchronous compaction is used when promoting to huge pages. This is all
    very nice but if there are a number of processes compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result with a user reporting that
    firefox can stall for 10s of seconds. This patch aborts asynchronous
    compaction if too many pages are isolated as it's better to fail a
    hugepage promotion than stall a process.

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is unsafe to run page_count during the physical pfn scan because
    compound_head could trip on a dangling pointer when reading
    page->first_page if the compound page is being freed by another CPU.

    [mgorman@suse.de: split out patch]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Compaction works with two scanners, a migration and a free scanner. When
    the scanners crossover, migration within the zone is complete. The
    location of the scanner is recorded on each cycle to avoid excessive
    scanning.

    When a zone is small and mostly reserved, it's very easy for the migration
    scanner to be close to the end of the zone. Then the following situation
    can occur:

    o migration scanner isolates some pages near the end of the zone
    o free scanner starts at the end of the zone but finds that the
      migration scanner is already there
    o free scanner gets reinitialised for the next cycle as
      cc->migrate_pfn + pageblock_nr_pages, moving the free scanner into
      the next zone
    o migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of the
    accounting happens against the wrong zone. One zone's counter remains
    positive while the other goes negative even though the overall global
    count is accurate. This was reported on X86-32 with !SMP because !SMP
    allows the negative counters to be visible. The fact that it has not been
    reported on other configurations is probably just a coincidence, as the
    bug should theoretically be possible there too.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • fragmentation_index() returns -1000 when the allocation might succeed,
    but this doesn't match the comment and code in compaction_suitable().
    I think compaction_suitable() should return COMPACT_PARTIAL in the -1000
    case, because in that case the allocation could succeed depending on
    watermarks.

    The impact of the current behaviour is that compaction starts and
    compact_finished() is called, which rechecks the watermarks and the free
    lists. The end result is the same (compaction does not actually proceed),
    but it is more expensive to get there.

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Pages isolated for migration are accounted with the vmstat counters
    NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to
    increment these counters when pages are isolated from the LRU. Once the
    pages have been migrated, they are put back on the LRU or freed and the
    isolated count is decremented.

    Memory failure is not properly accounting for pages it isolates causing
    the NR_ISOLATED counters to be negative. On SMP builds, this goes
    unnoticed as negative counters are treated as 0 due to expected per-cpu
    drift. On UP builds, the counter is treated by too_many_isolated() as a
    large value causing processes to enter D state during page reclaim or
    compaction. This patch accounts for pages isolated by memory failure
    correctly.
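
    The counters in question are exported through /proc/vmstat; a small
    reader that prints them (field names as exported by the kernel):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                FILE *f = fopen("/proc/vmstat", "r");
                char name[64];
                long value;

                if (!f) {
                        perror("/proc/vmstat");
                        return 1;
                }
                while (fscanf(f, "%63s %ld", name, &value) == 2) {
                        if (!strcmp(name, "nr_isolated_anon") ||
                            !strcmp(name, "nr_isolated_file"))
                                printf("%s %ld\n", name, value);
                }
                fclose(f);
                return 0;
        }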

    [mel@csn.ul.ie: rewrote changelog]
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Based on Michal Hocko's comment.

    We do not drain per-cpu cached charges during soft limit reclaim because
    background reclaim doesn't care about charges: it tries to free some
    memory, and draining cached charges would not free any.

    Cached charges could only influence the selection of the biggest soft
    limit offender, but as the call is made only after that selection has
    already been done, it makes no difference.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For performance, the memory cgroup caches some "charge" from res_counter
    in a per-cpu cache. This works well, but because it is a cache, it needs
    to be flushed in some cases. Typical cases are:

    1. when someone hits the limit.

    2. when rmdir() is called and charges need to drop to 0.

    But case "1" has a problem.

    Recently, on large SMP machines, we see many kworker runs caused by
    flushing the memcg cache. The bad part of the implementation is that the
    drain code is called even for cpus whose cache holds charges for memcgs
    unrelated to the one hitting its limit.

    This patch does the following:
    A) checks whether the percpu cache contains useful data or not.
    B) checks that no other asynchronous percpu drain is running.
    C) doesn't call the local cpu callback.

    (*) This patch avoids changing the calling conditions for the hard-limit
    case.

    When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

    [Before]
    13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
    58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
    60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
    4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
    57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
    61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
    62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
    63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

    [After]
    2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
    2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
    1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

    [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Tested-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Hierarchical reclaim doesn't swap out if the memsw and resource limits
    are the same (memsw_is_minimum == true), because we would hit the
    mem+swap limit anyway (during hard limit reclaim).

    When it comes to the soft limit we shouldn't consider memsw_is_minimum at
    all, because it doesn't make much sense there. Either the soft limit is
    below the hard limit, and then we cannot hit the mem+swap limit, or
    direct reclaim takes precedence.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 21a3c9646873 ("memcg: allocate memory cgroup structures in local
    nodes") makes page_cgroup allocation NUMA aware, but that caused a
    problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192.

    The problem was getting a NID from invalid struct pages, which were not
    initialized because they were out-of-node, i.e. outside [node_start_pfn,
    node_end_pfn).

    Currently, with sparsemem, page_cgroup_init() scans pfns from 0 to
    max_pfn, but this may scan a pfn which is not on any node and access
    memmap which is not initialized.

    This makes page_cgroup_init() for SPARSEMEM node aware and removes the
    code that gets the nid from page->flags. (Then, we always use a valid
    NID.)

    [akpm@linux-foundation.org: try to fix up comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa
    statistics") adds memory.numa_stat file for memory cgroup. But the file
    permissions are wrong.

    [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
    ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat

    This patch fixes the permissions:

    [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
    -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When 1GB hugepages are allocated on a system, free(1) reports less
    available memory than what really is installed in the box. Also, if the
    total size of hugepages allocated on a system is over half of the total
    memory size, CommitLimit becomes a negative number.

    The problem is that gigantic hugepages (order > MAX_ORDER) can only be
    allocated at boot with bootmem, thus their frames are not accounted to
    'totalram_pages'. However, they are accounted to hugetlb_total_pages().

    What happens to turn CommitLimit into a negative number is this
    calculation, in fs/proc/meminfo.c:

        allowed = ((totalram_pages - hugetlb_total_pages())
                   * sysctl_overcommit_ratio / 100) + total_swap_pages;

    A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.

    Also, every vm statistic which depends on 'totalram_pages' will render
    confusing values, as if the system were 'missing' some part of its
    memory.

    Impact of this bug:

    When gigantic hugepages are allocated and sysctl_overcommit_memory ==
    OVERCOMMIT_NEVER, __vm_enough_memory() goes through the mentioned
    'allowed' calculation and might end up mistakenly returning -ENOMEM,
    forcing the system to start reclaiming pages earlier than would be
    usual, which could have a detrimental impact on overall system
    performance, depending on the workload.

    Besides the aforementioned scenario, I can only think of this causing
    annoyances with memory reports from /proc/meminfo and free(1).
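
    The same arithmetic can be cross-checked from userspace with the
    corresponding /proc/meminfo fields and /proc/sys/vm/overcommit_ratio
    (values in kB; assumes only the default hugepage size is in use, and is
    only an illustration of the formula above):

        #include <stdio.h>
        #include <string.h>

        /* Return the value (in kB or pages) of a /proc/meminfo field. */
        static long long meminfo_field(const char *key)
        {
                FILE *f = fopen("/proc/meminfo", "r");
                char line[256];
                long long value = -1;
                size_t len = strlen(key);

                if (!f)
                        return -1;
                while (fgets(line, sizeof(line), f)) {
                        if (!strncmp(line, key, len) && line[len] == ':') {
                                sscanf(line + len + 1, "%lld", &value);
                                break;
                        }
                }
                fclose(f);
                return value;
        }

        int main(void)
        {
                long long mem = meminfo_field("MemTotal");          /* kB */
                long long swap = meminfo_field("SwapTotal");        /* kB */
                long long hpages = meminfo_field("HugePages_Total");/* pages */
                long long hsize = meminfo_field("Hugepagesize");    /* kB */
                long long ratio = 50;                               /* default */
                FILE *f = fopen("/proc/sys/vm/overcommit_ratio", "r");

                if (f) {
                        fscanf(f, "%lld", &ratio);
                        fclose(f);
                }
                /* Same shape as the kernel's 'allowed' calculation, in kB. */
                printf("recomputed CommitLimit = %lld kB\n",
                       (mem - hpages * hsize) * ratio / 100 + swap);
                printf("kernel     CommitLimit = %lld kB\n",
                       meminfo_field("CommitLimit"));
                return 0;
        }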

    [akpm@linux-foundation.org: standardize comment layout]
    Reported-by: Russ Anderson
    Signed-off-by: Rafael Aquini
    Acked-by: Russ Anderson
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • During memory hotplug we refresh zonelists when we online a page in a new
    zone. This means that a node's zonelist is not initialized until pages
    are onlined. So, for example, the "nid" passed by the MEM_GOING_ONLINE
    notifier will point to a NODE_DATA(nid) which has no zone fallback list.
    Moreover, if we hot-add cpu-only nodes, alloc_pages() will do no
    fallback.

    This patch builds a zonelist when a new pgdat becomes available.

    Note: in production at Fujitsu, memory is supposed to be onlined before
    cpus, and our servers didn't have any memory-less nodes, so we saw no
    problems. But recent changes to MEM_GOING_ONLINE+page_cgroup will access
    the uninitialized zonelist of the node. In any case, memory-less nodes
    exist and need some care.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
    allocation fails") introduced a check for cc->order == -1 in
    compact_finished(). We should continue compacting in that case because
    the request came from userspace and there is no particular order to
    compact for. A similar check was added by commit 82478fb7 ("mm:
    compaction: prevent division-by-zero during user-requested compaction")
    for compaction_suitable().

    The check is, however, done after zone_watermark_ok(), which uses order
    as a right-hand argument for shifts. Not only is the watermark check
    pointless if we can break out without it, it also evaluates 1 << -1,
    which is not well defined (at least by the C standard). Let's move the
    -1 check above zone_watermark_ok().
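
    A tiny illustration of why the ordering matters (plain C, not the kernel
    code): the sentinel has to be checked before it is ever used as a shift
    count.

        #include <stdio.h>

        /* -1 is a sentinel meaning "no particular order was requested". */
        static unsigned long pages_for_order(int order)
        {
                if (order == -1)        /* check the sentinel first ...     */
                        return 0;
                return 1UL << order;    /* ... so the shift never sees -1   */
        }

        int main(void)
        {
                printf("%lu %lu %lu\n", pages_for_order(-1),
                       pages_for_order(0), pages_for_order(3));
                return 0;
        }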

    [minchan.kim@gmail.com: caught compaction_suitable]
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Running a ktest.pl test, I hit the following bug on x86_32:

    ------------[ cut here ]------------
    WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
    Hardware name:
    Modules linked in:
    Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
    Call Trace:
    [] warn_slowpath_common+0x7c/0x91
    [] ? __kunmap_atomic+0x64/0xc1
    [] ? __kunmap_atomic+0x64/0xc1^M
    [] warn_slowpath_null+0x22/0x24
    [] __kunmap_atomic+0x64/0xc1
    [] unmap_vmas+0x43a/0x4e0
    [] exit_mmap+0x91/0xd2
    [] mmput+0x43/0xad
    [] exit_mm+0x111/0x119
    [] do_exit+0x1ff/0x5fa
    [] ? set_current_blocked+0x3c/0x40
    [] ? sigprocmask+0x7e/0x8e
    [] do_group_exit+0x65/0x88
    [] sys_exit_group+0x18/0x1c
    [] sysenter_do_call+0x12/0x38
    ---[ end trace 8055f74ea3c0eb62 ]---

    Running a ktest.pl git bisect, found the culprit: commit e303297e6c3a
    ("mm: extended batches for generic mmu_gather")

    But although this was the commit triggering the bug, it was not the one
    originally responsible for the bug. That was commit d16dfc550f53 ("mm:
    mmu_gather rework").

    The code in zap_pte_range() has something that looks like the following:

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        do {
                [...]
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap_unlock(pte - 1, ptl);

    The pte starts off pointing at the first element in the page table
    directory that was returned by the pte_offset_map_lock(). When it's done
    with the page, pte will be pointing to anything between the next entry and
    the first entry of the next page inclusive. By doing a pte - 1, this puts
    the pte back onto the original page, which is all that pte_unmap_unlock()
    needs.

    In most archs (64 bit), this is not an issue as the pte is ignored in the
    pte_unmap_unlock(). But on 32 bit archs, where things may be kmapped, it
    is essential that the pte passed to pte_unmap_unlock() resides on the same
    page that was given by pte_offset_map_lock().

    The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
    a "break;" from the while loop. This alone did not seem to easily trigger
    the bug. But the modifications made by e303297e6 caused that "break;" to
    be hit on the first iteration, before the pte++.

    The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
    be pointing to the previous page. This will cause the wrong page to be
    unmapped, and also trigger the warning above.

    The simple solution is to just save the pointer given by
    pte_offset_map_lock() and use it in the unlock.
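
    The same pitfall can be shown with ordinary array pointers (a toy
    example, not the kernel code): if the loop can break before the
    increment, "p - 1" steps back past where the loop started, so the
    original pointer has to be saved.

        #include <stdio.h>

        int main(void)
        {
                int table[4] = { 10, 20, 30, 40 };
                int *p = table;     /* what pte_offset_map_lock() hands back */
                int *start = p;     /* the fix: remember the start pointer   */
                int i = 0;

                do {
                        if (table[i] == 10)   /* break before p++ fires ... */
                                break;
                } while (p++, ++i < 4);

                /* ... so "p - 1" would step back before the start of the
                 * table (the wrong kmap page, in the kernel case), while
                 * the saved pointer is always the right one to unlock with. */
                printf("p advanced %td entries; unlock with entry %td\n",
                       p - table, start - table);
                return 0;
        }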

    Signed-off-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • While testing the memcg-aware swap token, I observed that the swap token
    was often grabbed by an intermittently running process (e.g. init,
    auditd), which then never released it.

    Why?

    Some processes (e.g. init, auditd, audispd) wake up when a process exits.
    And the swap token can be grabbed by the first process to swap a page in
    after an exiting task leaves the token without an owner. Thus such
    intermittently running processes often end up with the token.

    Currently, the swap token priority is only decreased in the page fault
    path. So if the process sleeps immediately after grabbing the swap
    token, its priority is never decreased. That's obviously undesirable.

    This patch implements a very crude (and lightweight) priority aging. It
    only affects the above corner case and doesn't change the performance of
    swap-heavy workloads (e.g. a multi-process qsbench load).

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is useful for observing swap token activity.

    example output:

    zsh-1845 [000] 598.962716: update_swap_token_priority:
    mm=ffff88015eaf7700 old_prio=1 new_prio=0
    memtoy-1830 [001] 602.033900: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=947 new_prio=949
    memtoy-1830 [000] 602.041509: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=949 new_prio=951
    memtoy-1830 [000] 602.051959: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=951 new_prio=953
    memtoy-1830 [000] 602.052188: update_swap_token_priority:
    mm=ffff880037a45880 old_prio=953 new_prio=955
    memtoy-1830 [001] 602.427184: put_swap_token:
    token_mm=ffff880037a45880
    zsh-1789 [000] 602.427281: replace_swap_token:
    old_token_mm= (null) old_prio=0 new_token_mm=ffff88015eaf7018
    new_prio=2
    zsh-1789 [001] 602.433456: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=2 new_prio=4
    zsh-1789 [000] 602.437613: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=4 new_prio=6
    zsh-1789 [000] 602.443924: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=6 new_prio=8
    zsh-1789 [000] 602.451873: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=8 new_prio=10
    zsh-1789 [001] 602.462639: update_swap_token_priority:
    mm=ffff88015eaf7018 old_prio=10 new_prio=12

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memcg reclaim can disable the swap token even if the swap
    token's mm doesn't belong to that memory cgroup. That's slightly risky:
    if an admin creates a very small mem-cgroup and some silly guy runs a
    contentious, heavy memory pressure workload in it, every task is going
    to lose the swap token and the system may become unresponsive. That's
    bad.

    This patch adds a 'memcg' parameter to disable_swap_token(); if the
    parameter doesn't match the swap token's cgroup, the VM doesn't disable
    it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Fix new kernel-doc warnings in mm/memory.c:

    Warning(mm/memory.c:1327): No description found for parameter 'tlb'
    Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'
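
    The warnings come from kernel-doc requiring the @-parameter names in the
    comment block to match the function signature; a generic illustration
    (not the actual mm/memory.c comment):

        #include <stdio.h>

        /**
         * unmap_range_demo - toy function showing kernel-doc parameter naming
         * @tlb: must match the parameter name in the signature below
         * @start: first address of the range
         * @end: one past the last address of the range
         *
         * If the line above said "@tlbp" while the parameter is named "tlb",
         * kernel-doc would report both a missing description for 'tlb' and
         * an excess parameter 'tlbp', like the warnings quoted above.
         */
        static void unmap_range_demo(void *tlb, unsigned long start,
                                     unsigned long end)
        {
                printf("tlb=%p range=[%lu, %lu)\n", tlb, start, end);
        }

        int main(void)
        {
                unmap_range_demo(NULL, 0, 4096);
                return 0;
        }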

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Johannes noticed the vmstat update is already taken care of by
    khugepaged_alloc_hugepage() internally. The only places that are required
    to update the vmstat are the callers of alloc_hugepage (callers of
    khugepaged_alloc_hugepage aren't).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Johannes Weiner
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

14 Jun, 2011

1 commit


08 Jun, 2011

1 commit


06 Jun, 2011

1 commit

  • Al Viro observes that in the hugetlb case, handle_mm_fault() may return
    a value of the kind ENOSPC when its caller is expecting a value of the
    kind VM_FAULT_SIGBUS: fix alloc_huge_page()'s failure returns.

    Signed-off-by: Hugh Dickins
    Acked-by: Al Viro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Jun, 2011

3 commits

  • Caching "we have already removed suid/caps" was overenthusiastic as merged.
    On network filesystems we might have had suid/caps set on another client,
    silently picked by this client on revalidate, all of that *without* clearing
    the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is
      * new superblock flag; unless set, S_NOSEC is not going to be set.
      * local block filesystems set it in their ->mount() (more accurately,
        mount_bdev() does, so does btrfs ->mount(); users of mount_bdev()
        other than local block ones clear it)
      * if any network filesystem (or a cluster one) wants to use S_NOSEC,
        it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear
        S_NOSEC when inode attribute changes are picked from other clients.

    It's not an earth-shattering hole (anybody that can set suid on another client
    will almost certainly be able to write to the file before doing that anyway),
    but it's a bug that needs fixing.

    Signed-off-by: Al Viro

    Al Viro
     
  • Currently, when using CONFIG_DEBUG_SLAB, we record kfree() or
    kmem_cache_free() as the last user of freed objects, which is not very
    useful, so change it to record the caller of those functions instead.
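
    The underlying idea, sketched in plain C with the GCC/Clang
    __builtin_return_address() intrinsic (not the slab debug code itself):
    record the caller's return address rather than the free routine.

        #include <stdio.h>

        /* Record the address of our caller, analogous to storing the
         * kfree()/kmem_cache_free() caller rather than the free function. */
        __attribute__((noinline)) static void record_free(void *obj)
        {
                void *caller = __builtin_return_address(0);

                printf("object %p freed from %p\n", obj, caller);
        }

        int main(void)
        {
                int x;

                record_free(&x);
                return 0;
        }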

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Suleiman Souhlal
    Signed-off-by: Pekka Enberg

    Suleiman Souhlal
     
  • On an architecture without CMPXCHG_LOCAL but with DEBUG_VM enabled,
    the VM_BUG_ON() in __pcpu_double_call_return_bool() will cause an early
    panic during boot unless we always align cpu_slab properly.

    In principle we could remove the alignment-testing VM_BUG_ON() for
    architectures that don't have CMPXCHG_LOCAL, but leaving it in means
    that new code will tend not to break x86 even if it is introduced
    on another platform, and it's low cost to require alignment.
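
    A minimal userspace illustration of the alignment being enforced (the
    2 * sizeof(void *) figure is what a double-word cmpxchg needs; the struct
    here only mirrors a freelist/tid pair, it is not the kernel code):

        #include <stdio.h>
        #include <stdint.h>

        /* The pair a double-word cmpxchg operates on must sit on a boundary
         * of twice the word size, hence the explicit alignment. */
        struct freelist_tid {
                void *freelist;
                unsigned long tid;
        } __attribute__((aligned(2 * sizeof(void *))));

        int main(void)
        {
                struct freelist_tid pair;

                printf("addr %% (2 * sizeof(void *)) = %zu\n",
                       (size_t)((uintptr_t)&pair % (2 * sizeof(void *))));
                return 0;
        }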

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Chris Metcalf
    Signed-off-by: Pekka Enberg

    Chris Metcalf