26 Apr, 2012

1 commit

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct")
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it still gets dereferenced anyway by mmput()ing it.
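
    A minimal sketch of the pattern being described (reconstructed for
    illustration, not the literal kernel source; variable names are
    approximate):

        mm = get_task_mm(task);
        put_task_struct(task);

        if (mm)
                err = do_migrate_pages(mm, old, new,
                        capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
        else
                err = -EINVAL;

        mmput(mm);      /* NULL pointer dereference when no mm was found;
                           the fix is to call mmput() only when mm != NULL */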

    This would lead to the following NULL ptr deref and BUG() when calling
    migrate_pages() with a pid that has no mm struct:

    [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    [25904.194235] IP: [] mmput+0x27/0xf0
    [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
    [25904.194235] Oops: 0002 [#1] PREEMPT SMP
    [25904.194235] CPU 2
    [25904.194235] Pid: 31608, comm: trinity Tainted: G W 3.4.0-rc2-next-20120412-sasha #69
    [25904.194235] RIP: 0010:[] [] mmput+0x27/0xf0
    [25904.194235] RSP: 0018:ffff880077d49e08 EFLAGS: 00010202
    [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
    [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
    [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
    [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
    [25904.194235] FS: 00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
    [25904.194235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
    [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
    [25904.194235] Stack:
    [25904.194235] ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
    [25904.194235] ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
    [25904.194235] 00000000000003ff 0000000000000000 0000000000000000 0000000000000000
    [25904.194235] Call Trace:
    [25904.194235] [] sys_migrate_pages+0x340/0x3a0
    [25904.194235] [] ? sys_migrate_pages+0xb1/0x3a0
    [25904.194235] [] system_call_fastpath+0x16/0x1b
    [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
    [25904.194235] RIP [] mmput+0x27/0xf0
    [25904.194235] RSP
    [25904.194235] CR2: 0000000000000050
    [25904.348999] ---[ end trace a307b3ed40206b4b ]---

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Mar, 2012

3 commits

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
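
    On the fast path the pair of full barriers becomes a seqcount read
    section (a sketch of the pattern, assuming the task gains a
    mems_allowed_seq seqcount as described; allocator arguments are elided):

        static inline unsigned int get_mems_allowed(void)
        {
                return read_seqcount_begin(&current->mems_allowed_seq);
        }

        /* Returns true if the nodemask did not change over the section. */
        static inline bool put_mems_allowed(unsigned int seq)
        {
                return !read_seqcount_retry(&current->mems_allowed_seq, seq);
        }

    and an allocation attempt retries on a possible false failure roughly
    like:

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = get_mems_allowed();
                page = get_page_from_freelist(...);     /* no mfence here */
        } while (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page));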

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                                     3.3.0-rc3            3.3.0-rc3
                                   rc3-vanilla       nobarrier-v2r1
    Clients 1 UserTime            0.07 (  0.00%)      0.08 (-14.19%)
    Clients 2 UserTime            0.07 (  0.00%)      0.07 (  2.72%)
    Clients 4 UserTime            0.08 (  0.00%)      0.07 (  3.29%)
    Clients 1 SysTime             0.70 (  0.00%)      0.65 (  6.65%)
    Clients 2 SysTime             0.85 (  0.00%)      0.82 (  3.65%)
    Clients 4 SysTime             1.41 (  0.00%)      1.41 (  0.32%)
    Clients 1 WallTime            0.77 (  0.00%)      0.74 (  4.19%)
    Clients 2 WallTime            0.47 (  0.00%)      0.45 (  3.73%)
    Clients 4 WallTime            0.38 (  0.00%)      0.37 (  1.58%)
    Clients 1 Flt/sec/cpu    497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu    414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu    257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients 1 Flt/sec        495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients 2 Flt/sec        820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients 4 Flt/sec       1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)           135.68    132.17
    User+Sys Time Running Test (seconds)      164.2     160.13
    Total Elapsed Time (seconds)              123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extends the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.
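
    In code terms the syscall-side flow becomes roughly the following (a
    simplified sketch; error handling and the individual permission checks
    are abbreviated):

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);          /* pin the task across the checks */
        rcu_read_unlock();

        /* credential check, security_task_movememory(task),
           cpuset_mems_allowed(task) -- the latter may sleep */

        mm = get_task_mm(task);         /* switch from the task to its mm */
        put_task_struct(task);          /* task no longer needed from here */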

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem, khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code review it
    seems vm86 mode on 32bit kernels requires that too unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).
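
    The helper implementing this idea (a reconstruction close to the
    pmd_none_or_trans_huge_or_clear_bad() the fix introduces; details may
    differ from the final version) reads the pmd once into a local variable
    and only then tests it:

        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                /* rely on the compiler for a single read of *pmd */
                pmd_t pmdval = *pmd;
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();      /* stabilize pmdval on the stack/register */
        #endif
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }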

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

       143 void pmd_clear_bad(pmd_t *pmd)
       144 {
    -> 145         pmd_ERROR(*pmd);
       146         pmd_clear(pmd);
       147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

       1381         if (mapcount != page_mapcount(page))
       1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
       1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                  virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |                     |
      2 MB  | |/////////////////////|-.
      huge <  |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
        .-----------> // sneaks in here as shown below.
        |             //
        |             if (pmd_none_or_clear_bad(pmd))
        |             {
        |                 if (unlikely(pmd_bad(*pmd)))
        |                     pmd_clear_bad
        |                     {
        |                         pmd_ERROR
        |                           // Log "bad pmd ..." message here.
        |                         pmd_clear
        |                           // Clear the page's PMD entry.
        |                           // Thread B incremented the map count
        |                           // in page_add_new_anon_rmap(), but
        |                           // now the page is no longer mapped
        |                           // by a PMD entry (-> inconsistency).
        |                     }
        |             }
        |
        v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Mar, 2012

1 commit

  • Several users of "find_vma_prev()" were not in fact interested in the
    previous vma if there was no primary vma to be found either. And in
    those cases, we're much better off just using the regular "find_vma()",
    and then "prev" can be looked up by just checking vma->vm_prev.

    The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
    commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return
    prev by reference" means that it generates worse code too.

    Thus this "let's avoid using this inconvenient and clearly too subtle
    interface when we don't really have to" patch.
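
    In code terms, each converted call site changes from
    find_vma_prev(mm, addr, &prev) to the plain lookup plus the vm_prev
    link, roughly:

        vma = find_vma(mm, addr);
        if (vma)
                prev = vma->vm_prev;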

    Cc: Mikulas Patocka
    Cc: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Jan, 2012

1 commit

  • This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
    mode that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.
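
    The resulting modes look roughly like this (a sketch of the enum the
    patch introduces; the comments paraphrase the behaviour described
    above):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* async compaction: never blocks */
                MIGRATE_SYNC_LIGHT,     /* sync compaction: may block, but
                                           avoids writing back dirty pages */
                MIGRATE_SYNC,           /* e.g. memory hotplug: fully
                                           synchronous, including writeback */
        };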

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jan, 2012

1 commit


30 Dec, 2011

1 commit

  • commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") was a
    slightly incorrect fix.

    Why? Consider the following case.

    1. map 4 pages of a file at offset 0

    [0123]

    2. map 2 pages just after the first mapping of the same file but with
    page offset 2

    [0123][23]

    3. mbind() 2 pages from the first mapping at offset 2.
       mbind_range() should treat the new vma as

       [0123][23]
         |23|
        mbind vma

       but it actually does

       [0123][23]
       |01|
       mbind vma

       Oops. It then does the wrong vma merge and split ([01][0123] or similar).

    This patch fixes it.
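
    The core of the fix is to derive the pgoff passed to vma_merge() from the
    mbind() start address rather than from the vma's own start (a sketch;
    variable names follow mbind_range(), and the surrounding loop is elided):

        /* offset of 'start' within the file backing 'vma' */
        pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
        new = vma_merge(mm, prev, start, end, vma->vm_flags, vma->anon_vma,
                        vma->vm_file, pgoff, new_pol);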

    [testcase]
    test result - before the patch

    case4: 126: test failed. expect '2,4', actual '2,2,2'
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: 246: test failed. expect '4,2', actual '1,4'

    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC

    (snip long bug on messages)

    test result - after the patch

    case4: passed
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: passed

    source: mbind_vma_test.c
    ============================================================
    #include <numa.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static unsigned long pagesize;
    void* mmap_addr;
    struct bitmask *nmask;
    char buf[1024];
    FILE *file;
    char retbuf[10240] = "";
    int mapped_fd;

    char *rubysrc = "ruby -e '\
    pid = %d; \
    vstart = 0x%llx; \
    vend = 0x%llx; \
    s = `pmap -q #{pid}`; \
    rary = []; \
    s.each_line {|line|; \
    ary=line.split(\" \"); \
    addr = ary[0].to_i(16); \
    if(vstart <= addr && addr < vend) then \
    rary.push(ary[1].to_i()/4); \
    end; \
    }; \
    print rary.join(\",\"); \
    '";

    void init(void)
    {
            void* addr;
            char buf[128];

            nmask = numa_allocate_nodemask();
            numa_bitmask_setbit(nmask, 0);

            pagesize = getpagesize();

            sprintf(buf, "%s", "mbind_vma_XXXXXX");
            mapped_fd = mkstemp(buf);
            if (mapped_fd == -1)
                    perror("mkstemp "), exit(1);
            unlink(buf);

            if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
                    perror("lseek "), exit(1);
            if (write(mapped_fd, "\0", 1) < 0)
                    perror("write "), exit(1);

            addr = mmap(NULL, pagesize*8, PROT_NONE,
                        MAP_SHARED, mapped_fd, 0);
            if (addr == MAP_FAILED)
                    perror("mmap "), exit(1);

            if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
                    perror("mprotect "), exit(1);

            mmap_addr = addr + pagesize;

            /* make page populate */
            memset(mmap_addr, 0, pagesize*6);
    }

    void fin(void)
    {
            void* addr = mmap_addr - pagesize;
            munmap(addr, pagesize*8);

            memset(buf, 0, sizeof(buf));
            memset(retbuf, 0, sizeof(retbuf));
    }

    void mem_bind(int index, int len)
    {
            int err;

            err = mbind(mmap_addr+pagesize*index, pagesize*len,
                        MPOL_BIND, nmask->maskp, nmask->size, 0);
            if (err)
                    perror("mbind "), exit(err);
    }

    void mem_interleave(int index, int len)
    {
            int err;

            err = mbind(mmap_addr+pagesize*index, pagesize*len,
                        MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
            if (err)
                    perror("mbind "), exit(err);
    }

    void mem_unbind(int index, int len)
    {
            int err;

            err = mbind(mmap_addr+pagesize*index, pagesize*len,
                        MPOL_DEFAULT, NULL, 0, 0);
            if (err)
                    perror("mbind "), exit(err);
    }

    void Assert(char *expected, char *value, char *name, int line)
    {
            if (strcmp(expected, value) == 0) {
                    fprintf(stderr, "%s: passed\n", name);
                    return;
            }
            else {
                    fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
                            name, line,
                            expected, value);
                    // exit(1);
            }
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPNNNNNNNNNN
    case 4 below
    */
    void case4(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            mem_bind(0, 4);
            mem_unbind(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("2,4", retbuf, "case4", __LINE__);

            fin();
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPPPPPPPPPNN
    case 5 below
    */
    void case5(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            mem_bind(0, 2);
            mem_bind(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("4,2", retbuf, "case5", __LINE__);

            fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPPPPP 6
    */
    void case6(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            mem_bind(0, 2);
            mem_bind(4, 2);
            mem_bind(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("6", retbuf, "case6", __LINE__);

            fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPXXXX 7
    */
    void case7(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            mem_bind(0, 2);
            mem_interleave(4, 2);
            mem_bind(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("4,2", retbuf, "case7", __LINE__);

            fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPNNNNNNNN 8
    */
    void case8(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            mem_bind(0, 2);
            mem_interleave(4, 2);
            mem_interleave(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("2,4", retbuf, "case8", __LINE__);

            fin();
    }

    void case_n(void)
    {
            init();
            sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

            /* make redundant mappings [0][1234][34][7] */
            mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
                 MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);

            /* Expect to do nothing. */
            mem_unbind(2, 2);

            file = popen(buf, "r");
            fread(retbuf, sizeof(retbuf), 1, file);
            Assert("4,2", retbuf, "case_n", __LINE__);

            fin();
    }

    int main(int argc, char** argv)
    {
            case4();
            case5();
            case6();
            case7();
            case8();
            case_n();

            return 0;
    }
    =============================================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: [3.1.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • Quiet the sparse noise:

    warning: symbol 'default_policy' was not declared. Should it be static?

    Signed-off-by: H Hartley Sweeten
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

31 Oct, 2011

1 commit


15 Sep, 2011

2 commits

  • When compiling mm/mempolicy.c with struct user copy checks the following
    warning is shown:

    In file included from arch/x86/include/asm/uaccess.h:572,
    from include/linux/uaccess.h:5,
    from include/linux/highmem.h:7,
    from include/linux/pagemap.h:10,
    from include/linux/mempolicy.h:70,
    from mm/mempolicy.c:68:
    In function `copy_from_user',
    inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
    arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
    LD mm/built-in.o

    Fix this by passing correct buffer size value.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really
    fix the mbind vma merge problem, due to a wrong pgoff value being passed
    to vma_merge(), which made vma_merge() always return NULL.

    Before the patch was applied, we were getting a result like:

    addr = 0x7fa58f00c000
    [snip]
    7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
    7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
    7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0

    here for 7fa58f00c000->7fa58f00f000 we get 3 VMAs which are expected to
    be merged, as described in commit 9d8cebd.

    Re-testing the patched kernel with the reproducer provided in commit
    9d8cebd, we get the correct result:

    addr = 0x7ffa5aaa2000
    [snip]
    7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
    7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack]

    Signed-off-by: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Caspar Zhang
     

27 Jul, 2011

1 commit

  • [ This patch has already been accepted as commit 0ac0c0d0f837 but later
    reverted (commit 35926ff5fba8) because it introduced an arch-specific
    __node_random which was defined only for x86 code so it broke other
    archs. This is a followup without any arch specific code. Other than
    that there are no functional changes.]

    Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is
    that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
    at node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    of the cpuset.
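
    In code the change amounts to roughly the following (a sketch; the exact
    initialization site may differ, node_random() being the arch-independent
    helper this version relies on):

        /* seed the spread rotor with a random node allowed to the task */
        current->cpuset_mem_spread_rotor = node_random(&current->mems_allowed);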

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    [mhocko@suse.cz: Make it arch independent]
    [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: David Rientjes
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Jack Steiner
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Michal Hocko
    Cc: Paul Menage
    Cc: Pekka Enberg
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 May, 2011

6 commits

  • Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
    issues.

    - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

    - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

    - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • This function has been superseded by gather_hugetbl_stats() and is no
    longer needed.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Improve the prototype of gather_stats() to take a struct numa_maps as
    argument instead of a generic void *. Update all callers to make the
    required type explicit.

    Since gather_stats() is not needed before its definition and is scheduled
    to be moved out of mempolicy.c the declaration is removed as well.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Mapping statistics in a NUMA environment is now computed using the generic
    walk_page_range() logic. Remove the old/equivalent functionality.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • Converting show_numa_map() to use the generic routine decouples the
    function from mempolicy.c, allowing it to be moved out of the mm subsystem
    and into fs/proc.

    Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk
    logic implemented by check_pte_range() failed to account for such pages as
    they were not applicable to the page migration case.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
    get_vma_policy() was marked static as all clients were local to
    mempolicy.c.

    However, the decision to generate /proc/pid/numa_maps in the numa memory
    policy code and outside the procfs subsystem introduces an artificial
    interdependency between the two systems. Exporting get_vma_policy() once
    again is the first step to clean up this interdependency.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     

23 Mar, 2011

1 commit


19 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
    doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
    Update cpuset info & webiste for cgroups
    dcdbas: force SMI to happen when expected
    arch/arm/Kconfig: remove one to many l's in the word.
    asm-generic/user.h: Fix spelling in comment
    drm: fix printk typo 'sracth'
    Remove one to many n's in a word
    Documentation/filesystems/romfs.txt: fixing link to genromfs
    drivers:scsi Change printk typo initate -> initiate
    serial, pch uart: Remove duplicate inclusion of linux/pci.h header
    fs/eventpoll.c: fix spelling
    mm: Fix out-of-date comments which refers non-existent functions
    drm: Fix printk typo 'failled'
    coh901318.c: Change initate to initiate.
    mbox-db5500.c Change initate to initiate.
    edac: correct i82975x error-info reported
    edac: correct i82975x mci initialisation
    edac: correct commented info
    fs: update comments to point correct document
    target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
    ...

    Trivial conflict in fs/eventpoll.c (spelling vs addition)

    Linus Torvalds
     

05 Mar, 2011

2 commits

  • Pass down the correct node for a transparent hugepage allocation. Most
    callers continue to use the current node, however the hugepaged daemon
    now uses the previous node of the first to be collapsed page instead.
    This ensures that khugepaged does not mess up local memory for an
    existing process which uses local policy.

    The choice of node is somewhat primitive currently: it just uses the
    node of the first page in the pmd range. An alternative would be to
    look at multiple pages and use the most popular node. I used the
    simplest variant for now which should work well enough for the case of
    all pages being on the same node.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but will be needed for followons.
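
    A sketch of what the interface change amounts to (prototype simplified;
    existing callers simply pass the local node, hence no behaviour change):

        /* alloc_pages_vma() now takes the policy node explicitly */
        page = alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id());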

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

01 Mar, 2011

1 commit


26 Feb, 2011

1 commit

  • The THP code didn't pass the correct interleaving shift to the memory
    policy code. Fix this here by adjusting for the order.
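
    In other words, interleaving for a THP should be computed on
    huge-page-sized chunks, roughly (a sketch using the existing
    interleave_nid() helper; 'order' is the allocation order):

        /* use the huge page, not the base page, as the interleave unit */
        nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);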

    Signed-off-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Jan, 2011

5 commits

  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling as the
    allocation has to happen inside collapse_huge_page where the vma is known
    and an error has to be returned to the outer loop to sleep
    alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in case of
    CONFIG_NUMA=n.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundred of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Today, tasklist_lock in migrate_pages doesn't protect anything.
    rcu_read_lock() provides enough protection for the pid hash walk.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial passes fail.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

03 Dec, 2010

1 commit


29 Oct, 2010

1 commit

  • When a node contains only HighMem memory, slab_node(MPOL_BIND)
    dereferences a NULL pointer.

    [ This code seems to go back all the way to commit 19770b32609b: "mm:
    filter based on a nodemask as well as a gfp_mask". Which was back in
    April 2008, and it got merged into 2.6.26. - Linus ]

    Signed-off-by: Eric Dumazet
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

27 Oct, 2010

2 commits

  • Function check_range may return ERR_PTR(...). Check for it.

    Signed-off-by: Vasiliy Kulikov
    Acked-by: David Rientjes
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Presently update_nr_listpages() doesn't have a role. That's because the
    list passed in is always empty just after calling migrate_pages(), which
    has cleaned up the pages that failed to migrate before returning ever
    since commit aaa994b3:

    [PATCH] page migration: handle freeing of pages in migrate_pages()

    Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    At that time, we thought we didn't need any postprocessing of pages.
    But the situation has changed: compaction needs to know how many pages
    failed to migrate for the COMPACTPAGEFAILED stat.

    This patch makes a new rule for callers of migrate_pages(): the caller
    must call putback_lru_pages() itself. The caller thus has to clean up
    the list, and gets a chance to postprocess the pages. [suggested by
    Christoph Lameter]
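
    Under the new rule a caller such as compaction looks roughly like this
    (a sketch; the migrate_pages() argument list is abbreviated):

        nr_failed = migrate_pages(&cc->migratepages, compaction_alloc,
                                  (unsigned long)cc, 0);
        if (nr_failed)
                count_vm_events(COMPACTPAGEFAILED, nr_failed);
        /* failed pages stay on the list; the caller now puts them back */
        putback_lru_pages(&cc->migratepages);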

    Signed-off-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Reviewed-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 Aug, 2010

2 commits

  • migrate_pages() is using >500 bytes stack. Reduce it.

    mm/mempolicy.c: In function 'sys_migrate_pages':
    mm/mempolicy.c:1344: warning: the frame size of 528 bytes is larger than 512 bytes

    [akpm@linux-foundation.org: don't play with a might-be-NULL pointer]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The oom killer presently kills current whenever there is no more memory
    free or reclaimable on its mempolicy's nodes. There is no guarantee that
    current is a memory-hogging task or that killing it will free any
    substantial amount of memory, however.

    In such situations, it is better to scan the tasklist for tasks that are
    allowed to allocate on current's set of nodes and kill the task with the
    highest badness() score. This ensures that the most memory-hogging task,
    or the one configured by the user with /proc/pid/oom_adj, is always
    selected in such scenarios.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Jun, 2010

1 commit

  • My patch to "Factor out duplicate put/frees in mpol_shared_policy_init()
    to a common return path"; and Dan Carpenter's fix thereto both left a
    dangling reference to the incoming tmpfs superblock mempolicy structure.
    A similar leak was introduced earlier when the nodemask was moved offstack
    to the scratch area despite the note in the comment block regarding the
    incoming ref.

    Move the remaining 'put' of the incoming "mpol" to the common exit path to
    drop the reference.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Dan Carpenter
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

26 May, 2010

1 commit


25 May, 2010

1 commit

  • Use mm->task_size instead of TASK_SIZE to ensure that the entire user
    address space is migrated. mm->task_size is independent of the calling
    task context. TASK_SIZE may be dependent on the address space size of the
    calling process. Usage of TASK_SIZE can lead to partial address space
    migration if the calling process was 32 bit and the migrating process was
    64 bit.
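
    A sketch of the kind of change this implies at the call site that walks
    the target address space (illustrative only; check_range() is the
    mempolicy.c helper and the remaining arguments are elided):

        /* before: bounded by the *caller's* TASK_SIZE */
        check_range(mm, mm->mmap->vm_start, TASK_SIZE, ...);

        /* after: bounded by the target mm's own size */
        check_range(mm, mm->mmap->vm_start, mm->task_size, ...);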

    Here is the test script used on a 64 bit system with a 32 bit echo process:

    mount -t cgroup none /cgroup -o cpuset
    cd /cgroup

    mkdir 0
    echo 1 > 0/cpuset.cpus
    echo 0 > 0/cpuset.mems
    echo 1 > 0/cpuset.memory_migrate

    mkdir 1
    echo 1 > 1/cpuset.cpus
    echo 1 > 1/cpuset.mems
    echo 1 > 1/cpuset.memory_migrate

    echo $$ > 0/tasks
    64_bit_process &
    pid=$!

    echo $pid > 1/tasks # This does not migrate all process pages without
    # this patch. If 64 bit echo is used or this patch is
    # applied, then the full address space of $pid is
    # migrated.

    To check memory migration, I watched:
    grep MemUsed /sys/devices/system/node/node*/meminfo

    Signed-off-by: Greg Thelen
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen