07 Jan, 2012

1 commit

  • * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    x86: Fix atomic64_xxx_cx8() functions
    x86: Fix and improve cmpxchg_double{,_local}()
    x86_64, asm: Optimise fls(), ffs() and fls64()
    x86, bitops: Move fls64.h inside __KERNEL__
    x86: Fix and improve percpu_cmpxchg{8,16}b_double()
    x86: Report cpb and eff_freq_ro flags correctly
    x86/i386: Use less assembly in strlen(), speed things up a bit
    x86: Use the same node_distance for 32 and 64-bit
    x86: Fix rflags in FAKE_STACK_FRAME
    x86: Clean up and extend do_int3()
    x86: Call do_notify_resume() with interrupts enabled
    x86/div64: Add a micro-optimization shortcut if base is power of two
    x86-64: Cleanup some assembly entry points
    x86-64: Slightly shorten line system call entry and exit paths
    x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
    x86-64: Slightly shorten int_ret_from_sys_call
    x86, efi: Convert efi_phys_get_time() args to physical addresses
    x86: Default to vsyscall=emulate
    x86-64: Set siginfo and context on vsyscall emulation faults
    x86: consolidate xchg and xadd macros
    ...

    Linus Torvalds
     

06 Jan, 2012

1 commit

  • * 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    memblock: Reimplement memblock allocation using reverse free area iterator
    memblock: Kill early_node_map[]
    score: Use HAVE_MEMBLOCK_NODE_MAP
    s390: Use HAVE_MEMBLOCK_NODE_MAP
    mips: Use HAVE_MEMBLOCK_NODE_MAP
    ia64: Use HAVE_MEMBLOCK_NODE_MAP
    SuperH: Use HAVE_MEMBLOCK_NODE_MAP
    sparc: Use HAVE_MEMBLOCK_NODE_MAP
    powerpc: Use HAVE_MEMBLOCK_NODE_MAP
    memblock: Implement memblock_add_node()
    memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users
    memblock: Track total size of regions automatically
    powerpc: Cleanup memblock usage
    memblock: Reimplement memblock_enforce_memory_limit() using __memblock_remove()
    memblock: Make memblock functions handle overflowing range @size
    memblock: Reimplement __memblock_remove() using memblock_isolate_range()
    memblock: Separate out memblock_isolate_range() from memblock_set_node()
    memblock: Kill memblock_init()
    memblock: Kill sentinel entries at the end of static region arrays
    memblock: Add __memblock_dump_all()
    ...

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Just like the per-CPU ones they had several
    problems/shortcomings:

    Only the first memory operand was mentioned in the asm()
    operands, and the 2x64-bit version didn't have a memory clobber
    while the 2x32-bit one did. The former allowed the compiler to
    not recognize the need to re-load the data in case it had it
    cached in some register, while the latter was overly
    destructive.

    The types of the local copies of the old and new values were
    incorrect (the types of the pointed-to variables should be used
    here, to make sure the respective old/new variable types are
    compatible).

    The __dummy/__junk variables were pointless, given that local
    copies of the inputs already existed (and can hence be used for
    discarded outputs).

    The 32-bit variant of cmpxchg_double_local() referenced
    cmpxchg16b_local().

    At once also:

    - change the return value type to what it really is: 'bool'
    - unify 32- and 64-bit variants
    - abstract out the common part of the 'normal' and 'local' variants
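
    For illustration, a minimal user-space sketch along the same lines (not
    the kernel's actual macro; assumes x86-64 with cmpxchg16b, and that p1
    and p2 are adjacent and 16-byte aligned):

    #include <stdbool.h>
    #include <stdio.h>

    /* Both words are "+m" operands, so the compiler cannot keep stale
     * copies in registers and no blanket "memory" clobber is needed;
     * the result is a bool, as described above. */
    static inline bool cmpxchg_double_sketch(unsigned long *p1,
                                             unsigned long *p2,
                                             unsigned long o1,
                                             unsigned long o2,
                                             unsigned long n1,
                                             unsigned long n2)
    {
        bool ret;

        asm volatile("lock cmpxchg16b %2; sete %0"
                     : "=a" (ret), "+d" (o2), "+m" (*p1), "+m" (*p2)
                     : "a" (o1), "b" (n1), "c" (n2)
                     : "cc");
        return ret;
    }

    int main(void)
    {
        _Alignas(16) unsigned long pair[2] = { 1, 2 };

        printf("%d\n", cmpxchg_double_sketch(&pair[0], &pair[1],
                                             1, 2, 3, 4));    /* 1 */
        printf("%lu %lu\n", pair[0], pair[1]);                /* 3 4 */
        return 0;
    }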

    Signed-off-by: Jan Beulich
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/4F01F12A020000780006A19B@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

30 Dec, 2011

2 commits

  • If a huge page is enqueued under the protection of hugetlb_lock, then the
    operation is atomic and safe.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: [2.6.37+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 8aacc9f550 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is a
    slightly incorrect fix.

    Why? Consider the following case.

    1. map 4 pages of a file at offset 0

    [0123]

    2. map 2 pages just after the first mapping of the same file but with
    page offset 2

    [0123][23]

    3. mbind() 2 pages from the first mapping at offset 2.
    mbind_range() should treat the new vma as

    [0123][23]
    |23|
    mbind vma

    but it does

    [0123][23]
    |01|
    mbind vma

    Oops. It then performs a wrong vma merge and split ([01][0123] or similar).

    This patch fixes it.

    [testcase]
    test result - before the patch

    case4: 126: test failed. expect '2,4', actual '2,2,2'
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: 246: test failed. expect '4,2', actual '1,4'

    ------------[ cut here ]------------
    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC

    (snip long BUG_ON messages)

    test result - after the patch

    case4: passed
    case5: passed
    case6: passed
    case7: passed
    case8: passed
    case_n: passed

    source: mbind_vma_test.c
    ============================================================
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>
    #include <numaif.h>

    static unsigned long pagesize;
    void* mmap_addr;
    struct bitmask *nmask;
    char buf[1024];
    FILE *file;
    char retbuf[10240] = "";
    int mapped_fd;

    char *rubysrc = "ruby -e '\
        pid = %d; \
        vstart = 0x%llx; \
        vend = 0x%llx; \
        s = `pmap -q #{pid}`; \
        rary = []; \
        s.each_line {|line|; \
            ary=line.split(\" \"); \
            addr = ary[0].to_i(16); \
            if(vstart <= addr && addr < vend) then \
                rary.push(ary[1].to_i()/4); \
            end; \
        }; \
        print rary.join(\",\"); \
    '";

    void init(void)
    {
        void* addr;
        char buf[128];

        nmask = numa_allocate_nodemask();
        numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        sprintf(buf, "%s", "mbind_vma_XXXXXX");
        mapped_fd = mkstemp(buf);
        if (mapped_fd == -1)
            perror("mkstemp "), exit(1);
        unlink(buf);

        if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
            perror("lseek "), exit(1);
        if (write(mapped_fd, "\0", 1) < 0)
            perror("write "), exit(1);

        addr = mmap(NULL, pagesize*8, PROT_NONE,
                    MAP_SHARED, mapped_fd, 0);
        if (addr == MAP_FAILED)
            perror("mmap "), exit(1);

        if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
            perror("mprotect "), exit(1);

        mmap_addr = addr + pagesize;

        /* make page populate */
        memset(mmap_addr, 0, pagesize*6);
    }

    void fin(void)
    {
        void* addr = mmap_addr - pagesize;
        munmap(addr, pagesize*8);

        memset(buf, 0, sizeof(buf));
        memset(retbuf, 0, sizeof(retbuf));
    }

    void mem_bind(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_BIND, nmask->maskp, nmask->size, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void mem_interleave(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void mem_unbind(int index, int len)
    {
        int err;

        err = mbind(mmap_addr+pagesize*index, pagesize*len,
                    MPOL_DEFAULT, NULL, 0, 0);
        if (err)
            perror("mbind "), exit(err);
    }

    void Assert(char *expected, char *value, char *name, int line)
    {
        if (strcmp(expected, value) == 0) {
            fprintf(stderr, "%s: passed\n", name);
            return;
        }
        else {
            fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
                    name, line,
                    expected, value);
            // exit(1);
        }
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPNNNNNNNNNN
    case 4 below
    */
    void case4(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 4);
        mem_unbind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("2,4", retbuf, "case4", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPPPNNNNNN
    might become
    PPPPPPPPPPNN
    case 5 below
    */
    void case5(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case5", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPPPPP 6
    */
    void case6(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_bind(4, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("6", retbuf, "case6", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPPPPPXXXX 7
    */
    void case7(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_interleave(4, 2);
        mem_bind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case7", __LINE__);

        fin();
    }

    /*
    AAAA
    PPPPNNNNXXXX
    might become
    PPPPNNNNNNNN 8
    */
    void case8(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        mem_bind(0, 2);
        mem_interleave(4, 2);
        mem_interleave(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("2,4", retbuf, "case8", __LINE__);

        fin();
    }

    void case_n(void)
    {
        init();
        sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);

        /* make redundant mappings [0][1234][34][7] */
        mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
             MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);

        /* Expect to do nothing. */
        mem_unbind(2, 2);

        file = popen(buf, "r");
        fread(retbuf, sizeof(retbuf), 1, file);
        Assert("4,2", retbuf, "case_n", __LINE__);

        fin();
    }

    int main(int argc, char** argv)
    {
        case4();
        case5();
        case6();
        case7();
        case8();
        case_n();

        return 0;
    }
    =============================================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Caspar Zhang
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: [3.1.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

21 Dec, 2011

3 commits

  • Static storage is not required for the struct vmap_area in
    __get_vm_area_node.

    Remove "static" so the variable is stored on the stack instead.

    Signed-off-by: Kautuk Consul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • An integer overflow will happen on 64-bit archs if a task's sum of rss,
    swapents and nr_ptes exceeds the value (2^31)/1000. This was introduced
    by commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and is no
    longer computed as one expression in unsigned long (rss, swapents and
    nr_ptes are unsigned long), where the result value assigned to points
    (an int) is in the range (1..1000). So there could be an int overflow
    while computing

    176 points *= 1000;

    and points may have a negative value, meaning the oom score for a mem-hog
    task will be one:

    196 if (points <= 0)
    197 return 1;
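
    For illustration, a small stand-alone program (with a hypothetical page
    count) showing the same truncation and overflow:

    #include <stdio.h>

    int main(void)
    {
        /* ~8.6 GB resident: 2.2M pages of 4 KiB, > (2^31)/1000 */
        unsigned long pages = 2200000UL;
        int points = pages;     /* as in the kernel, points is an int */

        points *= 1000;         /* exceeds INT_MAX: signed overflow,
                                   typically wraps to a negative value */
        printf("points = %d\n", points);
        return 0;
    }
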
    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frantisek Hrbata
     
  • If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

16 Dec, 2011

1 commit

  • per_cpu_ptr_to_phys() incorrectly rounds up its result for the non-kmalloc
    case to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - the sysfs handler
    for the per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.
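
    In essence (a sketch; the real function also handles the kmalloc and
    first-chunk cases):

    /* before: page-aligned result, in-page offset lost */
    return page_to_phys(vmalloc_to_page(addr));

    /* after: keep the offset within the page */
    return page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr);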

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Cc: stable@kernel.org

    Eugene Surovegin
     

09 Dec, 2011

21 commits

  • Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
    they are fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to the
    freed memory is inserted into the vmlist, leading to a crash later in
    get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the previous
    error path, as a warning was already displayed by __vmalloc_area_node()
    before it called vfree in its failure path.
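
    The added check is essentially (a sketch; insert_vmalloc_vmlist() is the
    helper introduced by f5252e00):

    addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
    if (!addr)
        return NULL;    /* area already freed, warning already printed */

    /* only a fully populated area is made visible on the vmlist */
    insert_vmalloc_vmlist(area);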

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • setup_zone_migrate_reserve() expects zone->start_pfn to start at a
    pageblock_nr_pages-aligned pfn, otherwise we could access beyond an
    existing memblock, resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
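
    For illustration, a small stand-alone program showing the alignment step,
    using the unaligned start pfn from the oops above and an illustrative
    pageblock size:

    #include <stdio.h>

    #define pageblock_nr_pages 1024UL   /* illustrative value */
    #define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

    int main(void)
    {
        unsigned long start_pfn = 0x36ffe;  /* highstart_pfn above */

        /* align up before walking pageblocks, as the fix does */
        printf("%#lx\n", roundup(start_pfn, pageblock_nr_pages)); /* 0x37000 */
        return 0;
    }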

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Cc: [3.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Avoid unlocking an unlocked page if we failed to lock it.

    Signed-off-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
    page_tail->_count zero at all times. But the current kernel does not
    set page_tail->_count to zero if a 1GB page is utilized. So when an
    IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
    tail page's _count does not equal zero.

    kernel BUG at include/linux/mm.h:386!
    invalid opcode: 0000 [#1] SMP
    Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
    RIP gup_huge_pud+0xf2/0x159
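
    The fix is essentially (a sketch of the compound-page setup; the real
    change is in mm/page_alloc.c):

    void prep_compound_page(struct page *page, unsigned long order)
    {
        int i;
        int nr_pages = 1 << order;

        set_compound_page_dtor(page, free_compound_page);
        set_compound_order(page, order);
        __SetPageHead(page);
        for (i = 1; i < nr_pages; i++) {
            struct page *p = page + i;
            __SetPageTail(p);
            set_page_count(p, 0);   /* tail _count must start at zero */
            p->first_page = page;
        }
    }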

    Signed-off-by: Youquan Song
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Youquan Song
     
  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups. A
    try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
    tight loop too in case of the allocation failing repeatedly, and
    wait_event_freezable_timeout will provide it too.

    khugepaged would still freeze just fine by trying again the next minute
    but it's better if it freezes immediately.
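
    The sleep then becomes, in essence (a sketch):

    static void khugepaged_alloc_sleep(void)
    {
        /* freezer-aware: wakes up for the freezer instead of ignoring it */
        wait_event_freezable_timeout(khugepaged_wait, false,
            msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
    }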

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Use atomic-long operations instead of looping around cmpxchg().
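
    In essence (a sketch; field names follow the shrinker batching code):

    /* before: open-coded cmpxchg() loops */
    do {
        nr = shrinker->nr;
    } while (cmpxchg(&shrinker->nr, nr, nr + total_scan) != nr);

    /* after: plain atomic-long operations */
    nr = atomic_long_xchg(&shrinker->nr_in_batch, 0);
    atomic_long_add(total_scan, &shrinker->nr_in_batch);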

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A shrinker function can return -1, meaning that it cannot do anything
    without a risk of deadlock. For example prune_super() does this if it
    cannot grab a superblock reference, even if nr_to_scan=0. Currently we
    interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
    accordingly. So the next time around this shrinker can cause
    really big pressure. Let's skip such shrinkers instead.

    Also make total_scan signed, otherwise the check (total_scan < 0) below
    never works.
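
    The shrink_slab() change amounts to (a sketch):

    long max_pass = do_shrinker_shrink(shrinker, shrink, 0);
    if (max_pass <= 0)
        continue;   /* e.g. prune_super() returned -1: skip this shrinker */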

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Now that all early memory information is in memblock when enabled, we
    can implement a reverse free area iterator and use it to implement a
    NUMA aware allocator, which is then wrapped for simpler variants,
    instead of the confusing and inefficient mending of information in a
    separate NUMA aware allocator.

    Implement for_each_free_mem_range_reverse(), use it to reimplement
    memblock_find_in_range_node() which in turn is used by all allocators.

    The visible allocator interface is inconsistent and can probably use
    some cleanup too.
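
    The reimplemented allocator core looks roughly like this (a sketch of
    the top-down scan):

    phys_addr_t __init memblock_find_in_range_node(phys_addr_t start,
                                                   phys_addr_t end,
                                                   phys_addr_t size,
                                                   phys_addr_t align, int nid)
    {
        phys_addr_t this_start, this_end, cand;
        u64 i;

        /* scan free ranges top-down and take the highest fit */
        for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
            this_start = clamp(this_start, start, end);
            this_end = clamp(this_end, start, end);

            if (this_end < size)
                continue;

            cand = round_down(this_end - size, align);
            if (cand >= this_start)
                return cand;
        }
        return 0;
    }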

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately a one-to-one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • Implement memblock_add_node() which can add a new memblock memory
    region with a specific node ID.
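
    Usage is, for example (with illustrative addresses and sizes):

    /* register two early-memory ranges with their home nodes */
    memblock_add_node(0x00000000, 0x20000000, 0);   /* 512 MB on node 0 */
    memblock_add_node(0x20000000, 0x20000000, 1);   /* 512 MB on node 1 */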

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • The only remaining function of memblock_analyze() is allowing resize of
    the memblock region arrays. Rename it to memblock_allow_resize() and
    update its users.

    * The following users remain the same other than renaming.

    arm/mm/init.c::arm_memblock_init()
    microblaze/kernel/prom.c::early_init_devtree()
    powerpc/kernel/prom.c::early_init_devtree()
    openrisc/kernel/prom.c::early_init_devtree()
    sh/mm/init.c::paging_init()
    sparc/mm/init_64.c::paging_init()
    unicore32/mm/init.c::uc32_memblock_init()

    * In the following users, analyze was used to update total size which
    is no longer necessary.

    powerpc/kernel/machine_kexec.c::reserve_crashkernel()
    powerpc/kernel/prom.c::early_init_devtree()
    powerpc/mm/init_32.c::MMU_init()
    powerpc/mm/tlb_nohash.c::__early_init_mmu()
    powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
    powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
    sh/kernel/machine_kexec.c::reserve_crashkernel()

    * x86/kernel/e820.c::memblock_x86_fill() was directly setting
    memblock_can_resize before populating memblock and calling analyze
    afterwards. Call memblock_allow_resize() before starting to populate.

    memblock_can_resize is now static inside memblock.c.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • The total size of memory regions was calculated by memblock_analyze(),
    requiring the function to be called explicitly between operations which
    can change memory regions and any possible user of the total size, which
    is cumbersome and fragile.

    This patch makes each memblock_type track its total size automatically,
    with minor modifications to the memblock manipulation functions, and
    removes the requirement to call memblock_analyze().
    [__]memblock_dump_all() now also dumps the total size of reserved regions.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • With recent updates, the basic memblock operations are robust enough
    that there's no reason for memblock_enforce_memory_limit() to directly
    manipulate the memblock region arrays. Reimplement it using
    __memblock_remove().
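
    The reimplementation is roughly (a sketch):

    void __init memblock_enforce_memory_limit(phys_addr_t limit)
    {
        phys_addr_t max_addr = (phys_addr_t)ULLONG_MAX;
        unsigned long i;

        /* find the address corresponding to @limit */
        for (i = 0; i < memblock.memory.cnt; i++) {
            struct memblock_region *r = &memblock.memory.regions[i];

            if (limit <= r->size) {
                max_addr = r->base + limit;
                break;
            }
            limit -= r->size;
        }

        /* truncate both memory and reserved regions above it */
        __memblock_remove(&memblock.memory, max_addr,
                          (phys_addr_t)ULLONG_MAX);
        __memblock_remove(&memblock.reserved, max_addr,
                          (phys_addr_t)ULLONG_MAX);
    }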

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Allow memblock users to specify a range where @base + @size overflows
    and automatically cap it at the maximum. This makes the interface more
    robust and makes specifying till-the-end-of-memory easier.
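
    For illustration, a stand-alone sketch of the capping:

    #include <stdio.h>

    typedef unsigned long long phys_addr_t;
    #define PHYS_ADDR_MAX (~(phys_addr_t)0)

    /* clamp @size so that base + size cannot wrap around */
    static phys_addr_t cap_size(phys_addr_t base, phys_addr_t size)
    {
        return size < PHYS_ADDR_MAX - base ? size : PHYS_ADDR_MAX - base;
    }

    int main(void)
    {
        /* "till the end of memory": just pass the maximum size */
        phys_addr_t base = 0x100000;
        phys_addr_t size = cap_size(base, PHYS_ADDR_MAX);

        printf("end = %#llx\n", base + size);   /* no overflow */
        return 0;
    }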

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • __memblock_remove()'s open-coded region manipulation can be trivially
    replaced with memblock_isolate_range(). This increases code sharing
    and eases improving region tracking.

    This pulls memblock_isolate_range() out of HAVE_MEMBLOCK_NODE_MAP.
    Make it use memblock_get_region_node() instead of assuming rgn->nid is
    available.

    -v2: Fixed build failure on !HAVE_MEMBLOCK_NODE_MAP caused by direct
    rgn->nid access.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_set_node() operates in three steps - break regions crossing
    boundaries, set nid and merge back regions. This patch separates the
    first part into a separate function - memblock_isolate_range(), which
    breaks regions crossing range boundaries and returns range index range
    for regions properly contained in the specified memory range.

    This doesn't introduce any behavior change and will be used to further
    unify region handling.
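
    With it, memblock_set_node() becomes roughly (a sketch of the three
    steps):

    int __init memblock_set_node(phys_addr_t base, phys_addr_t size, int nid)
    {
        struct memblock_type *type = &memblock.memory;
        int start_rgn, end_rgn, i, ret;

        /* 1. break regions crossing the range boundaries */
        ret = memblock_isolate_range(type, base, size,
                                     &start_rgn, &end_rgn);
        if (ret)
            return ret;

        /* 2. set nid on the now fully contained regions */
        for (i = start_rgn; i < end_rgn; i++)
            type->regions[i].nid = nid;

        /* 3. merge back neighbours that became compatible */
        memblock_merge_regions(type);
        return 0;
    }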

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_init() initializes arrays for regions and memblock itself;
    however, all these can be done with struct initializers and
    memblock_init() can be removed. This patch kills memblock_init() and
    initializes memblock with struct initializer.

    The only difference is that the first dummy entries don't have .nid
    set to MAX_NUMNODES initially. This doesn't cause any behavior
    difference.
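
    The initializer looks roughly like this (a sketch):

    struct memblock memblock __initdata_memblock = {
        .memory.regions     = memblock_memory_init_regions,
        .memory.cnt         = 1,    /* empty dummy entry */
        .memory.max         = INIT_MEMBLOCK_REGIONS,

        .reserved.regions   = memblock_reserved_init_regions,
        .reserved.cnt       = 1,    /* empty dummy entry */
        .reserved.max       = INIT_MEMBLOCK_REGIONS,

        .current_limit      = MEMBLOCK_ALLOC_ANYWHERE,
    };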

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Michal Simek
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: "H. Peter Anvin"

    Tejun Heo
     
  • memblock no longer depends on having one more entry at the end during
    addition, making the sentinel entries at the end of the region arrays
    not too useful. Remove the sentinels. This eases further updates.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Add __memblock_dump_all() which dumps memblock configuration whether
    memblock_debug is enabled or not.

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • Make memblock_double_array(), __memblock_alloc_base() and
    memblock_alloc_nid() use memblock_reserve() instead of calling
    memblock_add_region() with reserved array directly. This eases
    debugging and updates to memblock_add_region().

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     
  • memblock_{add|remove|free|reserve}() return either 0 or -errno but had
    long as the return type. Change it to int. Also, drop 'extern' from all
    prototypes in memblock.h - they are unnecessary and used
    inconsistently (especially if mm.h is included in the picture).

    Signed-off-by: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu

    Tejun Heo
     

08 Dec, 2011

3 commits

  • Some traces show lots of bdi_dirty=0 lines, where bdi_dirty would
    actually be some small value if not for the accounting errors in the
    per-cpu bdi stats.

    In this case the max pause time should really be set to the smallest
    (non-zero) value to avoid IO queue underrun and improve throughput.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • On a system with 1 local mount and 1 NFS mount, if the NFS server
    stops responding while dd is writing to the NFS mount, the NFS dirty
    pages may exceed the global dirty limit and _every_ task involving
    writing will be blocked. The whole system appears unresponsive.

    The workaround is to let through the bdi's that have only a small
    number of dirty pages. The number chosen (bdi_stat_error pages) is not
    enough to enable the local disk to run at optimal throughput, however it
    is enough to make the system responsive on a broken NFS mount. The user
    can then kill the dirtiers on the NFS mount and increase the global dirty
    limit to bring up the local disk's throughput.

    It risks allowing dirty pages to grow much larger than the global dirty
    limit when there are 1000+ mounts, however that's very unlikely to happen,
    especially in low memory profiles.
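
    The pass-through amounts to (a sketch of the check in
    balance_dirty_pages()):

    /*
     * bdi_dirty is within the accounting error of zero: let this
     * bdi through so a broken NFS mount cannot stall local writers.
     */
    if (bdi_dirty <= bdi_stat_error(bdi))
        break;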

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • We do "floating proportions" to let active devices to grow its target
    share of dirty pages and stalled/inactive devices to decrease its target
    share over time.

    It works well except in the case of "an inactive disk suddenly goes
    busy", where the initial target share may be too small. To mitigate
    this, bdi_position_ratio() has the below line to raise a small
    bdi_thresh when it's safe to do so, so that the disk be feed with enough
    dirty pages for efficient IO and in turn fast rampup of bdi_thresh:

    bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

    balance_dirty_pages() normally does negative feedback control which
    adjusts ratelimit to balance the bdi dirty pages around the target.
    In some extreme cases when that is not enough, it will have to block
    the tasks completely until the bdi dirty pages drop below bdi_thresh.

    Acked-by: Jan Kara
    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

05 Dec, 2011

1 commit

  • Commit 30765b92 ("slab, lockdep: Annotate the locks before using
    them") moves the init_lock_keys() call from after g_cpucache_up =
    FULL to before it, and overlooks the fact that init_node_lock_keys()
    tests for it and ignores everything !FULL.

    Introduce a LATE stage and change the lockdep test to be < LATE.

    Cc: Pekka Enberg
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Dec, 2011

1 commit

  • Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly to avoid it quickly dirtying lots more pages
    at an unthrottled rate.

    This patch makes write interruptible by SIGKILL. We do not allow write
    to be interruptible by any other signal because that has a larger
    potential of breaking some badly written applications.
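
    The core of the change is (a sketch of the check added to the buffered
    write loop):

    if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
    }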

    Reported-by: Kazuya Mio
    Tested-by: Kazuya Mio
    Acked-by: Matthew Wilcox
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
     

29 Nov, 2011

1 commit