03 Oct, 2008

3 commits

  • When we initialise a compound page we initialise the page flags and head
    page pointer for all base pages spanned by that page. When we initialise
    a gigantic page (a page of order greater than or equal to MAX_ORDER) we
    have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we
    assume that all elements of the mem_map for this page are contiguous in
    memory. However, this is only guaranteed out to MAX_ORDER_NR_PAGES
    pages, and with SPARSEMEM enabled they will not be contiguous. This
    leads us to walk off the end of the first section and scribble on
    everything which follows, BAD.

    When we reach a MAX_ORDER_NR_PAGES boundary we must locate the next
    section of the mem_map. As gigantic pages can only be maximally aligned
    we know this will occur at exact multiples of MAX_ORDER_NR_PAGES pages
    from the start of the page.
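
    A minimal sketch of the initialisation walk this implies (illustrative,
    not the verbatim patch): within a MAX_ORDER block the struct pages are
    adjacent, but at each MAX_ORDER_NR_PAGES boundary the pointer must be
    re-derived from the pfn so that the next mem_map section is found.

    static void prep_gigantic_page(struct page *page, unsigned long order)
    {
            int i;
            int nr_pages = 1 << order;
            struct page *p = page + 1;

            __SetPageHead(page);
            for (i = 1; i < nr_pages; i++) {
                    __SetPageTail(p);
                    p->first_page = page;
                    /* contiguous within a MAX_ORDER block ... */
                    if ((i + 1) & (MAX_ORDER_NR_PAGES - 1))
                            p++;
                    /* ... otherwise re-find the next section via the pfn */
                    else
                            p = pfn_to_page(page_to_pfn(page) + i + 1);
            }
    }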

    This is a bug fix for the gigantic page support in hugetlbfs.

    Credit to Mel Gorman for spotting the issue.

    Signed-off-by: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Jon Tollefson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm:
    tiny-shmem fix lock ordering: mmap_sem vs i_mutex"), which fixed the
    lock ordering in tiny-shmem, breaks shared anonymous and IPC memory on
    NOMMU architectures, because tiny-shmem was using the expanding truncate
    to signal ramfs to allocate a physically contiguous block of RAM backing
    the inode (without which the inode is unusable for "memory mapping" it
    to userspace).

    However, do_truncate is what caused the lock ordering error, since it
    takes i_mutex. In this case we can actually just call ramfs directly to
    allocate memory for the mapping, rather than going via truncate.
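
    A sketch of the change in tiny-shmem's shmem_file_setup() (per the
    description above; ramfs_nommu_expand_for_mapping is the ramfs entry
    point that sizes the contiguous backing store):

    /* Sketch: allocate the contiguous backing store directly instead of
     * signalling ramfs through an expanding do_truncate(). */
    #ifndef CONFIG_MMU
            error = ramfs_nommu_expand_for_mapping(inode, size);
            if (error)
                    goto close_file;
    #endif
            inode->i_size = size;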

    Acked-by: David Howells
    Acked-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment
    saying that the caller must hold zone->lock. But the only caller of that
    function, test_pages_isolated(), does not hold zone->lock, nor is the
    lock acquired anywhere earlier in the call chain. This patch adds the
    missing zone->lock to test_pages_isolated().

    We reproducibly run into BUG_ON(!PageBuddy(page)) in
    __offline_isolated_pages() during memory hotplug stress tests, see the
    trace below. This patch fixes that problem; it would be good if we could
    have it in 2.6.27.

    kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561!
    illegal operation: 0001 [#1] PREEMPT SMP
    Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur
    CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1
    Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28)
    Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4)
    R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
    Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00
    00000000 00000400 00014000 7e27fc01
    00606f00 7e27fc00 00013fe0 2b10dd28
    00000005 80172662 801727b2 2b10dd28
    Krnl Code: 801727de: 5810900c l %r1,12(%r9)
    801727e2: a7f4ffb3 brc 15,80172748
    801727e6: a7f40001 brc 15,801727e8
    >801727ea: a7f4ffbc brc 15,80172762
    801727ee: a7f40001 brc 15,801727f0
    801727f2: a7f4ffaf brc 15,80172750
    801727f6: 0707 bcr 0,%r7
    801727f8: 0017 unknown
    Call Trace:
    ([] __offline_isolated_pages+0x116/0x1c4)
    [] offline_isolated_pages_cb+0x22/0x34
    [] walk_memory_resource+0xcc/0x11c
    [] offline_pages+0x36a/0x498
    [] remove_memory+0x36/0x44
    [] memory_block_change_state+0x112/0x150
    [] store_mem_state+0x90/0xe4
    [] sysdev_store+0x34/0x40
    [] sysfs_write_file+0xd0/0x178
    [] vfs_write+0x74/0x118
    [] sys_write+0x46/0x7c
    [] sysc_do_restart+0x12/0x16
    [] 0x77f3e8ca
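
    The shape of the fix, as a sketch of the locking added to
    test_pages_isolated():

    /* Sketch: take zone->lock around the pageblock check, as the comment
     * on __test_page_isolated_in_pageblock() requires. */
    zone = page_zone(pfn_to_page(start_pfn));
    spin_lock_irqsave(&zone->lock, flags);
    ret = __test_page_isolated_in_pageblock(start_pfn, end_pfn);
    spin_unlock_irqrestore(&zone->lock, flags);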

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

29 Sep, 2008

1 commit

  • There's a race between mm->owner assignment and swapoff, more easily
    seen when task slab poisoning is turned on. The condition occurs when
    try_to_unuse() runs in parallel with an exiting task. A similar race
    can occur with callers of get_task_mm(), such as /proc/<pid>/ users,
    ptrace, or page migration.

    CPU0                                 CPU1
    ----                                 ----
    try_to_unuse
      looks at mm = task0->mm
      increments mm->mm_users
                                         task 0 exits
                                         mm->owner needs to be updated, but
                                         no new owner is found (mm_users > 1,
                                         but no other task has
                                         task->mm = task0->mm)
                                         mm_update_next_owner() leaves
    mmput(mm) decrements mm->mm_users
                                         task0 freed
    dereferencing mm->owner fails

    The fix is to notify the subsystem via the mm_owner_changed callback if
    no new owner is found, passing NULL as the new task.

    Jiri Slaby:
    mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(),
    but it must be set afterwards, so as not to pass NULL as the old owner
    and cause an oops.

    Daisuke Nishimura:
    mm_update_next_owner() may set mm->owner to NULL, but
    mem_cgroup_from_task() and its callers need to take account of this
    situation to avoid an oops.

    Hugh Dickins:
    I saw a lockdep warning and a hang below exec_mmap() when testing these
    patches. exit_mm() up_reads mmap_sem before calling
    mm_update_next_owner(), so exec_mmap() now needs to do the same. And
    with that repositioning, there's now no point in mm_need_new_owner()
    allowing for a NULL mm.
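
    Putting the pieces together, the end of mm_update_next_owner() would
    look roughly like this sketch (incorporating Jiri's ordering fix:
    notify first, with the old owner still in place, then clear it; the
    new_owner_found flag is illustrative):

    /* Sketch: no new owner exists for this mm */
    if (!new_owner_found) {
            /* notify cgroups while mm->owner still names the old task */
            cgroup_mm_owner_callbacks(mm->owner, NULL);
            mm->owner = NULL;
            return;
    }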

    Reported-by: Hugh Dickins
    Signed-off-by: Balbir Singh
    Signed-off-by: Jiri Slaby
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

23 Sep, 2008

2 commits

  • The current memory cgroup (both in mainline and -mm) doesn't account
    swap caches as memory (swap cache support has been dropped temporarily).

    So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that
    have been moved to swap cache.

    But this makes mem_cgroup_shrink_usage fail easily if most of the pages
    are anon/shmem, and then shmem_getpage returns -ENOMEM and the process
    gets killed.

    This patch adds res_counter_check_under_limit to avoid these cases.

    BTW, even if swap cache support is enabled again, if a process is moved
    to another cgroup, which has just been created, between precharge and
    shrink_usage in shmem_getpage, shrink_usage may fail just because there
    are no pages to reclaim.

    So this change would make sense anyway.
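
    A sketch of the resulting retry loop in mem_cgroup_shrink_usage()
    (details of the surrounding code are assumptions):

    /* Sketch: being under the limit already counts as progress, so the
     * loop no longer fails just because nothing was reclaimable. */
    do {
            progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
            progress += res_counter_check_under_limit(&mem->res);
    } while (!progress && --retry);

    if (!retry)
            return -ENOMEM;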

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • tiny-shmem calls do_truncate in shmem_file_setup. do_truncate takes
    i_mutex, and shmem_file_setup is called with mmap_sem held. However
    i_mutex nests outside mmap_sem.

    Copy the code in shmem.c to avoid this problem.
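
    The shape of the change, copied from shmem.c (sketch):

    /* before: takes i_mutex under mmap_sem -- the wrong lock order */
    error = do_truncate(dentry, size, 0, file);

    /* after: set the size directly, as shmem.c does, avoiding i_mutex */
    inode->i_size = size;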

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Nick Piggin
    Reported-and-tested-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Matt Mackall
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

14 Sep, 2008

1 commit

  • The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor
    when scanning zonelists to keep track of where in the zonelist it is.
    The zoneref that is returned corresponds to the next zone that is to be
    scanned, not the current one. It was intended to be treated as an
    opaque list.

    When the page allocator is scanning a zonelist, it marks elements in the
    zonelist corresponding to zones that are temporarily full. As the
    zonelist is being updated, it uses the cursor here:

    if (NUMA_BUILD)
            zlc_mark_zone_full(zonelist, z);

    This is intended to prevent rescanning in the near future, but the
    zoneref cursor does not correspond to the zone that has been found to
    be full. This is an easy misunderstanding to make, so this patch
    corrects the problem by changing the zoneref cursor to be the current
    zone being scanned instead of the next one.
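
    With the fix, the cursor refers to the zone the iterator just returned,
    so the allocator's fullness marking does what it intends (sketch;
    try_to_allocate_from() is an illustrative stand-in for the real
    allocation attempt):

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
            page = try_to_allocate_from(zone);
            if (page)
                    return page;
            if (NUMA_BUILD)
                    /* z now names the zone we actually found full */
                    zlc_mark_zone_full(zonelist, z);
    }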

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: KAMEZAWA Hiroyuki
    Cc: [2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Sep, 2008

1 commit

  • Anonymous mappings should ignore the file offset, but shared anonymous
    mappings forgot to clear it, which makes the following legitimate test
    program trigger SIGBUS.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
            char *p;
            int i;

            p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            for (i = 0; i < 2; i++) {
                    printf("page %d\n", i);
                    p[i * 4096] = i;
            }
            return 0;
    }

    Fix it.
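
    The fix itself is a one-liner in the mmap path (a sketch matching the
    description; the surrounding context in do_mmap_pgoff() is assumed):

    case MAP_SHARED:
            /* an anonymous mapping has no backing file: any offset from
             * userspace must be ignored, not used to index the object */
            pgoff = 0;
            break;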

    Signed-off-by: Tejun Heo
    Acked-by: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

03 Sep, 2008

4 commits

  • Quicklists store pages as per-CPU caches. (Each CPU can cache
    node_free_pages/16 pages.)

    They are used as a page table cache: exit() grows the cache, while
    fork() consumes it.

    So, for example, if an apache-style application runs (one parent, many
    children), one CPU will mostly run fork() while another CPU runs the
    middleware work and exit().

    In that situation, the CPU on which the parent runs has no page table
    cache at all, while the others (on which the children run) hold maximal
    caches.

    QList_max = (#ofCPUs - 1) x Free / 16
    => QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

    So how much quicklist memory is used in the worst case?

    It is proportional to the number of CPUs, because the per-cpu quicklist
    cache limit does not take the number of CPUs into account.

    The above calculation gives:

    Number of CPUs per node            2      4      8     16
    ==============================   =====  =====  =====  =====
    QList_max / (Free + QList_max)   5.8%   16%    30%    48%

    Wow! Quicklists can consume about 50% of memory in the worst case.

    My demonstration program is here
    --------------------------------------------------------------------------------
    #define _GNU_SOURCE

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/mman.h>

    #define BUFFSIZE 512

    int max_cpu(void) /* get max number of logical cpus from /proc/cpuinfo */
    {
            FILE *fd;
            char *ret, buffer[BUFFSIZE];
            int cpu = 1;

            fd = fopen("/proc/cpuinfo", "r");
            if (fd == NULL) {
                    perror("fopen(/proc/cpuinfo)");
                    exit(EXIT_FAILURE);
            }
            while (1) {
                    ret = fgets(buffer, BUFFSIZE, fd);
                    if (ret == NULL)
                            break;
                    if (!strncmp(buffer, "processor", 9))
                            cpu = atoi(strchr(buffer, ':') + 2);
            }
            fclose(fd);
            return cpu;
    }

    void cpu_bind(int cpu) /* bind current process to one cpu */
    {
            cpu_set_t mask;
            int ret;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            ret = sched_setaffinity(0, sizeof(mask), &mask);
            if (ret == -1) {
                    perror("sched_setaffinity()");
                    exit(EXIT_FAILURE);
            }
            sched_yield(); /* not necessary */
    }

    #define MMAP_SIZE (10 * 1024 * 1024) /* 10 MB */
    #define FORK_INTERVAL 1 /* 1 second */

    int main(int argc, char *argv[])
    {
            int cpu_max, nextcpu;
            long pagesize;
            pid_t pid;

            /* set max number of logical cpus */
            if (argc > 1)
                    cpu_max = atoi(argv[1]) - 1;
            else
                    cpu_max = max_cpu();

            /* get the page size */
            pagesize = sysconf(_SC_PAGESIZE);
            if (pagesize == -1) {
                    perror("sysconf(_SC_PAGESIZE)");
                    exit(EXIT_FAILURE);
            }

            /* prepare parent process */
            cpu_bind(0);
            nextcpu = cpu_max;

    loop:
            /* select destination cpu for child process by round-robin rule */
            if (++nextcpu > cpu_max)
                    nextcpu = 1;

            pid = fork();

            if (pid == 0) { /* child action */
                    char *p;
                    int i;

                    /* consume page tables */
                    p = mmap(0, MMAP_SIZE, PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
                    i = MMAP_SIZE / pagesize;
                    while (i-- > 0) {
                            *p = 1;
                            p += pagesize;
                    }

                    /* move to other cpu */
                    cpu_bind(nextcpu);
                    /*
                    printf("a child moved to cpu%d after mmap().\n", nextcpu);
                    fflush(stdout);
                    */

                    /* give page tables back to pgtable_quicklist */
                    exit(0);

            } else if (pid > 0) { /* parent action */
                    sleep(FORK_INTERVAL);
                    waitpid(pid, NULL, WNOHANG);
            }

            goto loop;
    }
    ----------------------------------------

    When the above program, which does task migration, runs, my 8GB box
    spends 800MB of memory on quicklists. This is not a memory leak, but it
    doesn't seem good.

    % cat /proc/meminfo

    MemTotal: 7701568 kB
    MemFree: 4724672 kB
    (snip)
    Quicklists: 844800 kB

    because

    - My machine's spec is
      number of numa nodes: 2
      number of cpus: 8 (4 CPUs x 2 nodes)
      total mem: 8GB (4GB x 2 nodes)
      free mem: about 5GB

    - Then, 4.7GB x 16% ~= 880MB.
      So quicklists can use 800MB.

    So if a machine with the following spec runs that program

    CPUs: 64 (8 CPUs x 8 nodes)
    Mem: 1TB (128GB x 8 nodes)

    then quicklists can waste 300GB (= 1TB x 30%). That is far too much.

    So I don't like cache policies which are proportional to the number of
    CPUs.

    My patch changes the per-cpu cache amount
    from:
        per-cpu-cache-amount = memory_on_node / 16
    to:
        per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node
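
    A sketch of the new per-cpu limit (helper and field names here are
    illustrative rather than the verbatim patch):

    /* Sketch: divide the per-node budget by the number of CPUs on the
     * node, so the total quicklist cache no longer grows with CPU count. */
    static unsigned long max_pages(unsigned long min_pages)
    {
            int node = numa_node_id();
            unsigned long node_free = node_page_state(node, NR_FREE_PAGES);
            unsigned long max_free = node_free / 16;
            int node_cpus = nr_cpus_node(node);

            if (node_cpus > 1)
                    max_free /= node_cpus;
            return max_free > min_pages ? max_free : min_pages;
    }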

    Signed-off-by: KOSAKI Motohiro
    Cc: Keiichiro Tokunaga
    Acked-by: Christoph Lameter
    Tested-by: David Miller
    Acked-by: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
    The variable contig_page_data references
    the variable __initdata bootmem_node_data
    If the reference is valid then annotate the
    variable with __init* (see linux/init.h) or name the variable:
    *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

    Signed-off-by: Marcin Slusarz
    Cc: Johannes Weiner
    Cc: Sean MacLennan
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • Dio write returns EIO when try_to_release_page fails because bh is
    still referenced.

    The patch

    commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
    Author: Mingming Cao
    Date: Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction

    was merged into 2.6.27-rc1, but I noticed that this patch is not enough
    to fix the race.

    I ran heavy fsstress tests against 2.6.27-rc1 and found that dio writes
    still sometimes got EIO during the test.

    The patch above fixed the race between freeing a buffer (dio) and
    committing a transaction (jbd), but I discovered that there is another
    race, between freeing a buffer (dio) and ext3/4_ordered_writepage:

    background_writeout()
      ->write_cache_pages()
        ->ext3_ordered_writepage()
               walk_page_buffers() -> take a bh ref
               block_write_full_page() -> unlock_page
                       (dio: try_to_release_page fails)
               walk_page_buffers() -> release a bh ref

    ext3_ordered_writepage unlocks the page while still holding a bh
    reference, which causes the race and the failure of
    try_to_release_page.

    To fix this race, I used the approach of falling back to buffered
    writes if try_to_release_page() fails on a page.
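
    In outline, the direct-IO write path now treats a busy page as a cue to
    fall back rather than as an error (a sketch; the exact error-code
    plumbing is an assumption):

    /* Sketch: if invalidation fails because a page could not be released
     * (bh still referenced by the writepage path), fall back to buffered
     * writes instead of returning EIO to userspace. */
    err = invalidate_inode_pages2_range(mapping,
                    offset >> PAGE_CACHE_SHIFT, end);
    if (err == -EBUSY)
            return 0;       /* caller retries via the buffered path */
    if (err)
            return err;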

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Hisashi Hifumi
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Mingming Cao
    Cc: Zach Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     
  • I have gotten to the root cause of the hugetlb badness I reported back on
    August 15th. My system has the following memory topology (note the
    overlapping node):

    Node 0 Memory: 0x8000000-0x44000000
    Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

    setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
    for a pageblock to move onto the MIGRATE_RESERVE list. Finding no
    candidates, it happily continues the scan into 0x8000000-0x44000000. When
    a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
    the wrong zone. Oops.

    setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
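
    A sketch of the fix in setup_zone_migrate_reserve()'s pageblock scan:

    for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
            if (!pfn_valid(pfn))
                    continue;
            page = pfn_to_page(pfn);

            /* Watch out for overlapping nodes */
            if (page_to_nid(page) != zone_to_nid(zone))
                    continue;

            /* ... existing MIGRATE_RESERVE candidate logic ... */
    }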

    Signed-off-by: Adam Litke
    Acked-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

29 Aug, 2008

1 commit

  • * master.kernel.org:/home/rmk/linux-2.6-arm:
    [ARM] 5226/1: remove unmatched comment end.
    [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo
    [ARM] use bcd2bin/bin2bcd
    [ARM] use the new byteorder headers
    [ARM] OMAP: Fix 2430 SMC91x ethernet IRQ
    [ARM] OMAP: Add and update OMAP default configuration files
    [ARM] OMAP: Change mailing list for OMAP in MAINTAINERS
    [ARM] S3C2443: Fix the S3C2443 clock register definitions
    [ARM] JIVE: Fix the spi bus numbering
    [ARM] S3C24XX: pwm.c: stop debugging output
    [ARM] S3C24XX: Fix sparse warnings in pwm.c
    [ARM] S3C24XX: Fix spare errors in pwm-clock driver
    [ARM] S3C24XX: Fix sparse warnings in arch/arm/plat-s3c24xx/gpiolib.c
    [ARM] S3C24XX: Fix nor-simtec driver sparse errors
    [ARM] 5225/1: zaurus: Register I2C controller for audio codecs
    [ARM] orion5x: update defconfig to v2.6.27-rc4
    [ARM] Orion: register UART1 on QNAP TS-209 and TS-409
    [ARM] Orion: activate lm75 driver on DNS-323
    [ARM] Orion: fix MAC detection on QNAP TS-209 and TS-409
    [ARM] Orion: Fix boot crash on Kurobox Pro

    Linus Torvalds
     

28 Aug, 2008

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Disable NUMA remote node defragmentation by default

    Linus Torvalds
     
  • Ordinarily, memory holes in flatmem still have a valid memmap and it is
    safe to use. However, an architecture (ARM) frees up the memmap backing
    memory holes on the assumption it is never used. /proc/pagetypeinfo
    reads the whole range of pages in a zone, believing that the memmap is
    valid and that pfn_valid will return false if it is not. On ARM, freeing
    the memmap breaks the page->zone linkages even though pfn_valid()
    returns true, and the kernel can oops shortly afterwards due to
    accessing a bogus struct zone *.

    This patch lets architectures say when FLATMEM can have holes in the
    memmap. Rather than an expensive check for valid memory,
    /proc/pagetypeinfo will confirm that the page linkages are still valid
    by checking that page->zone is still the expected zone. The lookup of
    page_zone is safe as there is a limited range of memory that is accessed
    when calling page_zone. Even if page_zone happens to return the correct
    zone for a bogus memmap entry, the impact is only that the counters in
    /proc/pagetypeinfo are slightly off, and fragmentation monitoring is
    unlikely to be relevant on an embedded system.
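
    A sketch of the linkage check /proc/pagetypeinfo can perform before
    trusting a memmap entry (the helper name is illustrative):

    /* Sketch: confirm the memmap entry really describes this pfn and
     * still links back to the expected zone. */
    static inline int memmap_valid_within(unsigned long pfn,
                                          struct page *page,
                                          struct zone *zone)
    {
            if (page_to_pfn(page) != pfn)
                    return 0;
            if (page_zone(page) != zone)
                    return 0;
            return 1;
    }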

    Reported-by: H Hartley Sweeten
    Signed-off-by: Mel Gorman
    Tested-by: H Hartley Sweeten
    Signed-off-by: Russell King

    Mel Gorman
     

21 Aug, 2008

8 commits

  • XIP can call into get_xip_mem concurrently for the same (file, offset)
    pair with create=1. This usually maps down to get_block, which expects
    the page lock to prevent such a situation. This causes ext2 to explode
    for one reason or another.

    Serialise those calls for the moment. For common usages today, I suspect
    get_xip_mem rarely is called to create new blocks. In future as XIP
    technologies evolve we might need to look at which operations require
    scalability, and rework the locking to suit.

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Acked-by: Carsten Otte
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • XIP has a race between sparse pages being inserted into page tables and
    sparse pages being zapped when it's time to put a non-sparse page in.

    What can happen is that a process is left with a dangling sparse page
    in a MAP_SHARED mapping, while the rest of the world sees the
    non-sparse version. I.e. data corruption.

    Guard these operations with a seqlock, making fault-in-sparse-pages the
    slowpath, and try-to-unmap-sparse-pages the fastpath.
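
    A sketch of the seqcount pattern this describes (names illustrative):

    static seqcount_t xip_sparse_seq = SEQCNT_ZERO;

    /* slowpath: faulting in a sparse (zero) page */
    again:
            seq = read_seqcount_begin(&xip_sparse_seq);
            /* look up / insert the sparse page here */
            if (read_seqcount_retry(&xip_sparse_seq, seq))
                    goto again;     /* raced with an unmap, start over */

    /* fastpath: zapping sparse ptes before installing the real page */
            write_seqcount_begin(&xip_sparse_seq);
            /* unmap the sparse page from all mms here */
            write_seqcount_end(&xip_sparse_seq);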

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Acked-by: Carsten Otte
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There is a race with dirty page accounting where a page may not properly
    be accounted for.

    clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

    page_mkclean walks the rmaps for that page, and for each one it cleans and
    write protects the pte if it was dirty. It uses page_check_address to
    find the pte. That function has a shortcut to avoid the ptl if the pte
    is not present. Unfortunately, the pte can be switched to not-present
    and then back to present by other code holding the page table lock --
    this should not be a signal for page_mkclean to ignore that pte,
    because it may be dirty.

    For example, powerpc64's set_pte_at will clear a previously present pte
    before setting it to the desired value. There may also be other code in
    core mm or in arch which does similar things.

    The consequence of the bug is loss of data integrity due to msync, and
    loss of dirty page accounting accuracy. XIP's __xip_unmap could easily
    also be unreliable (depending on the exact XIP locking scheme), which can
    lead to data corruption.

    Fix this by having an option to always take ptl to check the pte in
    page_check_address.

    It's possible to retain this optimization for page_referenced and
    try_to_unmap.
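
    A sketch of the resulting check inside page_check_address(), with the
    new sync flag:

    /* Sketch: only trust the lockless shortcut when the caller tolerates
     * a false negative (sync == 0); page_mkclean passes sync == 1. */
    pte = pte_offset_map(pmd, address);
    if (!sync && !pte_present(*pte)) {
            pte_unmap(pte);
            return NULL;
    }
    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
            *ptlp = ptl;
            return pte;     /* returned with the ptl held */
    }
    pte_unmap_unlock(pte, ptl);
    return NULL;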

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Cc: Carsten Otte
    Cc: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Absolute alignment requirements may never be applied to node-relative
    offsets. Andreas Herrmann spotted this flaw when a bootmem allocation
    on an unaligned node was itself not aligned, because the combination of
    an unaligned node with an aligned offset into that node is not
    guaranteed to be aligned itself.

    This patch introduces two helper functions that align a node-relative
    index or offset with respect to the node's starting address so that the
    absolute PFN or virtual address that results from combining the two
    satisfies the requested alignment.

    Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these
    helpers.
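
    A sketch of the two helpers (the bootmem_data field names are
    assumptions about the layout):

    /* Sketch: align a node-relative page index so that the resulting
     * absolute pfn satisfies the requested alignment. */
    static unsigned long align_idx(struct bootmem_data *bdata,
                                   unsigned long idx, unsigned long step)
    {
            unsigned long base = bdata->node_min_pfn;

            return ALIGN(base + idx, step) - base;
    }

    /* Same idea for byte offsets relative to the node start. */
    static unsigned long align_off(struct bootmem_data *bdata,
                                   unsigned long off, unsigned long align)
    {
            unsigned long base = PFN_PHYS(bdata->node_min_pfn);

            return ALIGN(base + off, align) - base;
    }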

    Signed-off-by: Johannes Weiner
    Reported-by: Andreas Herrmann
    Debugged-by: Andreas Herrmann
    Reviewed-by: Andreas Herrmann
    Tested-by: Andreas Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    mminit_loglevel is now used from mminit_verify_zonelist.
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • Adjust show_swap_cache_info() to show "Free swap" as a signed long:
    the signed format is preferable, because during swapoff nr_swap_pages
    can legitimately go negative, so it makes more sense this way (it used
    to be shown redundantly, once as signed and once as unsigned).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty
    dance in page_remove_rmap(): I was wrong to think the PageSwapCache test
    could be avoided, and would like a comment in there to remind me. And
    mention s390, to help us remember that this block is not really common.

    Also move down the "It would be tidy to reset PageAnon" comment: it does
    not belong to s390's block, and it would be unwise to reset PageAnon
    before we're done with testing it.

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Switch remote node defragmentation off by default. The current settings can
    cause excessive node local allocations with hackbench:

    SLAB:

    % cat /proc/meminfo
    MemTotal: 7701760 kB
    MemFree: 5940096 kB
    Slab: 123840 kB

    SLUB:

    % cat /proc/meminfo
    MemTotal: 7701376 kB
    MemFree: 4740928 kB
    Slab: 1591680 kB

    [Note: this feature is not related to slab defragmentation.]

    You can find the original discussion here:

    http://lkml.org/lkml/2008/8/4/308

    Reported-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

15 Aug, 2008

1 commit

  • This is the minimal sequence that jams the allocator:

    void *p, *q, *r;
    p = alloc_bootmem(PAGE_SIZE);
    q = alloc_bootmem(64);
    free_bootmem(p, PAGE_SIZE);
    p = alloc_bootmem(PAGE_SIZE);
    r = alloc_bootmem(64);

    after this sequence (assuming that the allocator was empty or page-aligned
    before), pointer "q" will be equal to pointer "r".

    What's happening inside the allocator:

    p = alloc_bootmem(PAGE_SIZE);
        in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
    q = alloc_bootmem(64);
        in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
    free_bootmem(p, PAGE_SIZE);
        in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
    p = alloc_bootmem(PAGE_SIZE);
        in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
    r = alloc_bootmem(64);

    and now:

    it finds bit 2 as the place to allocate (sidx)

    it hits the condition

    if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx)
            start_off = ALIGN(bdata->last_end_off, align);

    You can see that the condition is true, so it assigns start_off =
    ALIGN(bdata->last_end_off, align) (that is, PAGE_SIZE) and allocates
    over the already allocated block.

    With the patch it tries to continue at the end of previous allocation only
    if the previous allocation ended in the middle of the page.
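
    A sketch of the corrected condition in the allocator:

    /* Sketch: only merge with the previous allocation when it ended in
     * the middle of a page; a page-aligned last_end_off may belong to a
     * block that was freed and reallocated, as in the sequence above. */
    if (bdata->last_end_off & (PAGE_SIZE - 1) &&
                    PFN_DOWN(bdata->last_end_off) + 1 == sidx)
            start_off = ALIGN(bdata->last_end_off, align);
    else
            start_off = PFN_PHYS(sidx);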

    Signed-off-by: Mikulas Patocka
    Acked-by: Johannes Weiner
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

14 Aug, 2008

1 commit

  • Fix the setting of PF_SUPERPRIV by __capable(), as it could corrupt the
    flags of the target process if that is not the current process and it
    is trying to change its own flags in a different way at the same time.

    __capable() uses neither atomic ops nor locking to protect t->flags.
    This patch removes __capable() and introduces has_capability(), which
    doesn't set PF_SUPERPRIV on the process being queried.
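
    Conceptually, the difference is that has_capability() never writes to
    the target's flags (a sketch; the real definition may be a macro over
    the security hook):

    /* Sketch: ask whether task t has a capability, without marking it
     * with PF_SUPERPRIV the way capable() marks current. */
    static inline int has_capability(struct task_struct *t, int cap)
    {
            return security_capable(t, cap) == 0;
    }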

    This patch further splits security_ptrace() in two:

    (1) security_ptrace_may_access(). This passes judgement on whether one
    process may access another only (PTRACE_MODE_ATTACH for ptrace() and
    PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
    current is the parent.

    (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
    and takes only a pointer to the parent process. current is the child.

    In Smack and commoncap, this uses has_capability() to determine whether
    the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
    This does not set PF_SUPERPRIV.

    Two of the instances of __capable() actually only act on current, and so have
    been changed to calls to capable().

    Of the places that were using __capable():

    (1) The OOM killer calls __capable() thrice when weighing the killability of a
    process. All of these now use has_capability().

    (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
    whether the parent was allowed to trace any process. As mentioned above,
    these have been split. For PTRACE_ATTACH and /proc, capable() is now
    used, and for PTRACE_TRACEME, has_capability() is used.

    (3) cap_safe_nice() only ever saw current, so now uses capable().

    (4) smack_setprocattr() rejected accesses to tasks other than current
    just after calling __capable(), so the order of these two tests has
    been switched and capable() is used instead.

    (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
    receive SIGIO on files they're manipulating.

    (6) In smack_task_wait(), we let a process wait for a privileged process,
    whether or not the process doing the waiting is privileged.

    I've tested this with the LTP SELinux and syscalls testscripts.

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Acked-by: Andrew G. Morgan
    Acked-by: Al Viro
    Signed-off-by: James Morris

    David Howells
     

13 Aug, 2008

7 commits

  • Signed-off-by: Huang Weiyi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Weiyi
     
  • Signed-off-by: MinChan Kim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that
    the pages exist for future faults. A struct file_region is used to
    track when reservations have been consumed and where. These
    file_regions are allocated as necessary with kmalloc(), which can sleep
    with the mm->page_table_lock held. This is wrong and triggers a
    might-sleep warning when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary
    memory, without actually making the change. The second phase actually
    commits the change. This patch makes use of this by checking the
    reservations before the page_table_lock is taken, triggering any
    necessary allocations. The work may then be safely repeated within the
    locks without any allocations being required.
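
    The resulting fault-path pattern, roughly (helper names follow the
    two-phase description; the surrounding code is assumed):

    /* phase 1, outside the lock: may sleep in kmalloc() */
    if (vma_needs_reservation(vma, address) < 0)
            return VM_FAULT_OOM;

    spin_lock(&mm->page_table_lock);
    /* phase 2, under the lock: guaranteed not to allocate */
    vma_commit_reservation(vma, address);
    spin_unlock(&mm->page_table_lock);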

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs:
    yes, of course, loop0 has no mm: other entry points check for that, but
    this one didn't.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • .. since a failed allocation is (initially) handled gracefully, and the
    function panic()s explicitly upon failure only if retries with smaller
    sizes have failed.

    Signed-off-by: Jan Beulich
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • The s390 software large page emulation implements shared page tables by
    using page->index of the first tail page from a compound large page to
    store page table information. This is set up in arch_prepare_hugepage(),
    which is called from alloc_fresh_huge_page_node().

    A similar call to arch_prepare_hugepage() is missing for surplus large
    pages that are allocated in alloc_buddy_huge_page(), which breaks the
    software emulation mode for (surplus) large pages on s390. This patch
    adds the missing call to arch_prepare_hugepage(). It will have no effect
    on other architectures where arch_prepare_hugepage() is a nop.

    Also, use the correct order in the error path in alloc_fresh_huge_page_node().
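
    A sketch of the added call in alloc_buddy_huge_page(), mirroring the
    fresh-page path (the exact context is assumed):

    page = alloc_pages(htlb_alloc_mask | __GFP_COMP, HUGETLB_PAGE_ORDER);
    if (page && arch_prepare_hugepage(page)) {
            /* free with the huge page order, as in the fixed error path */
            __free_pages(page, HUGETLB_PAGE_ORDER);
            return NULL;
    }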

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Acked-by: Nick Piggin
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

12 Aug, 2008

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    fix spinlock recursion in hvc_console
    stop_machine: remove unused variable
    modules: extend initcall_debug functionality to the module loader
    export virtio_rng.h
    lguest: use get_user_pages_fast() instead of get_user_pages()
    mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
    lguest: don't set MAC address for guest unless specified

    Linus Torvalds
     
  • Make the get_user_pages_fast() fallback implementation out of line and
    a weak symbol, and get rid of CONFIG_HAVE_GET_USER_PAGES_FAST.

    Export the symbol to modules so lguest can use it.
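
    The generic fallback is small; a sketch of the weak symbol
    (architectures with a real fast path simply override it):

    /* Sketch: generic fallback -- takes mmap_sem, unlike the per-arch
     * fast versions that walk page tables locklessly. */
    int __attribute__((weak)) get_user_pages_fast(unsigned long start,
                            int nr_pages, int write, struct page **pages)
    {
            struct mm_struct *mm = current->mm;
            int ret;

            down_read(&mm->mmap_sem);
            ret = get_user_pages(current, mm, start, nr_pages,
                                 write, 0, pages, NULL);
            up_read(&mm->mmap_sem);

            return ret;
    }
    EXPORT_SYMBOL_GPL(get_user_pages_fast);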

    Signed-off-by: Nick Piggin
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    lockdep: fix debug_lock_alloc
    lockdep: increase MAX_LOCKDEP_KEYS
    generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()
    lockdep: fix overflow in the hlock shrinkage code
    lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()
    lockdep: handle chains involving classes defined in modules
    mm: fix mm_take_all_locks() locking order
    lockdep: annotate mm_take_all_locks()
    lockdep: spin_lock_nest_lock()
    lockdep: lock protection locks
    lockdep: map_acquire
    lockdep: shrink held_lock structure
    lockdep: re-annotate scheduler runqueues
    lockdep: lock_set_subclass - reset a held lock's subclass
    lockdep: change scheduler annotation
    debug_locks: set oops_in_progress if we will log messages.
    lockdep: fix combinatorial explosion in lock subgraph traversal

    Linus Torvalds
     
  • Ingo Molnar
     

11 Aug, 2008

1 commit

  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    Which the locking hierarchy in mm/rmap.c confirms as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and
    then taking all the anon_vma->lock instances.
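
    A sketch of the reordered mm_take_all_locks() body (vm_lock_mapping and
    vm_lock_anon_vma stand for the per-object locking helpers):

    /* take every file-backed mapping's i_mmap_lock first ... */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (vma->vm_file && vma->vm_file->f_mapping)
                    vm_lock_mapping(mm, vma->vm_file->f_mapping);
    }

    /* ... then every anon_vma lock, matching the mm/rmap.c hierarchy */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (vma->anon_vma)
                    vm_lock_anon_vma(mm, vma->anon_vma);
    }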

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra