23 Mar, 2007

2 commits

  • Make the SYSV SHM nattch counter work correctly by forcing multiple VMAs to
    be produced to represent MAP_SHARED segments, even if they overlap exactly.

    Using this test program:

    http://people.redhat.com/~dhowells/doshm.c

    Run as:

    doshm sysv

    I can see nattch going from one before the patch:

    # /doshm sysv
    Command: sysv
    shmid: 65536
    memory: 0xc3700000
    c0b00000-c0b04000 rw-p 00000000 00:00 0
    c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so
    c3520000-c352278c rw-p 00000000 00:0b 13763417 /doshm
    c3584000-c35865e8 r-xs 00000000 00:0b 13763417 /doshm
    c3588000-c358aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3590000-c359b6c0 rw-p 00000000 00:00 0
    c3620000-c3640000 rwxp 00000000 00:00 0
    c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted)
    c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted)
    nattch 1

    To two after the patch:

    # /doshm sysv
    Command: sysv
    shmid: 0
    memory: 0xc3700000
    c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so
    c3320000-c3340000 rwxp 00000000 00:00 0
    c3530000-c35325e8 r-xs 00000000 00:0b 13763417 /doshm
    c3534000-c353678c rw-p 00000000 00:0b 13763417 /doshm
    c3538000-c353aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3590000-c359b6c0 rw-p 00000000 00:00 0
    c35a4000-c35a8000 rw-p 00000000 00:00 0
    c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted)
    c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted)
    nattch 2

    That's +1 to nattch for each shmat() made.
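
    For reference, a minimal check in the same spirit as doshm.c (a
    hypothetical reconstruction, not the actual test program): attach the
    same segment twice, then read shm_nattch back via shmctl(IPC_STAT).

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        struct shmid_ds ds;
        int shmid = shmget(IPC_PRIVATE, 0x100000, IPC_CREAT | 0600);
        if (shmid == -1) { perror("shmget"); exit(1); }

        /* two attachments of one segment: each should bump nattch */
        void *a = shmat(shmid, NULL, 0);
        void *b = shmat(shmid, NULL, 0);
        if (a == (void *)-1 || b == (void *)-1) { perror("shmat"); exit(1); }

        if (shmctl(shmid, IPC_STAT, &ds) == -1) { perror("shmctl"); exit(1); }
        printf("nattch %lu\n", (unsigned long)ds.shm_nattch); /* expect 2 */
        return 0;
    }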

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.

    Signed-off-by: David Howells
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

06 Dec, 2006

1 commit

  • While playing with Blackfin, I hit a neat bug: doing an open() on a
    directory and then passing that fd to mmap() would cause the kernel to
    hang.

    After poking into the code a bit more, I found that
    mm/nommu.c:validate_mmap_request() checks the length and, if it is 0,
    just returns the address. This is in stark contrast to the MMU version,
    mm/mmap.c:do_mmap_pgoff(), which returns -EINVAL for 0-length requests.
    I then noticed that some other parts of the logic are out of date between
    the two functions, so perhaps that's the easy fix?
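
    As a minimal user-space trigger (my sketch, not part of the original
    report): mmap() a zero-length region from a directory fd. A fixed kernel
    should fail with -EINVAL rather than hang.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        int fd = open("/", O_RDONLY);    /* a directory fd */
        void *p = mmap(NULL, 0, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            perror("mmap");              /* expected: Invalid argument */
        close(fd);
        return 0;
    }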

    Signed-off-by: Greg Ungerer
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     

27 Sep, 2006

8 commits

  • Make futexes work under NOMMU conditions.

    This can be tested by running this in one shell:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    #define SYSERROR(X, Y) \
    do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

    /* there is no libc wrapper for futex(); invoke the syscall directly */
    static int futex(int *uaddr, int op, int val,
                     const struct timespec *timeout, int *uaddr2, int val3)
    {
        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    int main(void)
    {
        int shmid, tmp, *f, n;

        /* create (or attach to) a 4-byte SYSV SHM segment with key 23 */
        shmid = shmget(23, 4, IPC_CREAT|0666);
        SYSERROR(shmid, "shmget");

        f = shmat(shmid, NULL, 0);
        SYSERROR(f, "shmat");

        /* snapshot the value, then sleep until a waker changes it */
        n = *f;
        printf("WAIT: %p{%x}\n", f, n);
        tmp = futex(f, FUTEX_WAIT, n, NULL, NULL, 0);
        SYSERROR(tmp, "futex");
        printf("WAITED: %d\n", tmp);

        tmp = shmdt(f);
        SYSERROR(tmp, "shmdt");

        exit(0);
    }

    And then this in the other shell:

    /* same #includes and futex() wrapper as the waiter above */

    #define SYSERROR(X, Y) \
    do { if ((long)(X) == -1L) { perror(Y); exit(1); }} while(0)

    int main(void)
    {
        int shmid, tmp, *f;

        shmid = shmget(23, 4, IPC_CREAT|0666);
        SYSERROR(shmid, "shmget");

        f = shmat(shmid, NULL, 0);
        SYSERROR(f, "shmat");

        /* bump the value the waiter snapshotted, then wake one waiter */
        (*f)++;
        printf("WAKE: %p{%x}\n", f, *f);
        tmp = futex(f, FUTEX_WAKE, 1, NULL, NULL, 0);
        SYSERROR(tmp, "futex");
        printf("WOKE: %d\n", tmp);

        tmp = shmdt(f);
        SYSERROR(tmp, "shmdt");

        exit(0);
    }

    The first program sets up a SYSV IPC SHM segment and waits on a futex in
    it for the number at the start to change. The second program increments
    that number and wakes the first program up. This leads to output of the
    form:

    SHELL 1                   SHELL 2
    ========================  ========================
    # /dowait
    WAIT: 0xc32ac000{0}
                              # /dowake
                              WAKE: 0xc32ac000{1}
    WAITED: 0                 WOKE: 1

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Make mremap() partially work for NOMMU kernels. It may resize a VMA
    provided that the new size doesn't exceed the size of the slab object in
    which the storage the VMA refers to is allocated. Shareable VMAs may not
    be resized.

    Moving VMAs (as permitted by MREMAP_MAYMOVE) is not currently supported.

    This patch also makes use of the fact that the VMA list is now ordered to cut
    it short when possible.
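
    As a hedged user-space illustration of the resize semantics (a sketch;
    whether an in-place resize succeeds depends on the slab object backing
    the region):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* in-place shrink: may be permitted, since it stays within the
         * backing allocation; MREMAP_MAYMOVE would not help on NOMMU */
        void *q = mremap(p, 4096, 2048, 0);
        if (q == MAP_FAILED)
            perror("mremap");
        return 0;
    }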

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Order the per-mm_struct VMA list by address so that searching it can be cut
    short when the appropriate address has been exceeded.
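
    The payoff of the ordering, in sketch form (illustrative types, not the
    kernel's actual structures): a lookup can bail out as soon as it has
    passed the target address.

    struct vma_sketch {
        unsigned long start, end;
        struct vma_sketch *next;        /* kept sorted by ascending start */
    };

    static struct vma_sketch *find_vma_sketch(struct vma_sketch *head,
                                              unsigned long addr)
    {
        struct vma_sketch *v;

        for (v = head; v; v = v->next) {
            if (addr < v->start)
                return NULL;            /* ordered: no later VMA can match */
            if (addr < v->end)
                return v;
        }
        return NULL;
    }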

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Permit ptrace to modify a section that's non-shared but is marked
    unwritable, such as is obtained by mapping the text segment of an
    ELF-FDPIC executable binary into a process that's being ptraced[*].

    [*] Under NOMMU conditions ptrace causes read-only MAP_PRIVATE mmaps to
    become totally private copies because, if a private mapping were actually
    shared, then a debugger setting breakpoints in it would potentially crash
    other processes.

    This is done by using the VM_MAYWRITE flag rather than the VM_WRITE flag
    when deciding whether to permit a write.

    Without this patch a debugger can't set breakpoints in the mapped text
    sections of executables that are mapped read-only private, even if the
    mmap() syscall has taken a private copy because PT_PTRACED is set.

    In addition, VM_MAYREAD is used instead of VM_READ for similar reasons.
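
    In sketch form (the VM_* values match the kernel headers of that era;
    the helper itself is illustrative, not the actual code):

    #define VM_READ     0x00000001
    #define VM_WRITE    0x00000002
    #define VM_MAYREAD  0x00000010
    #define VM_MAYWRITE 0x00000020

    /* a private mapping that *could* be made writable (VM_MAYWRITE) may be
     * modified by the debugger even though VM_WRITE is clear */
    static int ptrace_access_ok(unsigned long vm_flags, int write)
    {
        return write ? (vm_flags & VM_MAYWRITE) != 0
                     : (vm_flags & VM_MAYREAD) != 0;
    }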

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Check the VMA protections in get_user_pages() against what's being asked.

    This checks to see that we don't accidentally write on a non-writable VMA or
    permit an I/O mapping VMA to be accessed (which may lack page structs).

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • On a NOMMU arch, running "cat /proc/self/mem" reads data from physical
    address 0. This behavior is different from MMU arches; on IA32 the
    message "cat: /proc/self/mem: Input/output error" is reported instead.

    The root cause is that the NOMMU version of get_user_pages() does not
    validate the start address. The following patch solves this issue.

    Signed-off-by: Sonic Zhang
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonic Zhang
     
  • Use find_vma() in the NOMMU version of access_process_vm() rather than
    reimplementing it.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Check that access_process_vm() is accessing a valid mapping in the target
    process.

    This limits ptrace() accesses and accesses through /proc/<pid>/maps to
    only those regions actually mapped by a program.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

26 Sep, 2006

1 commit

  • Remove the atomic counter for slab_reclaim_pages and replace the counter
    and NR_SLAB with two ZVC counters that account for unreclaimable and
    reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.

    Change the check in vmscan.c to refer to NR_SLAB_RECLAIMABLE. The
    intent seems to be to check for slab pages that could be freed.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

01 Jul, 2006

1 commit

  • Currently a single atomic variable is used to establish the size of the page
    cache in the whole machine. The zoned VM counters have the same method of
    implementation as the nr_pagecache code but also allow the determination of
    the pagecache size per zone.

    Remove the special implementation for nr_pagecache and make it a zoned counter
    named NR_FILE_PAGES.

    Updates of the page cache counters are always performed with interrupts off.
    We can therefore use the __ variant here.

    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

11 Apr, 2006

1 commit

  • This patch is an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory() in mm/nommu.c.

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    the algorithm now subtracts the number of reserved pages from the result
    of nr_free_pages().

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     

22 Mar, 2006

1 commit

  • Now that compound page handling is properly fixed in the VM, move nommu
    over to using compound pages rather than rolling its own refcounting.

    nommu VM page refcounting is broken anyway, but there is no need to have
    divergent code in the core VM now, nor when it gets fixed.

    Signed-off-by: Nick Piggin
    Cc: David Howells

    (Needs testing, please).
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Jan, 2006

1 commit

  • The attached patch makes the SYSV IPC shared memory facilities use the new
    ramfs facilities on a no-MMU kernel.

    The following changes are made:

    (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to
    allow the IPC SHM facilities to commune with the tiny-shmem and shmem
    code.

    (2) ramfs files now need resizing using do_truncate() rather than by modifying
    the inode size directly (see shmem_file_setup()). This causes ramfs to
    attempt to bind a block of pages of sufficient size to the inode.

    (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

29 Nov, 2005

1 commit

  • This replaces the (in my opinion horrible) VM_UNMAPPED logic with very
    explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
    VM area to contain an arbitrary range of page table entries that the VM
    never touches, and never considers to be normal pages.

    Any user of "remap_pfn_range()" automatically gets this new
    functionality, and doesn't even have to mark the pages reserved or
    indeed mark them any other way. It just works. As a side effect, doing
    mmap() on /dev/mem works for arbitrary ranges.
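
    A typical driver-side use, sketched (the device address and function
    names here are made up; error handling is trimmed):

    #include <linux/mm.h>
    #include <linux/fs.h>

    #define MYDEV_PHYS 0x10000000UL     /* hypothetical device memory */

    static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* remap_pfn_range() now gives the VMA the VM_PFNMAP semantics
         * automatically; no page-reservation dance is needed */
        return remap_pfn_range(vma, vma->vm_start,
                               MYDEV_PHYS >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }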

    Sparc update from David in the next commit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Oct, 2005

3 commits

  • Final step in pushing down common core's page_table_lock. follow_page no
    longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
    and so no page_table_lock is taken in get_user_pages itself.

    But get_user_pages (and get_futex_key) do then need follow_page to pin the
    page for them: take Daniel's suggestion of bitflags to follow_page.

    Need one for WRITE, another for TOUCH (it was the accessed flag before:
    vanished along with check_user_page_readable, but surely get_numa_maps is
    wrong to mark every page it finds as accessed), another for GET.

    And another, ANON to dispose of untouched_anonymous_page: it seems silly for
    that to descend a second time, let follow_page observe if there was no page
    table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED -
    make_pages_present ought to make readonly anonymous present.

    Give get_numa_maps a cond_resched while we're there.
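
    For orientation, the follow_page() bitflags described above, roughly as
    they landed in include/linux/mm.h (treat the values and comments as a
    paraphrase):

    #define FOLL_WRITE  0x01    /* check pte is writable */
    #define FOLL_TOUCH  0x02    /* mark page accessed */
    #define FOLL_GET    0x04    /* do get_page on page */
    #define FOLL_ANON   0x08    /* give ZERO_PAGE if no pgtable */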

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be found inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.
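
    In sketch form, the convention looks like this (an illustrative struct;
    the real macros operate on mm_struct):

    struct mm_sketch {
        unsigned long rss, hiwater_rss;
    };

    /* called just before rss is about to be lowered, or at final exit */
    static void update_hiwater_rss_sketch(struct mm_sketch *mm)
    {
        if (mm->rss > mm->hiwater_rss)
            mm->hiwater_rss = mm->rss;
    }

    /* a collector (e.g. /proc) still takes the max with the live value */
    static unsigned long read_hiwater_rss(const struct mm_sketch *mm)
    {
        return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
    }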

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
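
    The shape of the change, sketched in C11 atomics (the field names mirror
    the commit; the types are stand-ins for the kernel's counters):

    #include <stdatomic.h>

    struct mm_counters {
        atomic_long file_rss;
        atomic_long anon_rss;
    };

    /* derive the total on demand: two atomic reads are cheaper than
     * keeping a third atomic counter in sync on every fault */
    static long get_mm_rss_sketch(struct mm_counters *mm)
    {
        return atomic_load(&mm->file_rss) + atomic_load(&mm->anon_rss);
    }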

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc point of view we replaced unsigned int with
    typedef) and documents what's going on far better.
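
    For illustration, the effect on a typical allocator signature (kmalloc
    is real; the exact before/after lines are paraphrased):

    /* before: void *kmalloc(size_t size, unsigned int __nocast flags); */
    /* after:  void *kmalloc(size_t size, gfp_t flags);                 */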

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

12 Sep, 2005

1 commit

  • Move the call to get_mm_counter() in update_mem_hiwater() to be
    inside the check for tsk->mm being null; otherwise you can be
    following a null pointer here. This patch was submitted by
    Javier Herrero.

    Modify the end check for munmap regions to allow for the
    legacy behavior of 0 being valid. Pretty much all current
    uClinux system libc mallocs pass in 0 as the end point.
    A hard check will fail on these, so change the check so
    that if it is non-zero it must be valid, otherwise it fails.
    A passed-in value will always succeed (as it used to).

    Also export a few more mm system functions - to be consistent
    with the VM code exports.

    Signed-off-by: Greg Ungerer
    Signed-off-by: Linus Torvalds

    Greg Ungerer
     

05 Aug, 2005

1 commit

  • We have found what seems to be a small bug in __vm_enough_memory() when
    sysctl_overcommit_memory is set to OVERCOMMIT_NEVER.

    When this bug occurs, the system fails to boot, with /sbin/init whining
    about fork() returning ENOMEM.

    We hunted down the problem to this:

    The deferred update mechanism used in vm_acct_memory(), on an SMP system,
    allows the vm_committed_space counter to have a negative value.

    This should not be a problem since this counter is known to be inaccurate.

    But in __vm_enough_memory() this counter is compared to the `allowed'
    variable, which is an unsigned long. This comparison is broken since it
    will consider the negative values of vm_committed_space to be huge positive
    values, resulting in a memory allocation failure.
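
    The pitfall in isolation (a self-contained demo, not the kernel code):

    #include <stdio.h>

    int main(void)
    {
        long committed = -5;            /* vm_committed_space gone negative */
        unsigned long allowed = 1000;

        /* committed is implicitly converted to a huge unsigned value, so
         * the check wrongly concludes that memory is exhausted */
        if (committed > allowed)
            printf("refused: %lu > %lu\n",
                   (unsigned long)committed, allowed);
        return 0;
    }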

    Signed-off-by:
    Signed-off-by:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Derr
     

22 Jun, 2005

1 commit

  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change new areas were always
    searched from the base address on.

    So now new small areas are cluttering holes of all sizes
    throughout the whole mmap-able region whereas before small holes
    tended to close holes near the base leaving holes far from the base
    large and available for larger requests.

    2) the free_area_cache also is set to the location of the last
    munmap-ed area so in scenarios where we allocate e.g. five regions of
    1K each, then free regions 4 2 3 in this order the next request for 1K
    will be placed in the position of the old region 3, whereas before we
    appended it to the still active region 1, placing it at the location
    of the old region 2. Before we had 1 free region of 2K, now we only
    get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache
    descriptor, cached_hole_size, that contains the largest known hole size
    below the current free_area_cache. If a new request comes in, its size is
    compared against the cached_hole_size and, if the request can be filled
    with a hole below free_area_cache, the search is started from the base
    instead.
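
    The decision rule, in sketch form (an illustrative container; the real
    fields live on mm_struct):

    struct mmap_state {
        unsigned long base;             /* bottom of the mmap area */
        unsigned long free_area_cache;  /* where the last search ended */
        unsigned long cached_hole_size; /* largest known hole below cache */
    };

    static unsigned long search_start(const struct mmap_state *mm,
                                      unsigned long len)
    {
        if (len <= mm->cached_hole_size)
            return mm->base;            /* a hole below the cache can fit */
        return mm->free_area_cache;     /* otherwise resume where we left off */
    }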

    The results look promising: whereas 2.6.12-rc4 fragments quickly and my
    (earlier posted) leakme.c test program terminates after 50000+ iterations
    with 96 distinct and fragmented maps in /proc/self/maps, it performs
    nicely (as expected) with thread creation: Ingo's test_str02 with 20000
    threads requires 0.7s system time.

    Taking out Ingo's patch (un-patch available per request) by basically
    deleting all mentions of free_area_cache from the kernel and starting the
    search for new memory always at the respective bases, we observe: leakme
    terminates successfully with 11 distinct, hardly fragmented areas in
    /proc/self/maps, but thread creation is grindingly slow: 30+s(!) system
    time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) the appended patch works fine with leakme: it ends with
    only 7 distinct areas in /proc/self/maps and also thread creation seems
    sufficiently fast with 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
     

17 May, 2005

1 commit

  • Linus changed the second argument of __vmalloc from int to unsigned int
    breaking the compilation for CONFIG_MMU=n configurations (since he only
    changed vmalloc.c but not nommu.c).

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds