31 Dec, 2009

1 commit

  • Move sys_mmap_pgoff() from mm/util.c to mm/mmap.c and mm/nommu.c,
    where we'd expect to find such code: especially now that it contains
    the MAP_HUGETLB handling. Revert mm/util.c to how it was in 2.6.32.

    This patch just ignores MAP_HUGETLB in the nommu case, as in 2.6.32,
    whereas 2.6.33-rc2 reported -ENOSYS. Perhaps validate_mmap_request()
    should reject it with -EINVAL? Add that later if necessary.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Dec, 2009

2 commits

  • Modify the generic mmap() code to keep the cache attribute in
    vma->vm_page_prot regardless of whether writenotify is enabled. Without
    this patch, the cache configuration selected by f_op->mmap() is
    overwritten when writenotify is enabled, making it impossible to keep
    the vma uncached.

    Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c, which use
    deferred io together with uncached memory.
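
    A hedged sketch of the idea in mmap_region() (not necessarily the exact
    upstream diff): if the protection chosen by f_op->mmap() was uncached,
    re-apply that attribute after write-notification strips the shared-write
    permission:

        if (vma_wants_writenotify(vma)) {
                pgprot_t prot = vma->vm_page_prot;
                vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
                /* preserve an uncached attribute picked by f_op->mmap() */
                if (pgprot_val(prot) == pgprot_val(pgprot_noncached(prot)))
                        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        }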

    Signed-off-by: Magnus Damm
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Jaya Kumar
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • On ia64, the following test program exits abnormally because the glibc
    thread library calls abort().

    ========================================================
    (gdb) bt
    #0 0xa000000000010620 in __kernel_syscall_via_break ()
    #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
    #2 0x2000000000324090 in abort () from /lib/libc.so.6.1
    #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
    #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
    #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
    ========================================================

    The fact is, glibc calls munmap() at thread-exit time to free the
    stack, and it assumes munmap() never fails. However, munmap() often
    splits vmas, and with a map count near max_map_count the split can
    return -ENOMEM.

    That's crazy, because stack unmapping never increases the map count
    in the end: the limit is only exceeded temporarily, and an internal,
    temporary excess shouldn't cause ENOMEM.

    This patch fixes that (see the sketch below).
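
    A hedged sketch of the approach (helper naming assumed from the
    description here, not guaranteed to match the final diff): let the
    unmap path split vmas without the max_map_count check, keeping the
    check only for other callers:

        /* split_vma() keeps the limit check for external callers... */
        int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
                      unsigned long addr, int new_below)
        {
                if (mm->map_count >= sysctl_max_map_count)
                        return -ENOMEM;
                return __split_vma(mm, vma, addr, new_below);
        }

        /* ...while do_munmap() calls __split_vma() directly, since any
         * excess map_count there is only transient during unmapping. */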

    test_max_mapcount.c
    ==================================================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    #define THREAD_NUM 30000
    #define MAL_SIZE (8*1024*1024)

    void *wait_thread(void *args)
    {
            void *addr;

            addr = malloc(MAL_SIZE);
            sleep(10);

            return NULL;
    }

    void *wait_thread2(void *args)
    {
            sleep(60);

            return NULL;
    }

    int main(int argc, char *argv[])
    {
            int i;
            pthread_t thread[THREAD_NUM], th;
            int ret, count = 0;
            pthread_attr_t attr;

            ret = pthread_attr_init(&attr);
            if (ret)
                    perror("pthread_attr_init");

            ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
            if (ret)
                    perror("pthread_attr_setdetachstate");

            for (i = 0; i < THREAD_NUM; i++) {
                    ret = pthread_create(&th, &attr, wait_thread, NULL);
                    if (ret) {
                            fprintf(stderr, "[%d] ", count);
                            perror("pthread_create");
                    } else {
                            printf("[%d] create OK.\n", count);
                    }
                    count++;

                    ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
                    if (ret) {
                            fprintf(stderr, "[%d] ", count);
                            perror("pthread_create");
                    } else {
                            printf("[%d] create OK.\n", count);
                    }
                    count++;
            }

            sleep(3600);
            return 0;
    }
    ==================================================================

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2009

3 commits


25 Oct, 2009

1 commit


28 Sep, 2009

1 commit


22 Sep, 2009

8 commits

  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.
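
    A minimal userspace sketch of the new flag (assumes huge pages have
    been reserved via /proc/sys/vm/nr_hugepages; the fallback MAP_HUGETLB
    value is the architecture-specific x86 one):

        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MAP_HUGETLB
        #define MAP_HUGETLB 0x40000     /* x86 value; architecture-specific */
        #endif

        int main(void)
        {
                size_t len = 2 * 1024 * 1024;   /* one 2MB huge page on x86 */
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                               -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap(MAP_HUGETLB)");
                        return 1;
                }
                /* behaves like ordinary anonymous memory, huge-page backed */
                munmap(p, len);
                return 0;
        }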

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only
    some drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more appropriate place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worthwhile to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worthwhile--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock).
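
    A hedged sketch of the resulting condition in vma_adjust() (the shape
    of the test as described above, not the verbatim diff):

        /* take anon_vma->lock only when rmap could look at this vma:
         * when inserting, when importing, or when vm_start changes */
        if (vma->anon_vma && (insert || importer || start != vma->vm_start))
                anon_vma = vma->anon_vma;
        if (anon_vma)
                spin_lock(&anon_vma->lock);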

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • If (flags & MAP_LOCKED) is true, it means vm_flags already contains
    VM_LOCKED, which is set by calc_vm_flag_bits().

    So there is no need to set it again; just remove the redundant code.
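
    For reference, a sketch of why the bit is already set: vm_flags is
    computed from the mmap flags via calc_vm_flag_bits(), which (among
    other translations) maps MAP_LOCKED to VM_LOCKED:

        vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
                   mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
        /* so a later "if (flags & MAP_LOCKED) vm_flags |= VM_LOCKED;"
         * is redundant */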

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM originally stood for Kernel Shared Memory: but the kernel has long
    supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
    something else. So we switched to saying "merge" instead of "share".

    But Chris Wright points out that this is confusing where mmap.c merges
    adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
    is_mergeable_vma() to let vmas be merged despite flags being different.

    Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists
    only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
    that directly, with a comment on it in is_mergeable_vma().

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rawhide users have reported a hang at startup when cryptsetup is run: the
    same problem can be reproduced simply by running a program whose main()
    does nothing but mlockall(MCL_CURRENT | MCL_FUTURE) and return (see the
    reproducer below).
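
    The reproducer, spelled out as a complete program:

        #include <sys/mman.h>

        int main(void)
        {
                mlockall(MCL_CURRENT | MCL_FUTURE);
                return 0;
        }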

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but we
    must remember that ksmd may detect that before __ksm_exit is reached.
    So ksm_test_exit(mm) is added to document such checks on mm->mm_users
    (sketched below).
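
    A sketch of that helper as described (a reconstruction, not the
    verbatim patch):

        /* mm_users == 0 means the mm is exiting: back out, do nothing */
        static inline bool ksm_test_exit(struct mm_struct *mm)
        {
                return atomic_read(&mm->mm_users) == 0;
        }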

    __ksm_exit now has to leave clearing up the rmap_items to ksmd, which
    needs ksm_thread_mutex; but it shifts the exiting mm just after the
    ksm_scan cursor so that it will soon be dealt with. __ksm_enter raises
    mm_count to hold the mm_struct, and ksmd's exit processing (exactly like
    its processing when it finds all VM_MERGEABLEs unmapped) mmdrops it,
    with a similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
      -e 's/PERF_EVENT_/PERF_RECORD_/g' \
      -e 's/PERF_COUNTER/PERF_EVENT/g' \
      -e 's/perf_counter/perf_event/g' \
      -e 's/nb_counters/nb_events/g' \
      -e 's/swcounter/swevent/g' \
      -e 's/tpcounter_event/tp_event/g' \
      $FILES

    for N in $(find . -name perf_counter.[ch]); do
      M=$(echo $N | sed 's/perf_counter/perf_event/g')
      mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
      -e 's/COUNTER_MASK/REG_MASK/g' \
      -e 's/COUNTER/EVENT/g' \
      -e 's/\<event\>/event_id/g' \
      -e 's/counter/event/g' \
      -e 's/Counter/Event/g' \
      $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Sep, 2009

1 commit


17 Aug, 2009

1 commit

  • Currently, SELinux enforcement of controls on the ability to map low memory
    is determined by the mmap_min_addr tunable. This patch causes SELinux to
    ignore the tunable and instead use a separate Kconfig option specific to how
    much space the LSM should protect.

    The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
    permissions will always protect the amount of low memory designated by
    CONFIG_LSM_MMAP_MIN_ADDR.

    This allows users who need to disable the mmap_min_addr controls (usual reason
    being they run WINE as a non-root user) to do so and still have SELinux
    controls preventing confined domains (like a web server) from being able to
    map some area of low memory.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

12 Jun, 2009

1 commit

  • Merge branch 'perfcounters-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     

05 Jun, 2009

1 commit


04 Jun, 2009

2 commits

  • In the name of keeping it simple, only track mmap events. Userspace
    will have to remove old overlapping maps when it encounters them.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY.
    It also sets a default mmap_min_addr of 4096.

    mmapping of addresses below 4096 will only be possible for processes
    with CAP_SYS_RAWIO.
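
    A hedged userspace illustration of the effect (the mapping is expected
    to be refused for an unprivileged process once the default is in place):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* fixed mapping below vm.mmap_min_addr (default 4096) */
                void *p = mmap((void *)0, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                               -1, 0);
                if (p == MAP_FAILED)
                        perror("mmap at 0"); /* fails without CAP_SYS_RAWIO */
                return 0;
        }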

    Signed-off-by: Christoph Lameter
    Acked-by: Eric Paris
    Looks-ok-by: Linus Torvalds
    Signed-off-by: James Morris

    Christoph Lameter
     

18 May, 2009

1 commit


03 May, 2009

1 commit

  • The Committed_AS field can underflow in certain situations:

    > # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
    > 1 Committed_AS: 18446744073709323392 kB
    > 11 Committed_AS: 18446744073709455488 kB
    > 6 Committed_AS: 35136 kB
    > 5 Committed_AS: 18446744073709454400 kB
    > 7 Committed_AS: 35904 kB
    > 3 Committed_AS: 18446744073709453248 kB
    > 2 Committed_AS: 34752 kB
    > 9 Committed_AS: 18446744073709453248 kB
    > 8 Committed_AS: 34752 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 7 Committed_AS: 18446744073709454080 kB
    > 3 Committed_AS: 18446744073709320960 kB
    > 5 Committed_AS: 18446744073709454080 kB
    > 6 Committed_AS: 18446744073709320960 kB

    This is because NR_CPUS can be greater than 1000 and meminfo_proc_show()
    does not check for underflow.

    But an NR_CPUS-proportional bound isn't a good calculation anyway. In
    general, the possibility of lock contention is proportional to the number
    of online cpus, not the theoretical maximum (NR_CPUS).

    The current kernel has generic percpu-counter infrastructure; using it is
    the right way. It simplifies the code, and percpu_counter_read_positive()
    doesn't suffer from the underflow issue.
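
    A hedged sketch of the percpu-counter form (counter name taken from the
    description; fragments, not the full patch):

        #include <linux/percpu_counter.h>

        struct percpu_counter vm_committed_as;

        /* accounting paths add/subtract per-cpu, cheaply */
        percpu_counter_add(&vm_committed_as, pages);
        percpu_counter_sub(&vm_committed_as, pages);

        /* meminfo reads a clamped sum, so it can never show underflow */
        committed = percpu_counter_read_positive(&vm_committed_as);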

    Reported-by: Dave Hansen
    Signed-off-by: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: [All kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

29 Apr, 2009

1 commit


17 Apr, 2009

1 commit

  • Tetsuo Handa reports seeing the WARN_ON(current->mm == NULL) in
    security_vm_enough_memory(), when do_execve() is touching the
    target mm's stack, to set up its args and environment.

    Yes, a UMH_NO_WAIT or UMH_WAIT_PROC call_usermodehelper() spawns
    an mm-less kernel thread to do the exec. And in any case, that
    vm_enough_memory check when growing stack ought to be done on the
    target mm, not on the execer's mm (though apart from the warning,
    it only makes a slight tweak to OVERCOMMIT_NEVER behaviour).
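
    A hedged sketch of the fix in acct_stack_growth() (assuming the
    mm-taking variant of the hook, which checks a caller-supplied mm
    instead of current->mm):

        /* charge the growth against the mm being grown, not the execer */
        if (security_vm_enough_memory_mm(vma->vm_mm, grow))
                return -ENOMEM;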

    Reported-by: Tetsuo Handa
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Apr, 2009

1 commit

  • Currently the profiling information returns userspace IPs but no way
    to correlate them to userspace code. Userspace could look into
    /proc/$pid/maps but that might not be current or even present anymore
    at the time of analyzing the IPs.

    Therefore provide means to track the mmap information and provide it
    in the output stream.

    XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Cc: Andrew Morton
    Orig-LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Apr, 2009

1 commit

  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back to fork.c.

    (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

24 Mar, 2009

1 commit


12 Feb, 2009

1 commit

  • Christophe Saout reported [in precursor to:
    http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:

    > Note that I also saw a different issue with CONFIG_UNEVICTABLE_LRU.
    > Seems like Xen tears down current->mm early on process termination, so
    > that __get_user_pages in exit_mmap causes nasty messages when the
    > process had any mlocked pages. (in fact, it somehow manages to get into
    > the swapping code and produces a null pointer dereference trying to get
    > a swap token)

    Jeremy explained:

    Yes. In the normal case under Xen, an in-use pagetable is "pinned",
    meaning that it is RO to the kernel, and all updates must go via hypercall
    (or writes are trapped and emulated, which is much the same thing). An
    unpinned pagetable is not currently in use by any process, and can be
    directly accessed as normal RW pages.

    As an optimisation at process exit time, we unpin the pagetable as early
    as possible (switching the process to init_mm), so that all the normal
    pagetable teardown can happen with direct memory accesses.

    This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
    a few lines below. The obvious thing to do would be to move
    arch_exit_mmap() to below the munlock code, but I think we'd want to
    call it even if mm->mmap is NULL, just to be on the safe side.

    Thus, this patch:

    exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
    as the latter may switch the current mm to init_mm, which would cause the
    former to fail.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Lee Schermerhorn
    Cc: Christophe Saout
    Cc: Keir Fraser
    Cc: Christophe Saout
    Cc: Alex Williamson
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared
    and private mappings using reservation counters that are checked and
    updated during mmap(). This ensures (within limits) that hugepages exist
    in the future when faults occur; otherwise it is too easy for
    applications to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings.
    This breaks the accounting for both the core VM and hugetlbfs, and can
    trigger an OOM storm when hugepage pools are too small, as well as
    lockups and corrupted counters otherwise. This patch brings hugetlbfs
    more in line with how the core VM treats VM_NORESERVE, but prevents
    VM_ACCOUNT from being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Feb, 2009

2 commits

  • Conflicts:
    fs/namei.c

    Manually merged per:

    diff --cc fs/namei.c
    index 734f2b5,bbc15c2..0000000
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@@ -860,9 -848,8 +849,10 @@@ static int __link_path_walk(const char
    nd->flags |= LOOKUP_CONTINUE;
    err = exec_permission_lite(inode);
    if (err == -EAGAIN)
    - err = vfs_permission(nd, MAY_EXEC);
    + err = inode_permission(nd->path.dentry->d_inode,
    + MAY_EXEC);
    + if (!err)
    + err = ima_path_check(&nd->path, MAY_EXEC);
    if (err)
    break;

    @@@ -1525,14 -1506,9 +1509,14 @@@ int may_open(struct path *path, int acc
    flag &= ~O_TRUNC;
    }

    - error = vfs_permission(nd, acc_mode);
    + error = inode_permission(inode, acc_mode);
    if (error)
    return error;
    +
    - error = ima_path_check(&nd->path,
    ++ error = ima_path_check(path,
    + acc_mode & (MAY_READ | MAY_WRITE | MAY_EXEC));
    + if (error)
    + return error;
    /*
    * An append-only file must be opened in append mode for writing.
    */

    Signed-off-by: James Morris

    James Morris
     
  • This patch replaces the generic integrity hooks, for which IMA registered
    itself, with IMA integrity hooks in the appropriate places directly
    in the fs directory.

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    Mimi Zohar
     

01 Feb, 2009

1 commit

  • The mmap_region() code would temporarily set the VM_ACCOUNT flag for
    anonymous shared mappings just to inform shmem_zero_setup() that it
    should enable accounting for the resulting shm object. It would then
    clear the flag after calling ->mmap (for the /dev/zero case) or doing
    shmem_zero_setup() (for the MAP_ANON case).

    This not only resulted in vma merge issues, but also made for
    unnecessary confusion. Use the already-existing VM_NORESERVE flag for
    this instead, and let shmem_{zero|file}_setup() just figure it out from
    that.

    This also happens to make it obvious that the new DRI2 GEM layer uses a
    non-reserving backing store for its object allocation - which is quite
    possibly not intentional. But since I didn't want to change semantics
    in this patch, I left it alone, and just updated the caller to use the
    new flag semantics.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2009

1 commit

  • Commit de33c8db5910cda599899dd431cc30d7c1018cbf ("Fix OOPS in
    mmap_region() when merging adjacent VM_LOCKED file segments") unified
    the vma merging of anonymous and file maps to just one place, which
    simplified the code and fixed a use-after-free bug that could cause an
    oops.

    But by doing the merge opportunistically before even having called
    ->mmap() on the file method, it now compares two different 'vm_flags'
    values: the pre-mmap() value of the new not-yet-formed vma, and previous
    mappings of the same file around it.

    And in doing so, it refused to merge the common file case, which adds a
    marker to say "I can be made non-linear".

    This fixes it by just adding a set of flags that don't have to match,
    because we know they are ok to merge. Currently it's only that single
    VM_CAN_NONLINEAR flag, but at least conceptually there could be others
    in the future.
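
    A hedged sketch of the resulting check in is_mergeable_vma() (the shape
    described above; only VM_CAN_NONLINEAR is exempt at present):

        /* VM_CAN_NONLINEAR may be set later by f_op->mmap(), so ignore
         * it when deciding whether two vmas' flags allow a merge */
        if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
                return 0;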

    Reported-and-acked-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Jan, 2009

1 commit

  • As of commit ba470de43188cdbff795b5da43a1474523c6c2fb ("mmap: handle
    mlocked pages during map, remap, unmap") we now use the 'vma' variable
    at the end of mmap_region() to handle the page-in of newly mapped
    mlocked pages.

    However, if we merged adjacent vma's together, the vma we're using may
    be stale. We historically consciously avoided using it after the merge
    operation, but that got overlooked when redoing the locked page
    handling.

    This commit simplifies mmap_region() by doing any vma merges early,
    avoiding the issue entirely, and 'vma' will always be valid. As pointed
    out by Hugh Dickins, this depends on any drivers that change the page
    offset or flags to have set one of the VM_SPECIAL bits (so that they
    cannot trigger the early merge logic), but that's true in general.

    Reported-and-tested-by: Maksim Yevmenkin
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2009

2 commits


08 Jan, 2009

1 commit

  • Make VMAs per mm_struct as for MMU-mode linux. This solves two problems:

    (1) In SYSV SHM where nattch for a segment does not reflect the number of
    shmat's (and forks) done.

    (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
    exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
    that a VMA might be shared and already have its vm_mm assigned to another
    process or a dead process.

    A new struct (vm_region) is introduced to track a mapped region and to remember
    the circumstances under which it may be shared and the vm_list_struct structure
    is discarded as it's no longer required.

    This patch makes the following additional changes:

    (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
    with no recourse to __GFP_COMP, so the pages are not composite. Instead,
    each page has a reference on it held by the region. Anything else that is
    interested in such a page will have to get a reference on it to retain it.
    When the pages are released due to unmapping, each page is passed to
    put_page() and will be freed when the page usage count reaches zero.

    (2) Excess pages are trimmed after an allocation as the allocation must be
    made as a power-of-2 quantity of pages.

    (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may
    end up with overlapping VMAs within the tree, the VMA struct address is
    appended to the sort key.

    (4) Non-anonymous VMAs are now added to the backing inode's prio list.

    (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
    the backing region. The VMA and region structs will be split if
    necessary.

    (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
    segment instead of all the attachments at that address. Multiple
    shmat()'s return the same address under NOMMU-mode instead of different
    virtual addresses as under MMU-mode.

    (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.

    (8) /proc/maps is now the global list of mapped regions, and may list bits
    that aren't actually mapped anywhere.

    (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
    of RAM currently allocated by mmap to hold mappable regions that can't be
    mapped directly. These are copies of the backing device or file if not
    anonymous.

    These changes make NOMMU mode more similar to MMU mode. The downside is
    that NOMMU mode now requires some extra memory to track things, compared
    to NOMMU without this patch (VMAs are no longer shared, and there are
    now region structs).

    Signed-off-by: David Howells
    Tested-by: Mike Frysinger
    Acked-by: Paul Mundt

    David Howells