05 Dec, 2013

1 commit

  • commit 2afc745f3e3079ab16c826be4860da2529054dd2 upstream.

    This patch fixes the problem that get_unmapped_area() can return an
    illegal address and cause mmap(2) etc. to fail.

    If an address higher than PAGE_SIZE is set in /proc/sys/vm/mmap_min_addr,
    an address lower than mmap_min_addr can be returned by
    get_unmapped_area(), even if you do not pass any virtual address hint
    (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM for a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, make that check more precise by using mmap_min_addr; they are
    not needed to solve the above problem.
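
    The core idea can be sketched in a few lines. The helper below is an
    illustration only (not part of the patch); it just shows that the
    lowest address a gap search may hand out has to be clamped by
    mmap_min_addr rather than by PAGE_SIZE alone:

    --- low_limit_sketch.c (illustrative) --------------------------
    #include <stdio.h>

    /* Lowest address the search may return: whichever is larger,
     * one page or the configured vm.mmap_min_addr. */
    static unsigned long search_low_limit(unsigned long page_size,
                                          unsigned long mmap_min_addr)
    {
        return mmap_min_addr > page_size ? mmap_min_addr : page_size;
    }

    int main(void)
    {
        /* With vm.mmap_min_addr = 65536, nothing below 0x10000 may be
         * returned even when no address hint is passed. */
        printf("low limit: %#lx\n", search_low_limit(4096, 65536));
        return 0;
    }
    ---------------------------------------------------------------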

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[])
    {
        void *ret = NULL, *last_map;
        size_t pagesize = sysconf(_SC_PAGESIZE);

        do {
            last_map = ret;
            ret = mmap(0, pagesize, PROT_NONE,
                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            // printf("ret=%p\n", ret);
        } while (ret != MAP_FAILED);

        if (errno != ENOMEM) {
            printf("ERR: unexpected errno: %d (last map=%p)\n",
                   errno, last_map);
        }

        return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Akira Takeuchi
     

20 Aug, 2013

1 commit

  • commit 2b047252d087be7f2ba088b4933cd904f92e6fce upstream.

    Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields.

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.
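
    As an illustration of that structure (toy types, not the kernel's
    struct mmu_gather), carrying the range in the gather state from
    initialization looks like this:

    --- gather_sketch.c (illustrative) -----------------------------
    #include <stdio.h>

    struct gather {
        unsigned long start;   /* first address to be invalidated */
        unsigned long end;     /* one past the last address       */
    };

    /* The range is supplied once, at setup time, so no later flush
     * path can forget to fill it in. */
    static void gather_init(struct gather *g,
                            unsigned long start, unsigned long end)
    {
        g->start = start;
        g->end = end;
    }

    int main(void)
    {
        struct gather g;

        gather_init(&g, 0x400000, 0x800000);
        printf("flush range: %#lx-%#lx\n", g.start, g.end);
        return 0;
    }
    ---------------------------------------------------------------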

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

04 Aug, 2013

1 commit

  • commit 3964acd0dbec123aa0a621973a2a0580034b4788 upstream.

    vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area,
    area is the vma the caller needs to update and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

10 May, 2013

1 commit

  • Dave reported an oops triggered by trinity:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: newseg+0x10d/0x390
    PGD cf8c1067 PUD cf8c2067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
    ...
    Call Trace:
    ipcget+0x182/0x380
    SyS_shmget+0x5a/0x60
    tracesys+0xdd/0xe2

    This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap
    failure in unaligned size request").

    Reported-by: Dave Jones
    Cc:
    Signed-off-by: Li Zefan
    Reviewed-by: Naoya Horiguchi
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Li Zefan
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit and adds the
    alignment code on the caller side. It also introduces hstate_sizelog()
    in order to get the proper hstate for the specified hugepage size.
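
    The caller-side alignment amounts to rounding the requested length up
    to the huge page size; the sketch below is illustrative (2MB is just an
    example size), not the kernel code:

    --- hugealign_sketch.c (illustrative) --------------------------
    #include <stdio.h>

    static unsigned long align_up(unsigned long len, unsigned long hpage)
    {
        return (len + hpage - 1) & ~(hpage - 1);
    }

    int main(void)
    {
        unsigned long hpage = 2UL << 20;   /* example: 2MB huge pages */
        unsigned long len = 3UL << 20;     /* not hugepage aligned    */

        printf("requested %lu, mapped %lu\n", len, align_up(len, hpage));
        return 0;
    }
    ---------------------------------------------------------------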

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Reported-by:
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Apr, 2013

7 commits

  • Fix a corner case for MAP_FIXED when requested mapping length is larger
    than rlimit for virtual memory. In such case any overlapping mappings
    are unmapped before we check for the limit and return ENOMEM.

    The check is moved before the loop that unmaps overlapping parts of
    existing mappings. When we are about to hit the limit (currently mapped
    pages + len > limit) we scan for overlapping pages and check again
    accounting for them.

    This fixes the situation where a userspace program expects the previous
    mappings to be preserved after the mmap() syscall has returned with an
    error. (POSIX clearly states that a successful mapping shall replace
    any previous mappings.)
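
    The reordering can be pictured with toy accounting (this is an
    illustration, not the kernel's may_expand_vm() logic): the pages an
    overlapping MAP_FIXED request would replace are counted before anything
    is unmapped, so an over-limit request fails with ENOMEM while every
    existing mapping stays intact.

    --- fixed_limit_sketch.c (illustrative) ------------------------
    #include <stdio.h>

    static int fixed_mapping_fits(unsigned long mapped_pages,
                                  unsigned long request_pages,
                                  unsigned long overlapping_pages,
                                  unsigned long limit_pages)
    {
        /* pages freed by the overlap are credited before the check */
        return mapped_pages + request_pages - overlapping_pages
               <= limit_pages;
    }

    int main(void)
    {
        /* 100 pages mapped, limit 120, request 50 of which 40 replace
         * existing pages: 100 + 50 - 40 = 110 <= 120, so it fits. */
        printf("fits: %d\n", fixed_mapping_fits(100, 50, 40, 120));
        return 0;
    }
    ---------------------------------------------------------------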

    This corner case was found and can be tested with LTP testcase:

    testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c

    In this case the mmap, which is clearly over current limit, unmaps
    dynamic libraries and the testcase segfaults right after returning into
    userspace.

    I've also looked at the second instance of the unmapping loop in
    do_brk(). do_brk() is called from the brk() syscall and from vm_brk().
    The brk() syscall checks for overlapping mappings and bails out when
    there are any (so it can't be triggered from the brk syscall). vm_brk()
    is called only from binfmt handlers, so it shouldn't be triggered
    unless a binfmt handler created overlapping mappings.

    Signed-off-by: Cyril Hrubis
    Reviewed-by: Mel Gorman
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyril Hrubis
     
  • Alter the admin and user reserves of the previous patches in this series
    when memory is added or removed.

    If memory is added and the reserves have been eliminated or increased
    above the default max, then we'll trust the admin.

    If memory is removed and there isn't enough free memory, then we need to
    reset the reserves.

    Otherwise keep the reserve set by the admin.

    The reserve reset code is the same as the reserve initialization code.

    I tested hot addition and removal by triggering it via sysfs. The
    reserves shrunk when they were set high and memory was removed. They
    were reset higher when memory was added again.

    [akpm@linux-foundation.org: use register_hotmemory_notifier()]
    [akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve can no longer be __meminit]
    [fengguang.wu@intel.com: make init_reserve_notifier() static]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.
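
    The knob is exercised like any other vm sysctl; a minimal userspace
    check might look like this (the /proc path matches the sysctl name
    added here, 8192 kB is simply the default discussed above, and writing
    requires root):

    --- admin_reserve_check.c (illustrative) -----------------------
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/proc/sys/vm/admin_reserve_kbytes";
        unsigned long kbytes = 0;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%lu", &kbytes) == 1)
                printf("admin_reserve_kbytes = %lu\n", kbytes);
            fclose(f);
        }

        f = fopen(path, "w");             /* needs root */
        if (f) {
            fprintf(f, "%lu\n", 8192UL);  /* back to the 8MB default */
            fclose(f);
        }
        return 0;
    }
    ---------------------------------------------------------------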

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root can't log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core.
    It is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this
    mode forces us to ensure we can fulfill all of the requested memory
    allocations -- even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.
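
    A worked example of that rule with made-up per-program numbers (in kB;
    real values come from ps output on the target system):

    --- reserve_estimate.c (illustrative) --------------------------
    #include <stdio.h>

    int main(void)
    {
        unsigned long rss[] = { 4000, 2000, 2000 };     /* sshd, bash, top */
        unsigned long vsz[] = { 80000, 115000, 20000 };
        unsigned long sum_rss = 0, max_vsz = 0;

        for (int i = 0; i < 3; i++) {
            sum_rss += rss[i];
            if (vsz[i] > max_vsz)
                max_vsz = vsz[i];
        }
        printf("'guess' reserve ~ %lu kB (sum of RSS)\n", sum_rss);
        printf("'never' reserve ~ %lu kB (max VSZ + sum of RSS)\n",
               max_vsz + sum_rss);
        return 0;
    }
    ---------------------------------------------------------------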

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.
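
    A back-of-the-envelope sketch of the 'never' mode accounting with the
    tunable reserves (plain arithmetic with example numbers, not
    __vm_enough_memory() itself):

    --- never_mode_estimate.c (illustrative) -----------------------
    #include <stdio.h>

    int main(void)
    {
        unsigned long ram_mb = 32768;   /* example machine              */
        unsigned long swap_mb = 0;      /* diskless, no swap            */
        unsigned long ratio = 50;       /* vm.overcommit_ratio default  */
        unsigned long admin_mb = 8;     /* reserve kept for root        */
        unsigned long user_mb = 128;    /* reserve for other user tasks */

        unsigned long allowed = swap_mb + ram_mb * ratio / 100
                                - admin_mb - user_mb;

        printf("max committable: ~%lu MB\n", allowed);
        return 0;
    }
    ---------------------------------------------------------------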

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, reclaimable slab pages into consideration. In other words,
    these are all pages available, then why isn't overcommit suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   -----   -------------   --------------
    guess        yes    1      5432/5432       no      yes             yes
    guess        yes    4      5444/5444       1       yes             yes
    guess        no     1      5302/5449       no      yes             yes
    guess        no     4      -               crash   no              no

    never        yes    1      5460/5460       1       yes             yes
    never        yes    4      5460/5460       1       yes             yes
    never        no     1      5218/5432       no      no              yes
    never        no     4      5203/5448       no      no              yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ----------   ----   ----   -------------   -----   -------------   --------------
    guess        yes    1      5419/5419       no      - yes           8MB yes
    guess        yes    4      5436/5436       1       - yes           8MB yes
    guess        no     1      5440/5440       *       - yes           8MB yes
    guess        no     4      -               crash   - no            8MB no

    * process would successfully mlock, then the oom killer would pick it

    never        yes    1      5446/5446       no      10MB yes        20MB yes
    never        yes    4      5456/5456       no      10MB yes        20MB yes
    never        no     1      5387/5429       no      128MB no        8MB barely
    never        no     1      5323/5428       no      226MB barely    8MB barely
    never        no     1      5323/5428       no      226MB barely    8MB barely

    never        no     1      5359/5448       no      10MB no         10MB barely

    never        no     1      5323/5428       no      0MB no          10MB barely
    never        no     1      5332/5428       no      0MB no          50MB yes
    never        no     1      5293/5429       no      0MB no          90MB yes

    never        no     1      5001/5427       no      230MB yes       338MB yes
    never        no     4*     4998/5424       no      230MB yes       338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Using mbind to change the mempolicy to MPOL_BIND on several adjacent
    mmapped blocks may result in a reset of the mempolicy to MPOL_DEFAULT in
    vma_adjust.

    Test code. Correct result is three lines containing "OK".

    #include <stdio.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/mman.h>
    #include <numaif.h>

    /* gcc mbind_test.c -lnuma -o mbind_test -Wall */
    #define MAXNODE 4096

    void allocate()
    {
        int ret;
        int len;
        int policy = -1;
        unsigned char *p;
        unsigned long mask[MAXNODE] = { 0 };
        unsigned long retmask[MAXNODE] = { 0 };

        len = getpagesize() * 0x2fc00;
        p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
                 -1, 0);
        if (p == MAP_FAILED)
            printf("mbind err: %d\n", errno);

        mask[0] = 1;
        ret = mbind(p, len, MPOL_BIND, mask, MAXNODE, 0);
        if (ret < 0)
            printf("mbind err: %d %d\n", ret, errno);
        ret = get_mempolicy(&policy, retmask, MAXNODE, p, MPOL_F_ADDR);
        if (ret < 0)
            printf("get_mempolicy err: %d %d\n", ret, errno);

        if (policy == MPOL_BIND)
            printf("OK\n");
        else
            printf("ERROR: policy is %d\n", policy);
    }

    int main()
    {
        allocate();
        allocate();
        allocate();
        return 0;
    }

    Signed-off-by: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hampson, Steven T
     
  • On architectures where a pgd entry may be shared between user and kernel
    (e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0.
    This patch introduces a generic USER_PGTABLES_CEILING that arch code can
    override. It is the responsibility of the arch code setting the ceiling
    to ensure the complete freeing of the page tables (usually in
    pgd_free()).

    [catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Catalin Marinas
    Cc: Russell King
    Cc: [3.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the WARN_ON_ONCE(!mm) check as the comment suggested. Kernel
    code calls find_vma only when it is absolutely sure that the mm_struct
    arg to it is non-NULL.

    Signed-off-by: Zhang Yanfei
    Cc: k80c
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

05 Apr, 2013

1 commit

  • find_vma() can be called by multiple threads with read lock
    held on mm->mmap_sem and any of them can update mm->mmap_cache.
    Prevent compiler from re-fetching mm->mmap_cache, because other
    readers could update it in the meantime:

    thread 1                                thread 2
                                            |
    find_vma()                              |  find_vma()
      struct vm_area_struct *vma = NULL;    |
      vma = mm->mmap_cache;                 |
      if (!(vma && vma->vm_end > addr       |
          && vma->vm_start <= addr)) {      |
                                            |    mm->mmap_cache = vma;
      return vma;                           |
       ^^ compiler may optimize this        |
          local variable out and re-read    |
          mm->mmap_cache                    |
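
    The usual cure is to force a single load of the shared pointer; the toy
    below illustrates that idiom with a volatile access (illustrative
    types, not the kernel's find_vma()):

    --- load_once_sketch.c (illustrative) --------------------------
    #include <stdio.h>

    #define LOAD_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    struct vma { unsigned long vm_start, vm_end; };

    static struct vma *mmap_cache;     /* shared, updated by other readers */

    static struct vma *cached_lookup(unsigned long addr)
    {
        struct vma *vma = LOAD_ONCE(mmap_cache);   /* exactly one load */

        if (vma && vma->vm_end > addr && vma->vm_start <= addr)
            return vma;
        return NULL;
    }

    int main(void)
    {
        static struct vma v = { 0x1000, 0x2000 };

        mmap_cache = &v;
        printf("hit: %p\n", (void *)cached_lookup(0x1800));
        return 0;
    }
    ---------------------------------------------------------------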

    This issue can be reproduced with gcc-4.8.0-1 on s390x by running
    mallocstress testcase from LTP, which triggers:

    kernel BUG at mm/rmap.c:1088!
    Call Trace:
    ([] 0x3d100c57000)
    [] do_wp_page+0x2fc/0xa88
    [] handle_pte_fault+0x41a/0xac8
    [] handle_mm_fault+0x17a/0x268
    [] do_protection_exception+0x1e2/0x394
    [] pgm_check_handler+0x138/0x13c
    [] 0x3fffcf1f07a
    Last Breaking-Event-Address:
    [] page_add_new_anon_rmap+0xc2/0x168

    Thanks to Jakub Jelinek for his insight on gcc and helping to
    track this down.

    Signed-off-by: Jan Stancek
    Acked-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Feb, 2013

1 commit

  • The stack vma is designed to grow automatically (marked with VM_GROWSUP
    or VM_GROWSDOWN depending on architecture) when an access is made beyond
    the existing boundary. However, particularly if you have not limited
    your stack at all ("ulimit -s unlimited"), this can cause the stack to
    grow even if the access was really just one past *another* segment.

    And that's wrong, especially since we first grow the segment, but then
    immediately later enforce the stack guard page on the last page of the
    segment. So _despite_ first growing the stack segment as a result of
    the access, the kernel will then make the access cause a SIGSEGV anyway!

    So do the same logic as the guard page check does, and consider an
    access to within one page of the next segment to be a bad access, rather
    than growing the stack to abut the next segment.
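
    The rule boils down to a one-line check; the sketch below is an
    illustration of it (toy helper, not the kernel's expand_upwards()):

    --- stack_guard_sketch.c (illustrative) ------------------------
    #include <stdio.h>

    /* Growing to 'address' is refused if that would leave less than one
     * page between the stack and the next segment. */
    static int stack_may_grow_to(unsigned long address,
                                 unsigned long next_vma_start,
                                 unsigned long page_size)
    {
        return address + page_size < next_vma_start;
    }

    int main(void)
    {
        unsigned long page = 4096, next = 0x80000000UL;

        printf("%d\n", stack_may_grow_to(next - 2 * page, next, page)); /* 1 */
        printf("%d\n", stack_may_grow_to(next - page, next, page));     /* 0 */
        return 0;
    }
    ---------------------------------------------------------------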

    Reported-and-tested-by: Heiko Carstens
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

8 commits

  • The comment in commit 4fc3f1d66b1e ("mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable") says:

    | Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    | to make it clearer that it's an exclusive write-lock in
    | that case - suggested by Rik van Riel.

    But that commit renames only anon_vma_lock()

    Signed-off-by: Konstantin Khlebnikov
    Cc: Ingo Molnar
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    swap_lock is heavily contended when I test swap to 3 fast SSDs (even
    slightly slower than swap to 2 such SSDs). The main contention comes
    from swap_info_get(). This patch tries to fix the gap with adding a new
    per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory, it's possible that get_swap_page() finds no swap pages even
    though there are free swap pages, but that doesn't sound like a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags needs to hold both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading
    the flags is OK while holding either of the locks.

    If both swap_lock and swap_info_struct.lock must be held, we always
    take the former first to avoid deadlock.
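
    That ordering rule is the classic way to keep a global/per-object lock
    pair deadlock-free; a userspace illustration with pthread mutexes (not
    the kernel's spinlocks):

    --- lock_order_sketch.c (illustrative, link with -lpthread) ----
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t partition_lock = PTHREAD_MUTEX_INITIALIZER;

    static void change_partition_flags(void)
    {
        pthread_mutex_lock(&swap_lock);        /* global lock first ...    */
        pthread_mutex_lock(&partition_lock);   /* ... then the per-partition
                                                  lock, never the reverse  */
        /* update the per-partition flags here */
        pthread_mutex_unlock(&partition_lock);
        pthread_mutex_unlock(&swap_lock);
    }

    int main(void)
    {
        change_partition_flags();
        puts("flags updated with a consistent lock order");
        return 0;
    }
    ---------------------------------------------------------------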

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in
    practice, swap_lock isn't heavily contended in my test with this patch
    (or rather, there are other much heavier bottlenecks like TLB flush).
    And by the way, it looks like get_swap_page() doesn't really need the
    lock: we never free swap_info[] and we check the SWAP_WRITEOK flag.
    The only risk without the lock is that we could swap out to some
    low-priority swap device, but we can quickly recover after several
    rounds of swap, so that doesn't seem like a big deal to me. But I'd
    prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
    multiple; however, there was no equivalent code in mm_populate(), which
    caused issues.

    This could be fixed by introducing the same rounding in mm_populate(),
    but I think it's preferable to make do_mmap_pgoff() return populate as
    a size rather than as a boolean, so we don't have to duplicate the size
    rounding logic in mm_populate().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The vm_populate() code populates user mappings without constantly
    holding the mmap_sem. This makes it susceptible to racy userspace
    programs: the user mappings may change while vm_populate() is running,
    and in this case vm_populate() may end up populating the new mapping
    instead of the old one.

    In order to reduce the possibility of userspace getting surprised by
    this behavior, this change introduces the VM_POPULATE vma flag which
    gets set on vmas we want vm_populate() to work on. This way
    vm_populate() may still end up populating the new mapping after such a
    race, but only if the new mapping is also one that the user has
    requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
    the vma type - we know we're working with a stack. So, we can call
    directly into __mlock_vma_pages_range(), and remove the last
    make_pages_present() call site.

    Note that we don't use mm_populate() here, so we can't release the
    mmap_sem while allocating new stack pages. This is deemed acceptable,
    because the stack vmas grow by a bounded number of pages at a time, and
    these are anon pages so we don't have to read from disk to populate
    them.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • After the MAP_POPULATE handling has been moved to mmap_region() call
    sites, the only remaining use of the flags argument is to pass the
    MAP_NORESERVE flag. This can be just as easily handled by
    do_mmap_pgoff(), so do that and remove the mmap_region() flags
    parameter.

    [akpm@linux-foundation.org: remove double parens]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
    with MCL_FUTURE in effect), we want to populate the pages within the
    newly created vmas. This may take a while as we may have to read pages
    from disk, so ideally we want to do this outside of the write-locked
    mmap_sem region.

    This change introduces mm_populate(), which is used to defer populating
    such mappings until after the mmap_sem write lock has been released.
    This is implemented as a generalization of the former do_mlock_pages(),
    which accomplished the same task but was used during mlock() /
    mlockall().

    Signed-off-by: Michel Lespinasse
    Reported-by: Andy Lutomirski
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

23 Feb, 2013

1 commit


20 Feb, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

08 Feb, 2013

1 commit

  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

05 Feb, 2013

1 commit

    We have used an rwsem since commit 5a505085f043 ("mm/rmap: Convert the
    struct anon_vma::mutex to an rwsem"), and most of the comments were
    converted to the new rwsem lock; just two more were missed, as found by:

    $ git grep 'anon_vma->mutex'

    Signed-off-by: Yuanhan Liu
    Acked-by: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yuanhan Liu
     

12 Jan, 2013

1 commit

  • Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an
    rwsem") turned anon_vma mutex to rwsem.

    However, the properly annotated nested locking in mm_take_all_locks()
    has been converted from

    mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);

    to

    down_write(&anon_vma->root->rwsem);

    which is incomplete, and causes the false positive report from lockdep
    below.

    Annotate the fact that mmap_sem is used as an outer lock to serialize
    taking of all the anon_vma rwsems at once no matter the order, using the
    down_write_nest_lock() primitive.

    This patch fixes this lockdep report:

    =============================================
    [ INFO: possible recursive locking detected ]
    3.8.0-rc2-00036-g5f73896 #171 Not tainted
    ---------------------------------------------
    qemu-kvm/2315 is trying to acquire lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    but task is already holding lock:
    (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&anon_vma->rwsem);
    lock(&anon_vma->rwsem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by qemu-kvm/2315:
    #0: (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170
    #1: (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0
    #2: (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0
    #3: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0

    stack backtrace:
    Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171
    Call Trace:
    print_deadlock_bug+0xf2/0x100
    validate_chain+0x4f6/0x720
    __lock_acquire+0x359/0x580
    lock_acquire+0x121/0x190
    down_write+0x3f/0x70
    mm_take_all_locks+0x149/0x1b0
    do_mmu_notifier_register+0x68/0x170
    mmu_notifier_register+0xe/0x10
    kvm_create_vm+0x22b/0x330 [kvm]
    kvm_dev_ioctl+0xf8/0x1a0 [kvm]
    do_vfs_ioctl+0x9d/0x350
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Jiri Kosina
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • expand_stack() runs with a shared mmap_sem lock. Because of this, there
    could be multiple concurrent stack expansions in the same mm, which may
    cause problems in the vma gap update code.

    I propose to solve this by taking the mm->page_table_lock around such vma
    expansions, in order to avoid the concurrency issue. We only have to
    worry about concurrent expand_stack() calls here, since we hold a shared
    mmap_sem lock and all vma modifications other than expand_stack() are done
    under an exclusive mmap_sem lock.

    I previously tried to achieve the same effect by making sure all growable
    vmas in a given mm would share the same anon_vma, which we already lock
    here. However this turned out to be difficult - all of the schemes I
    tried for refcounting the growable anon_vma and clearing turned out ugly.
    So, I'm now proposing only the minimal fix.

    The overhead of taking the page table lock during stack expansion is
    expected to be small: glibc doesn't use expandable stacks for the threads
    it creates, so having multiple growable stacks is actually uncommon and we
    don't expect the page table lock to get bounced between threads.

    Signed-off-by: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
    While reviewing the source code, I found a comment which mentions that
    after f_op->mmap(), a vma's start address can be changed. I didn't
    verify that this is really possible, because there are so many
    f_op->mmap() implementations. But if some mmap() implementation does
    change the vma's start address, it is a potential error situation,
    because we have already prepared the prev vma, rb_link and rb_parent,
    and these are related to the original address.

    So add a WARN_ON_ONCE to catch it if this situation really happens.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Dec, 2012

5 commits

  • Merge misc updates from Andrew Morton:
    "About half of most of MM. Going very early this time due to
    uncertainty over the coreautounifiednumasched things. I'll send the
    other half of most of MM tomorrow. The rest of MM awaits a slab merge
    from Pekka."

    * emailed patches from Andrew Morton: (71 commits)
    memory_hotplug: ensure every online node has NORMAL memory
    memory_hotplug: handle empty zone when online_movable/online_kernel
    mm, memory-hotplug: dynamic configure movable memory and portion memory
    drivers/base/node.c: cleanup node_state_attr[]
    bootmem: fix wrong call parameter for free_bootmem()
    avr32, kconfig: remove HAVE_ARCH_BOOTMEM
    mm: cma: remove watermark hacks
    mm: cma: skip watermarks check for already isolated blocks in split_free_page()
    mm, oom: fix race when specifying a thread as the oom origin
    mm, oom: change type of oom_score_adj to short
    mm: cleanup register_node()
    mm, mempolicy: remove duplicate code
    mm/vmscan.c: try_to_freeze() returns boolean
    mm: introduce putback_movable_pages()
    virtio_balloon: introduce migration primitives to balloon pages
    mm: introduce compaction and migration for ballooned pages
    mm: introduce a common interface for balloon pages mobility
    mm: redefine address_space.assoc_mapping
    mm: adjust address_space_operations.migratepage() return code
    arch/sparc/kernel/sys_sparc_64.c: s/COLOUR/COLOR/
    ...

    Linus Torvalds
     
    Implement vm_unmapped_area() using the rb_subtree_gap and highest_vm_end
    information to look up suitable virtual address space gaps.

    struct vm_unmapped_area_info is used to define the desired allocation
    request:
    - lowest or highest possible address matching the remaining constraints
    - desired gap length
    - low/high address limits that the gap must fit into
    - alignment mask and offset

    Also update the generic arch_get_unmapped_area[_topdown] functions to make
    use of vm_unmapped_area() instead of implementing a brute force search.
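
    A toy mirror of such a request makes the fields concrete (names follow
    the description above; this is an illustration, not the kernel's
    definition):

    --- gap_request_sketch.c (illustrative) ------------------------
    #include <stdio.h>

    struct unmapped_area_request {
        int topdown;                  /* search from the highest address down */
        unsigned long length;         /* desired gap length                   */
        unsigned long low_limit;      /* gap must start at or above this      */
        unsigned long high_limit;     /* gap must end at or below this        */
        unsigned long align_mask;     /* alignment mask ...                   */
        unsigned long align_offset;   /* ... and offset within the alignment  */
    };

    int main(void)
    {
        /* e.g. a 2MB-aligned 16MB area somewhere below 3GB */
        struct unmapped_area_request req = {
            .topdown = 1,
            .length = 16UL << 20,
            .low_limit = 1UL << 16,
            .high_limit = 0xC0000000UL,
            .align_mask = (2UL << 20) - 1,
            .align_offset = 0,
        };

        printf("want %lu bytes in [%#lx, %#lx)\n",
               req.length, req.low_limit, req.high_limit);
        return 0;
    }
    ---------------------------------------------------------------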

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is correctly
    set for every vma and that mm->highest_vm_end is also correct.

    Also add an explicit 'bug' variable to track if browse_rb() detected any
    invalid condition.

    [akpm@linux-foundation.org: repair innovative coding-style inventions]
    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Define vma->rb_subtree_gap as the largest gap between any vma in the
    subtree rooted at that vma, and their predecessor. Or, for a recursive
    definition, vma->rb_subtree_gap is the max of:

    - vma->vm_start - vma->vm_prev->vm_end
    - rb_subtree_gap fields of the vmas pointed by vma->rb.rb_left and
    vma->rb.rb_right

    This will allow get_unmapped_area_* to find a free area of the right
    size in O(log(N)) time, instead of potentially having to do a linear
    walk across all the VMAs.

    Also define mm->highest_vm_end as the vm_end field of the highest vma,
    so that we can easily check if the following gap is suitable.

    This does have the potential to make unmapping VMAs more expensive,
    especially for processes with very large numbers of VMAs, where the VMA
    rbtree can grow quite deep.
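
    The recursive definition translates directly into the update step for a
    single node; the toy below illustrates it (simplified node layout, not
    the kernel's vm_area_struct):

    --- subtree_gap_sketch.c (illustrative) ------------------------
    #include <stdio.h>

    struct node {
        unsigned long vm_start, vm_end;
        unsigned long prev_vm_end;        /* vm_end of the predecessor vma */
        struct node *left, *right;
        unsigned long rb_subtree_gap;
    };

    static unsigned long max3(unsigned long a, unsigned long b, unsigned long c)
    {
        unsigned long m = a > b ? a : b;
        return m > c ? m : c;
    }

    /* rb_subtree_gap of a node: max of its own gap to its predecessor and
     * the subtree gaps already stored in its children. */
    static unsigned long compute_subtree_gap(const struct node *n)
    {
        return max3(n->vm_start - n->prev_vm_end,
                    n->left ? n->left->rb_subtree_gap : 0,
                    n->right ? n->right->rb_subtree_gap : 0);
    }

    int main(void)
    {
        struct node a = { 0x3000, 0x4000, 0x1000, NULL, NULL, 0x2000 };
        struct node b = { 0x9000, 0xa000, 0x4000, &a, NULL, 0 };

        b.rb_subtree_gap = compute_subtree_gap(&b);
        printf("rb_subtree_gap = %#lx\n", b.rb_subtree_gap);  /* 0x5000 */
        return 0;
    }
    ---------------------------------------------------------------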

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, this makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straightforward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now only symbols for x86 and some other
    architecture for 1GB and 2MB are defined. The interface should already
    work for all other architectures though. Only architectures that define
    multiple hugetlb sizes actually need it (that is currently x86, tile,
    powerpc). However, tile and powerpc have user-configurable hugetlb
    sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropriate log2.
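
    From userspace the encoding is just a shifted log2 in the mmap/shmget
    flags; a minimal usage sketch (the fallback defines mirror the values
    this interface uses, 2MB pages are only an example, and the call
    succeeds only if 2MB huge pages are configured on the system):

    --- map_huge_2mb.c (illustrative) ------------------------------
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_2MB
    #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)   /* log2(2MB) = 21 */
    #endif

    int main(void)
    {
        size_t len = 2UL << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                       -1, 0);

        if (p == MAP_FAILED)
            perror("mmap(MAP_HUGETLB | MAP_HUGE_2MB)");
        else
            puts("mapped 2MB with an explicitly sized huge page");
        return 0;
    }
    ---------------------------------------------------------------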

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

11 Dec, 2012

2 commits

    rmap_walk_anon() and try_to_unmap_anon() appear to be too
    careful about locking the anon vma: while it needs protection
    against anon vma list modifications, it does not need exclusive
    access to the list itself.

    Transforming this exclusive lock to a read-locked rwsem removes
    a global lock from the hot path of page-migration intense
    threaded workloads which can cause pathological performance like
    this:

    96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
    |
    --- perf_trace_sched_switch
    __schedule
    schedule
    schedule_preempt_disabled
    __mutex_lock_common.isra.6
    __mutex_lock_slowpath
    mutex_lock
    |
    |--50.61%-- rmap_walk
    | move_to_new_page
    | migrate_pages
    | migrate_misplaced_page
    | __do_numa_page.isra.69
    | handle_pte_fault
    | handle_mm_fault
    | __do_page_fault
    | do_page_fault
    | page_fault
    | __memset_sse2
    | |
    | --100.00%-- worker_thread
    | |
    | --100.00%-- start_thread
    |
    --49.39%-- page_lock_anon_vma
    try_to_unmap_anon
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    __do_numa_page.isra.69
    handle_pte_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault
    __memset_sse2
    |
    --100.00%-- worker_thread
    start_thread

    With this change applied the profile is now nicely flat
    and there's no anon-vma related scheduling/blocking.

    Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    to make it clearer that it's an exclusive write-lock in
    that case - suggested by Rik van Riel.

    Suggested-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     
  • Convert the struct anon_vma::mutex to an rwsem, which will help
    in solving a page-migration scalability problem. (Addressed in
    a separate patch.)

    The conversion is simple and straightforward: in every case
    where we mutex_lock()ed we'll now down_write().

    Suggested-by: Linus Torvalds
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     

17 Nov, 2012

1 commit