03 Jul, 2015

1 commit

  • …scm/linux/kernel/git/paulg/linux

    Pull module_init replacement part two from Paul Gortmaker:
    "Replace module_init with appropriate alternate initcall in non
    modules.

    This series converts non-modular code that is using the module_init()
    call to hook itself into the system to instead use one of our
    alternate priority initcalls.

    Unlike the previous series that used device_initcall and hence was a
    runtime no-op, these commits change to one of the alternate initcalls,
    because (a) we have them and (b) it seems like the right thing to do.

    For example, it would seem logical to use arch_initcall for arch
    specific setup code and fs_initcall for filesystem setup code.

This does mean, however, that changes in the init ordering will be
taking place, and so there is a small risk that some kind of implicit
init-ordering issue may be uncovered. But I think it is still better
to give these ones sensible priorities than to assign them all to
device_initcall just to preserve the old ordering exactly.

That said, we have already made similar changes in core kernel code in
commit c96d6660dc65 ("kernel: audit/fix non-modular users of
module_init in core code") without any regressions reported, so this
type of change isn't without precedent. It has also had the same
local testing and linux-next coverage as all the other pull requests
I'm sending for this merge window.

    Once again, there is an unused module_exit function removal that shows
    up as an outlier upon casual inspection of the diffstat"

    * tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
    x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
    mm/page_owner.c: use late_initcall to hook in enabling
    lib/list_sort: use late_initcall to hook in self tests
    arm: use subsys_initcall in non-modular pl320 IPC code
    powerpc: don't use module_init for non-modular core hugetlb code
    powerpc: use subsys_initcall for Freescale Local Bus
    x86: don't use module_init for non-modular core bootflag code
    netfilter: don't use module_init/exit in core IPV4 code
    fs/notify: don't use module_init for non-modular inotify_user code
    mm: replace module_init usages with subsys_initcall in nommu.c

    Linus Torvalds
     

25 Jun, 2015

1 commit

kenter/kleave/kdebug are wrapper macros to print function flow and debug
information. This set was written before pr_devel() was introduced, so it
was controlled by an "#if 0" construct. It is questionable whether anyone
is using them [1] now.

This patch removes these macros, converts numerous printk(KERN_WARNING
...) calls to the generic pr_warn(...), and removes a debug print line
from the validate_mmap_request() function.

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Romanovsky
     

17 Jun, 2015

1 commit

  • Compiling some arm/m68k configs with "# CONFIG_MMU is not set" reveals
    two more instances of module_init being used for code that can't
    possibly be modular, as CONFIG_MMU is either on or off.

    We replace them with subsys_initcall as per what was done in other
    mmu-enabled code.

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at the actual init
    functions themselves, we see they are just sysctl setup stuff, and
    hence the choice of subsys_initcall (l4) seems reasonable. At the same
    time it minimizes the risk of changing the priority too drastically all
    at once. We can adjust further in the future.

    Also, a couple instances of missing ";" at EOL are fixed.
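
    The level ordering described above can be mimicked in ordinary userspace C
    with GCC constructor priorities; this is only an analogy (the function
    names below are made up), but it shows the same lower-level-runs-first
    behavior that moving a registration from device to subsys level relies on:

    ```c
    #include <stdio.h>

    /* GCC runs constructors in ascending priority order, mirroring how the
     * kernel walks initcall levels: a lower level runs earlier. The names
     * here are illustrative, not real kernel functions. */
    static char order[8];
    static int n;

    __attribute__((constructor(104))) static void subsys_setup(void) { order[n++] = 's'; }
    __attribute__((constructor(106))) static void device_setup(void) { order[n++] = 'd'; }
    __attribute__((constructor(107))) static void late_setup(void)   { order[n++] = 'l'; }

    int main(void)
    {
        /* subsys-level setup ran before device-level, which ran before late */
        printf("%s\n", order);
        return 0;
    }
    ```

    Compiled with gcc or clang, this prints the constructors in priority
    order, which is exactly the property the reordered initcalls depend on.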

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Apr, 2015

1 commit


13 Mar, 2015

1 commit

Several modules may need max_mapnr, so export it. The related errors
with allmodconfig under c6x:

    MODPOST 3327 modules
    ERROR: "max_mapnr" [fs/pstore/ramoops.ko] undefined!
    ERROR: "max_mapnr" [drivers/media/v4l2-core/videobuf2-dma-contig.ko] undefined!

    Signed-off-by: Chen Gang
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    gchen gchen
     

01 Mar, 2015

1 commit

  • Maxime reported the following memory leak regression due to commit
    dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own
    implementation").

    On v3.19, I am facing a memory leak. Each time I run a command one page
    is lost. Here an example with busybox's free command:

    / # free
    total used free shared buffers cached
    Mem: 7928 1972 5956 0 0 492
    -/+ buffers/cache: 1480 6448
    / # free
    total used free shared buffers cached
    Mem: 7928 1976 5952 0 0 492
    -/+ buffers/cache: 1484 6444
    / # free
    total used free shared buffers cached
    Mem: 7928 1980 5948 0 0 492
    -/+ buffers/cache: 1488 6440
    / # free
    total used free shared buffers cached
    Mem: 7928 1984 5944 0 0 492
    -/+ buffers/cache: 1492 6436
    / # free
    total used free shared buffers cached
    Mem: 7928 1988 5940 0 0 492
    -/+ buffers/cache: 1496 6432

At some point, the system fails to satisfy 256KB allocations:

    free: page allocation failure: order:6, mode:0xd0
    CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
    Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
    Mem-info:
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    2048 pages of RAM
    1538 free pages
    66 reserved pages
    109 slab pages
    -46 pages shared
    0 pages swap cached
    nommu: Allocation of length 221184 from process 67 (free) failed
    Normal per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:123 dirty:0 writeback:0 unstable:0
    free:1515 slab_reclaimable:17 slab_unreclaimable:139
    mapped:0 shmem:0 pagetables:0 bounce:0
    free_cma:0
    Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
    lowmem_reserve[]: 0 0
    Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
    123 total pagecache pages
    Unable to allocate RAM for process text/data, errno 12 SEGV

This problem happens because in some cases we allocate a high-order page
through __get_free_pages() in do_mmap_private(), but then try to free
individual pages rather than the ordered page in free_page_series(). In
this case, pages whose refcount is not 0 won't be freed to the
page allocator, so a memory leak happens.

To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact(), since alloc_pages_exact() returns
physically contiguous pages of which each page is individually refcounted.
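
    As a rough userspace illustration (plain arithmetic, not kernel code) of
    why order-based allocation over-allocates: an order-N request rounds up to
    2^N pages, while alloc_pages_exact() returns only as many pages as needed.
    The 221184-byte request below matches the order:6 failure in the log above:

    ```c
    #include <stdio.h>

    int main(void)
    {
        unsigned long len = 221184;                /* failing request from the log */
        unsigned long pages = (len + 4095) / 4096; /* 54 pages actually needed */

        int order = 0;
        while ((1UL << order) < pages)
            order++;                               /* rounds up to order 6 */

        printf("pages=%lu order=%d allocated=%lu\n",
               pages, order, 1UL << order);
        return 0;
    }
    ```

    So an order-6 allocation hands back 64 contiguous pages for a 54-page
    request; freeing those 64 pages one by one is what went wrong here.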

    Fixes: dbc8358c7237 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
    Reported-by: Maxime Coquelin
    Tested-by: Maxime Coquelin
    Signed-off-by: Joonsoo Kim
    Cc: [3.19]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

13 Feb, 2015

1 commit

  • Pull backing device changes from Jens Axboe:
    "This contains a cleanup of how the backing device is handled, in
    preparation for a rework of the life time rules. In this part, the
    most important change is to split the unrelated nommu mmap flags from
    it, but also removing a backing_dev_info pointer from the
    address_space (and inode), and a cleanup of other various minor bits.

    Christoph did all the work here, I just fixed an oops with pages that
    have a swap backing. Arnd fixed a missing export, and Oleg killed the
    lustre backing_dev_info from staging. Last patch was from Al,
    unexporting parts that are now no longer needed outside"

    * 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
    Make super_blocks and sb_lock static
    mtd: export new mtd_mmap_capabilities
    fs: make inode_to_bdi() handle NULL inode
    staging/lustre/llite: get rid of backing_dev_info
    fs: remove default_backing_dev_info
    fs: don't reassign dirty inodes to default_backing_dev_info
    nfs: don't call bdi_unregister
    ceph: remove call to bdi_unregister
    fs: remove mapping->backing_dev_info
    fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
    nilfs2: set up s_bdi like the generic mount_bdev code
    block_dev: get bdev inode bdi directly from the block device
    block_dev: only write bdev inode on close
    fs: introduce f_op->mmap_capabilities for nommu mmap support
    fs: kill BDI_CAP_SWAP_BACKED
    fs: deduplicate noop_backing_dev_info

    Linus Torvalds
     

12 Feb, 2015

3 commits

  • I noticed that "allowed" can easily overflow by falling below 0, because
    (total_vm / 32) can be larger than "allowed". The problem occurs in
    OVERCOMMIT_NONE mode.

    In this case, a huge allocation can success and overcommit the system
    (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall
    (system-wide), so system become unusable.

    The problem was masked out by commit c9b1d0981fcc
    ("mm: limit growth of 3% hardcoded other user reserve"),
    but it's easy to reproduce it on older kernels:
    1) set overcommit_memory sysctl to 2
    2) mmap() large file multiple times (with VM_SHARED flag)
    3) try to malloc() large amount of memory

It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.

    Fix this issue by switching to signed arithmetic here.
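
    A minimal userspace sketch of the bug (illustrative numbers, not the
    kernel's actual accounting): subtracting a larger reserve from an unsigned
    "allowed" wraps to a huge value that passes the limit check, whereas
    signed arithmetic goes properly negative:

    ```c
    #include <stdio.h>

    /* "reserve" plays the role of total_vm / 32, which can exceed what the
     * overcommit policy would otherwise allow. */
    int main(void)
    {
        unsigned long allowed = 100;
        unsigned long reserve = 4000;

        unsigned long u = allowed - reserve;      /* unsigned: wraps to a huge value */
        long s = (long)allowed - (long)reserve;   /* signed: goes properly negative */

        printf("wrapped above allowed: %d\n", u > allowed);
        printf("signed result: %ld\n", s);
        return 0;
    }
    ```

    The wrapped unsigned result looks like an enormous remaining allowance,
    which is exactly why the huge allocation slips through.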

    Signed-off-by: Roman Gushchin
    Cc: Andrew Shewmaker
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON
to get a proper -EHWPOISON retval instead of -EFAULT, to take a more
appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading, to reduce the mmap_sem contention (for writing), such as while
waiting for I/O completion. The problem is that right now practically no
get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
that nifty feature.

Andres fixed it for the KVM page fault. However, get_user_pages_fast
remains uncovered, and 99% of other get_user_pages callers aren't using it
either (the only exception being FOLL_NOWAIT in KVM, which is really
nonblocking and in fact doesn't even release the mmap_sem).

So this patchset extends the optimization Andres did in the KVM page
fault to the whole kernel. It makes the most important places (including
gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.

The few places that remain uncovered are drivers like v4l and other
exceptions that tend to work on their own memory rather than on random
user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
fully covered by this patch).

A follow-up patch should probably also add a printk_once warning to
get_user_pages, which should become obsolete and be phased out eventually.
The "vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes meaningless the moment
the mmap_sem is released).

    While this is just an optimization, this becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

Userfaultfd allows blocking the page fault, and in order to do so I
need to drop the mmap_sem first. So this patch also ensures that, for all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault or a get_user_pages) always has
FAULT_FLAG_ALLOW_RETRY set. The userfaultfd then blocks, and is woken
only when the pagetable is already mapped. The second fault attempt after
the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's ok to retry
without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    to:

    int locked = 1;

    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

    The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't NULL; for the former we
just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

11 Feb, 2015

1 commit

remap_file_pages(2) was invented to efficiently map parts of a huge
file into a limited 32-bit virtual address space, such as in database
workloads.

Nonlinear mappings are a pain to support, and it seems there are no
legitimate use-cases nowadays, since 64-bit systems are widely available.

Let's drop it and get rid of all this special-cased code.

The patch replaces the syscall with emulation which creates a new VMA on
each remap_file_pages(), unless it can be merged with an adjacent
one.

    I didn't find *any* real code that uses remap_file_pages(2) to test
    emulation impact on. I've checked Debian code search and source of all
    packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind and this kind of stuff.

    There are few basic tests in LTP for the syscall. They work just fine
    with emulation.

To test the performance impact, I've written a small test case which
demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
write the page's pgoff to the beginning of every page, remap the pages in
reverse order, then read every page.

The test creates 1 million VMAs if emulation is in use, so I had to
set vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Feb, 2015

1 commit

  • The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but
    only one of them is exported, which leads to module build errors in
    drivers that work fine built-in:

    ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined!
    ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined!
    ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined!
    ERROR: "high_memory" [crypto/tcrypt.ko] undefined!
    ERROR: "high_memory" [crypto/cts.ko] undefined!

    This exports the symbol to get these to work on NOMMU as well.

    Signed-off-by: Arnd Bergmann
    Cc: Kirill A. Shutemov
    Acked-by: Greg Ungerer
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Jan, 2015

1 commit

  • Since "BDI: Provide backing device capability information [try #3]" the
    backing_dev_info structure also provides flags for the kind of mmap
    operation available in a nommu environment, which is entirely unrelated
    to it's original purpose.

    Introduce a new nommu-only file operation to provide this information to
    the nommu mmap code instead. Splitting this from the backing_dev_info
    structure allows to remove lots of backing_dev_info instance that aren't
    otherwise needed, and entirely gets rid of the concept of providing a
    backing_dev_info for a character device. It also removes the need for
    the mtd_inodefs filesystem.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Tejun Heo
    Acked-by: Brian Norris
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Dec, 2014

3 commits

do_mmap_private() in nommu.c tries to allocate physically contiguous pages
of arbitrary size in some cases, and we now have a good abstraction that
does exactly the same thing, alloc_pages_exact(). So, change it to use that.

There is no functional change. This is a preparation step for accurately
supporting the page owner feature.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone). AFAICT the only user is ramfs, to handle the size-change
attribute.

Pretty much a no-brainer to share the lock.

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

08 Sep, 2014

1 commit

The percpu allocator now supports an allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

09 Aug, 2014

1 commit

  • The core mm code will provide a default gate area based on
    FIXADDR_USER_START and FIXADDR_USER_END if
    !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).

    This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
    UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
    and x86_64 have gate areas, but they have their own implementations.

    This gets rid of the default and moves the code into ia64.

    This should save some code on architectures without a gate area: it's now
    possible to inline the gate_area functions in the default case.

    Signed-off-by: Andy Lutomirski
    Acked-by: Nathan Lynch
    Acked-by: H. Peter Anvin
    Acked-by: Benjamin Herrenschmidt [in principle]
    Acked-by: Richard Weinberger [for um]
    Acked-by: Will Deacon [for arm64]
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Nathan Lynch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

24 Jun, 2014

1 commit

The mm could be removed from the current task struct, so use the previous
vma->vm_mm instead.

Blackfin crashes after updating to Linux 3.15. The commit "mm:
per-thread vma caching" caused the crash: the mm can be removed from the
current task struct before

mmput() ->
exit_mmap() ->
delete_vma_from_mm()

    the detailed fault information:

    NULL pointer access
    Kernel OOPS in progress
    Deferred Exception context
    CURRENT PROCESS:
    COMM=modprobe PID=278 CPU=0
    invalid mm
    return address: [0x000531de]; contents of:
    0x000531b0: c727 acea 0c42 181d 0000 0000 0000 a0a8
    0x000531c0: b090 acaa 0c42 1806 0000 0000 0000 a0e8
    0x000531d0: b0d0 e801 0000 05b3 0010 e522 0046 [a090]
    0x000531e0: 6408 b090 0c00 17cc 3042 e3ff f37b 2fc8

    CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
    task: 0572b720 ti: 0569e000 task.ti: 0569e000
    Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
    ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
    Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014

    SEQUENCER STATUS: Not tainted
    SEQSTAT: 00000027 IPEND: 8008 IMASK: ffff SYSCFG: 2806
    EXCAUSE : 0x27
    physical IVG3 asserted : { _trap + 0x0 }
    physical IVG15 asserted : { _evt_system_call + 0x0 }
    logical irq 6 mapped : { _bfin_coretmr_interrupt + 0x0 }
    logical irq 7 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 11 mapped : { _l2_ecc_err + 0x0 }
    logical irq 13 mapped : { _bfin_fault_routine + 0x0 }
    logical irq 39 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    logical irq 40 mapped : { _bfin_twi_interrupt_entry + 0x0 }
    RETE: /* Maybe null pointer? */
    RETN: /* kernel dynamic memory (maybe user-space) */
    RETX: /* Maybe fixed code section */
    RETS: { _exit_mmap + 0x28 }
    PC : { _delete_vma_from_mm + 0x92 }
    DCPLB_FAULT_ADDR: /* Maybe null pointer? */
    ICPLB_FAULT_ADDR: { _delete_vma_from_mm + 0x92 }
    PROCESSOR STATE:
    R0 : 00000004 R1 : 0569e000 R2 : 00bf3db4 R3 : 00000000
    R4 : 057f9800 R5 : 00000001 R6 : 0569ddd0 R7 : 0572b720
    P0 : 0572b854 P1 : 00000004 P2 : 00000000 P3 : 0569dda0
    P4 : 0572b720 P5 : 0566c368 FP : 0569fe5c SP : 0569fd74
    LB0: 057f523f LT0: 057f523e LC0: 00000000
    LB1: 0005317c LT1: 00053172 LC1: 00000002
    B0 : 00000000 L0 : 00000000 M0 : 0566f5bc I0 : 00000000
    B1 : 00000000 L1 : 00000000 M1 : 00000000 I1 : ffffffff
    B2 : 00000001 L2 : 00000000 M2 : 00000000 I2 : 00000000
    B3 : 00000000 L3 : 00000000 M3 : 00000000 I3 : 057f8000
    A0.w: 00000000 A0.x: 00000000 A1.w: 00000000 A1.x: 00000000
    USP : 056ffcf8 ASTAT: 02003024

    Hardware Trace:
    0 Target : { _trap_c + 0x0 }
    Source : { _exception_to_level5 + 0xa0 } JUMP.L
    1 Target : { _exception_to_level5 + 0x0 }
    Source : { _bfin_return_from_exception + 0x6 } RTX
    2 Target : { _bfin_return_from_exception + 0x0 }
    Source : { _ex_trap_c + 0x70 } JUMP.S
    3 Target : { _ex_trap_c + 0x0 }
    Source : { _trap + 0x2a } JUMP (P4)
    4 Target : { _trap + 0x0 }
    FAULT : { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
    Source : { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
    5 Target : { _delete_vma_from_mm + 0x8e }
    Source : { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
    6 Target : { _delete_vma_from_mm + 0x0 }
    Source : { _exit_mmap + 0x24 } JUMP.L
    7 Target : { _exit_mmap + 0x1c }
    Source : { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
    8 Target : { _exit_mmap + 0x34 }
    Source : { __cond_resched + 0x20 } RTS
    9 Target : { __cond_resched + 0x0 }
    Source : { _exit_mmap + 0x30 } JUMP.L
    10 Target : { _exit_mmap + 0x30 }
    Source : { _delete_vma + 0xb2 } RTS
    11 Target : { _delete_vma + 0xac }
    Source : { _kmem_cache_free + 0xba } RTS
    12 Target : { _kmem_cache_free + 0xa8 }
    Source : { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
    13 Target : { _kmem_cache_free + 0x92 }
    Source : { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
    14 Target : { _kmem_cache_free + 0x34 }
    Source : { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
    15 Target : { _kmem_cache_free + 0x0 }
    Source : { _delete_vma + 0xa8 } JUMP.L
    Kernel Stack
    Stack info:
    SP: [0x0569ff24] /* kernel dynamic memory (maybe user-space) */
    Memory from 0x0569ff20 to 056a0000
    0569ff20: 00000001 [04e8da5a] 00008000 00000000 00000000 056a0000 04e8da5a 04e8da5a
    0569ff40: 04eb9eea ffa00dce 02003025 04ea09c5 057f523f 04ea09c4 057f523e 00000000
    0569ff60: 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000000
    0569ff80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    0569ffa0: 0566f5bc 057f8000 057f8000 00000001 04ec0170 056ffcf8 056ffd04 057f9800
    0569ffc0: 04d1d498 057f9800 057f8fe4 057f8ef0 00000001 057f928c 00000001 00000001
    0569ffe0: 057f9800 00000000 00000008 00000007 00000001 00000001 00000001
    Return addresses in stack:
    address : { _show_cpuinfo + 0x2d2 }
    Modules linked in:
    Kernel panic - not syncing: Kernel exception
    [ end Kernel panic - not syncing: Kernel exception

    Signed-off-by: Steven Miao
    Acked-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: [3.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Miao
     

07 Jun, 2014

1 commit

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.
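
    A userspace sketch of the pr_fmt pattern the patch adds (the "nommu: "
    prefix and the macro bodies here are illustrative stand-ins, not the
    kernel's actual definitions): pr_fmt() is defined before the pr_* macros,
    so every message is automatically tagged with its module's prefix.

    ```c
    #include <stdio.h>

    /* pr_fmt() must be defined before the pr_* macros that expand it. */
    #define pr_fmt(fmt) "nommu: " fmt

    #define pr_info(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)
    #define pr_warn(fmt, ...) fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

    int main(void)
    {
        /* the prefix is prepended at compile time, per translation unit */
        pr_info("mapping %d pages\n", 4);
        return 0;
    }
    ```

    This is why the patch can drop hand-written prefixes from individual
    print statements: the macro adds the prefix once, consistently.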

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     

08 Apr, 2014

4 commits

  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
To increase compiler portability there is <linux/compiler.h>, which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes with
the right macro in the memory management (/mm) subsystem.
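
    A minimal sketch of what one such convenience macro expands to, assuming a
    gcc-compatible compiler (the function name is made up for illustration):

    ```c
    #include <stdio.h>

    /* __weak expands to the gcc attribute, so another object file can
     * override this default definition at link time. */
    #define __weak __attribute__((weak))

    __weak int platform_hook(void)
    {
        return 0;   /* default used when no strong override is linked in */
    }

    int main(void)
    {
        printf("platform_hook() = %d\n", platform_hook());
        return 0;
    }
    ```

    Linking in a second file with a non-weak platform_hook() would silently
    replace the default, which is the point of the attribute.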

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
This patch is a continuation of efforts to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
so further comparison with other approaches was needed. There are
two things to consider when dealing with this: the cache hit rate and
the latency of find_vma(). Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% baseline hit rate just by adding a few more
    slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's are
    just about non-existent. The cycle counts can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if the
    filesystem uses filemap_fault() for ->fault().
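    In practice the wiring is a one-line addition to a filesystem's
    vm_operations_struct; mm/filemap.c's generic_file_vm_ops pairs the two
    helpers exactly this way (kernel-side sketch, not buildable standalone):

```c
/* A page-cache-backed filesystem that already faults through
 * filemap_fault() can adopt the batched fault-around path too. */
static const struct vm_operations_struct generic_file_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,	/* map nearby cached pages in one go */
};
```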

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

22 Jan, 2014

1 commit

  • Some applications that run on HPC clusters are designed around the
    availability of RAM, and the overcommit ratio is fine-tuned to get the
    maximum usage of memory without swapping. With growing memory sizes, the
    1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
    for these workloads (on a 2TB machine it represents no less than 20GB).

    This patch adds the new overcommit_kbytes sysctl variable, which allows
    a much finer grain.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

13 Nov, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton: (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • The same calculation is currently done in three different places.
    Factor that code out so future changes have to be made in only one
    place.

    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

25 Oct, 2013

1 commit


11 Jul, 2013

1 commit

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jul, 2013

2 commits

  • Now all references to num_physpages have been removed, so kill it.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Jiang Liu
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Al Viro
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • vwrite() checks for overflow. vread() should do the same thing.

    Since vwrite() checks the source buffer address, vread() should check
    the destination buffer address.

    Signed-off-by: Chen Gang
    Cc: Al Viro
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

30 Apr, 2013

3 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root cannot log in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interface (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch -- which creates a tunable user reserve
    that is only used in overcommit 'never' mode -- is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on the whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because this
    mode forces us to ensure we can fulfill all of the requested memory
    allocations -- even if the programs only use a fraction of what they ask
    for. On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutely safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcommit 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, and reclaimable slab pages into consideration. In other
    words, all of these pages are available, so why isn't 'guess' suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the user to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
    guess      | yes  |  1   |   5432/5432   |  no   |      yes      |      yes
    guess      | yes  |  4   |   5444/5444   |  1    |      yes      |      yes
    guess      | no   |  1   |   5302/5449   |  no   |      yes      |      yes
    guess      | no   |  4   |       -       | crash |      no       |      no

    never      | yes  |  1   |   5460/5460   |  1    |      yes      |      yes
    never      | yes  |  4   |   5460/5460   |  1    |      yes      |      yes
    never      | no   |  1   |   5218/5432   |  no   |      no       |      yes
    never      | no   |  4   |   5203/5448   |  no   |      no       |      yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  |  User Recovery  | Admin Recovery
    ---------- | ---- | ---- | ------------- | ----- | --------------- | --------------
    guess      | yes  |  1   |   5419/5419   |  no   |   -     yes     |   8MB   yes
    guess      | yes  |  4   |   5436/5436   |  1    |   -     yes     |   8MB   yes
    guess      | no   |  1   |   5440/5440   |  *    |   -     yes     |   8MB   yes
    guess      | no   |  4   |       -       | crash |   -     no      |   8MB   no

    * process would successfully mlock, then the oom killer would pick it

    never      | yes  |  1   |   5446/5446   |  no   |  10MB   yes     |  20MB   yes
    never      | yes  |  4   |   5456/5456   |  no   |  10MB   yes     |  20MB   yes
    never      | no   |  1   |   5387/5429   |  no   | 128MB   no      |   8MB   barely
    never      | no   |  1   |   5323/5428   |  no   | 226MB   barely  |   8MB   barely
    never      | no   |  1   |   5323/5428   |  no   | 226MB   barely  |   8MB   barely

    never      | no   |  1   |   5359/5448   |  no   |  10MB   no      |  10MB   barely

    never      | no   |  1   |   5323/5428   |  no   |   0MB   no      |  10MB   barely
    never      | no   |  1   |   5332/5428   |  no   |   0MB   no      |  50MB   yes
    never      | no   |  1   |   5293/5429   |  no   |   0MB   no      |  90MB   yes

    never      | no   |  1   |   5001/5427   |  no   | 230MB   yes     | 338MB   yes
    never      | no   |  4*  |   4998/5424   |  no   | 230MB   yes     | 338MB   yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Although our intention is to unexport internal structures entirely,
    there is one exception: kexec. kexec dumps the address of vmlist, and
    makedumpfile uses this information.

    We are about to remove vmlist, so another way to retrieve information
    about the vmalloc layer is needed for makedumpfile. For this purpose,
    we export vmap_area_list instead of vmlist.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Eric Biederman
    Cc: Dave Anderson
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

28 Apr, 2013

1 commit

  • I think we could just move the full vm_iomap_memory() function into
    util.h or similar, but I didn't get any reply from anybody actually
    using nommu even to this trivial patch, so I'm not going to touch it any
    more than required.

    Here's the fairly minimal stub to make the nommu case at least
    potentially work. It doesn't seem like anybody cares, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Apr, 2013

1 commit

  • find_vma() can be called by multiple threads with the read lock
    held on mm->mmap_sem, and any of them can update mm->mmap_cache.
    Prevent the compiler from re-fetching mm->mmap_cache, because other
    readers could update it in the meantime:

    thread 1                                thread 2
                                            |
    find_vma()                              |  find_vma()
      struct vm_area_struct *vma = NULL;    |
      vma = mm->mmap_cache;                 |
      if (!(vma && vma->vm_end > addr       |
            && vma->vm_start <= addr)) {    |
                                            |    mm->mmap_cache = vma;
      return vma;                           |
       ^^ compiler may optimize this        |
          local variable out and re-read    |
          mm->mmap_cache                    |

    This issue can be reproduced with gcc-4.8.0-1 on s390x by running
    mallocstress testcase from LTP, which triggers:

    kernel BUG at mm/rmap.c:1088!
    Call Trace:
    ([] 0x3d100c57000)
    [] do_wp_page+0x2fc/0xa88
    [] handle_pte_fault+0x41a/0xac8
    [] handle_mm_fault+0x17a/0x268
    [] do_protection_exception+0x1e2/0x394
    [] pgm_check_handler+0x138/0x13c
    [] 0x3fffcf1f07a
    Last Breaking-Event-Address:
    [] page_add_new_anon_rmap+0xc2/0x168

    Thanks to Jakub Jelinek for his insight on gcc and helping to
    track this down.

    Signed-off-by: Jan Stancek
    Acked-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

04 Mar, 2013

1 commit

  • The extern in sys_sparc_64.c was a relic of the time when do_mremap()
    existed in the MMU case (it doesn't anymore). As for the !MMU one,
    nothing uses it outside of mm/nommu.c...

    Signed-off-by: Al Viro

    Al Viro
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds