13 Mar, 2010

4 commits

  • Add a generic implementation of the old mmap() syscall, which expects its
    arguments packed in a memory block, and switch all architectures over to
    use it.
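
    A minimal sketch of the shape of such a generic implementation (the names
    follow the usual kernel pattern; treat the details as illustrative rather
    than as the exact patch):

    struct mmap_arg_struct {
            unsigned long addr;
            unsigned long len;
            unsigned long prot;
            unsigned long flags;
            unsigned long fd;
            unsigned long offset;
    };

    SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
    {
            struct mmap_arg_struct a;

            /* the single argument points at a block holding all six values */
            if (copy_from_user(&a, arg, sizeof(a)))
                    return -EFAULT;
            if (a.offset & ~PAGE_MASK)
                    return -EINVAL;

            return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
                                  a.offset >> PAGE_SHIFT);
    }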

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: Greg Ungerer
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)
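
    A simplified sketch of what such a helper can look like, hand-written here
    from the example output above rather than copied from the patch (the
    symbolic flag names are assumed to come from a companion helper):

    void dump_page(struct page *page)
    {
            printk(KERN_ALERT
                   "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
                   page, page_count(page), page_mapcount(page),
                   page->mapping, page->index);
            dump_page_flags(page->flags);   /* emits the "page flags:" line */
    }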

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Commit 34e55232e59f7b19050267a05ff1226e5cd122a5 ("mm: avoid false sharing
    of mm_counter") added sync_mm_rss() for syncing loosely accounted rss
    counters. It is meant for CONFIG_MMU, but sync_mm_rss() is called even in
    NOMMU environments (kernel/exit.c, fs/exec.c). The above commit doesn't
    handle this well.

    This patch makes
    SPLIT_RSS_COUNTING depend on SPLIT_PTLOCKS && CONFIG_MMU.

    To avoid unnecessary function calls, sync_mm_rss() becomes an inline no-op
    function in the header file when the counting is disabled.
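
    The header ends up roughly of this form (sketch):

    #if defined(SPLIT_RSS_COUNTING)
    void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
    #else
    static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm)
    {
    }
    #endif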

    Reported-by: David Howells
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Mike Frysinger
    Signed-off-by: Michal Simek
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

08 Mar, 2010

2 commits

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker moves const data into .rodata
    and therefore excludes it from false sharing
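
    For an individual user the change amounts to adding const to the ops
    definition; a made-up example (the foo_* names are placeholders, not code
    from the patch):

    static ssize_t foo_show(struct kobject *kobj, struct attribute *attr,
                            char *buf);
    static ssize_t foo_store(struct kobject *kobj, struct attribute *attr,
                             const char *buf, size_t count);

    static const struct sysfs_ops foo_sysfs_ops = {
            .show  = foo_show,
            .store = foo_store,
    };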

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     
  • Constify struct kset_uevent_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker moves const data into .rodata
    and therefore excludes it from false sharing

    Signed-off-by: Emese Revfy
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

07 Mar, 2010

31 commits

  • swap_duplicate()'s loop appears to miss out on returning the error code
    from __swap_duplicate(), except when that's -ENOMEM. In fact this is
    intentional: prior to -ENOMEM for swap_count_continuation,
    swap_duplicate() was void (and the case only occurs when copy_one_pte()
    hits a corrupt pte). But that's surprising behaviour, which certainly
    deserves a comment.

    Signed-off-by: Hugh Dickins
    Reported-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The noMMU version of get_user_pages() fails to pin the last page when the
    start address isn't page-aligned. The patch fixes this in a way that
    makes find_extend_vma() congruent to its MMU cousin.

    Signed-off-by: Steven J. Magnani
    Acked-by: Paul Mundt
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Magnani
     
  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters such a page, the flag is set and the page is
    cycled around the inactive list once more. Only if it comes back with
    another page table reference is it activated. Otherwise it is reclaimed
    as 'not recently used cache'.

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.
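
    In rough pseudo-C, the per-page decision on the inactive list becomes
    something like the following (the helper context and return values are
    illustrative, not the literal patch):

    if (referenced_ptes) {
            if (PageReferenced(page))
                    /* second lap with new references: genuinely in use */
                    return PAGEREF_ACTIVATE;
            /* first encounter: mark it and give it one more inactive lap */
            SetPageReferenced(page);
            return PAGEREF_KEEP;
    }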

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_mapping_inuse() is a historic predicate function for pages that are
    about to be reclaimed or deactivated.

    According to it, a page is in use when it is mapped into page tables OR
    part of swap cache OR backing an mmapped file.

    This function is used in combination with page_referenced(), which checks
    for young bits in ptes and the page descriptor itself for the
    PG_referenced bit. Thus, checking for unmapped swap cache pages is
    meaningless as PG_referenced is not set for anonymous pages and unmapped
    pages do not have young ptes. The test makes no difference.

    Protecting file pages that are not by themselves mapped but are part of a
    mapped file is also a historic leftover for short-lived things like the
    exec() code in libc. However, the VM now does reference accounting and
    activation of pages at unmap time and thus the special treatment on
    reclaim is obsolete.

    This patch drops page_mapping_inuse() and switches the two callsites to
    use page_mapped() directly.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The used-once mapped file page detection patchset.

    It is meant to help workloads with large amounts of shortly used file
    mappings, like rtorrent hashing a file or git when dealing with loose
    objects (git gc on a bigger site?).

    Right now, the VM activates referenced mapped file pages on first
    encounter on the inactive list and it takes a full memory cycle to
    reclaim them again. When those pages dominate memory, the system
    no longer has a meaningful notion of 'working set' and is required
    to give up the active list to make reclaim progress. Obviously,
    this results in rather bad scanning latencies and the wrong pages
    being reclaimed.

    This patch makes the VM be more careful about activating mapped file
    pages in the first place. The minimum granted lifetime without
    another memory access becomes an inactive list cycle instead of the
    full memory cycle, which is more natural given the mentioned loads.

    This test resembles a hashing rtorrent process. Sequentially, 32MB
    chunks of a file are mapped into memory, hashed (sha1) and unmapped
    again. While this happens, every 5 seconds a process is launched and
    its execution time taken:

    python2.4 -c 'import pydoc'
    old: max=2.31s mean=1.26s (0.34)
    new: max=1.25s mean=0.32s (0.32)

    find /etc -type f
    old: max=2.52s mean=1.44s (0.43)
    new: max=1.92s mean=0.12s (0.17)

    vim -c ':quit'
    old: max=6.14s mean=4.03s (0.49)
    new: max=3.48s mean=2.41s (0.25)

    mplayer --help
    old: max=8.08s mean=5.74s (1.02)
    new: max=3.79s mean=1.32s (0.81)

    overall hash time (stdev):
    old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
    new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)

    I also tested kernbench with regular IO streaming in the background to
    see whether the delayed activation of frequently used mapped file
    pages had a negative impact on performance in the presence of pressure
    on the inactive list. The patch made no significant difference in
    timing, neither for kernbench nor for the streaming IO throughput.

    The first patch submission raised concerns about the cost of the extra
    faults for actually activated pages on machines that have no hardware
    support for young page table entries.

    I created an artificial worst case scenario on an ARM machine with
    around 300MHz and 64MB of memory to figure out the dimensions
    involved. The test would mmap a file of 20MB, then

    1. touch all its pages to fault them in
    2. force one full scan cycle on the inactive file LRU
    -- old: mapping pages activated
    -- new: mapping pages inactive
    3. touch the mapping pages again
    -- old and new: fault exceptions to set the young bits
    4. force another full scan cycle on the inactive file LRU
    5. touch the mapping pages one last time
    -- new: fault exceptions to set the young bits

    The test showed an overall increase of 6% in time over 100 iterations
    of the above (old: ~212sec, new: ~225sec). 13 secs total overhead /
    (100 * 5k pages), ignoring the execution time of the test itself,
    makes for about 25us overhead for every page that gets actually
    activated. Note:

    1. File mapping the size of one third of main memory, _completely_
    in active use across memory pressure - i.e., most pages referenced
    within one LRU cycle. This should be rare to non-existent,
    especially on such embedded setups.

    2. Many huge activation batches. Those batches only occur when the
    working set fluctuates. If it changes completely between every full
    LRU cycle, you have problematic reclaim overhead anyway.

    3. Access of activated pages at maximum speed: sequential loads from
    every single page without doing anything in between. In reality,
    the extra faults will get distributed between actual operations on
    the data.

    So even if a workload manages to get the VM into the situation of
    activating a third of memory in one go on such a setup, it will take
    2.2 seconds instead of 2.1 without the patch.

    Comparing the numbers (and my user-experience over several months),
    I think this change is an overall improvement to the VM.

    Patch 1 is only refactoring to break up that ugly compound conditional
    in shrink_page_list() and make it easy to document and add new checks
    in a readable fashion.

    Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not
    strictly related to #3, but it was in the original submission and is a
    net simplification, so I kept it.

    Patch 3 implements used-once detection of mapped file pages.

    This patch:

    Moving the big conditional into its own predicate function makes the code
    a bit easier to read and allows for better commenting on the checks
    one-by-one.

    This is just cleaning up, no semantics should have been changed.
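
    The refactoring boils down to a predicate of roughly this shape (a sketch;
    the exact names may differ):

    enum page_references {
            PAGEREF_RECLAIM,
            PAGEREF_RECLAIM_CLEAN,
            PAGEREF_KEEP,
            PAGEREF_ACTIVATE,
    };

    static enum page_references page_check_references(struct page *page,
                                                      struct scan_control *sc);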

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.
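
    The printing loop then becomes, schematically (the exact format string
    follows the existing output):

    if (arch_zone_lowest_possible_pfn[i] == arch_zone_highest_possible_pfn[i])
            printk("  %-8s empty\n", zone_names[i]);
    else
            printk("  %-8s %#010lx -> %#010lx\n", zone_names[i],
                   arch_zone_lowest_possible_pfn[i],
                   arch_zone_highest_possible_pfn[i]);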

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.
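
    Conceptually (the helper names here are illustrative, not necessarily the
    ones added by the patch):

    static gfp_t saved_gfp_mask;

    void restrict_gfp_for_pm(void)          /* before devices are suspended */
    {
            saved_gfp_mask = gfp_allowed_mask;
            gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
    }

    void restore_gfp_after_pm(void)         /* after devices are resumed */
    {
            gfp_allowed_mask = saved_gfp_mask;
    }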

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • There's an off-by-one disagreement between mkswap and swapon about the
    meaning of swap_header last_page: mkswap (in all versions I've looked at:
    util-linux-ng and BusyBox and old util-linux; probably as far back as
    1999) consistently means the offset (in page units) of the last page of
    the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
    strangely takes it to mean the size (in page units) of the swap area.

    This disagreement is the safe way round; but it's worrying people, and
    loses us one page of swap.

    The fix is not just to add one to nr_good_pages: we need to get maxpages
    (the size of the swap_map array) right before that; and though that is an
    unsigned long, be careful not to overflow the unsigned int p->max which
    later holds it (probably why header uses __u32 last_page instead of size).

    Why did we subtract one from the maximum swp_offset to calculate maxpages?
    Though it was probably me who made that change in 2.4.10, I don't get it:
    and now we should be adding one (without risk of overflow in this case).

    Fix the handling of swap_header badpages: it could have overrun the
    swap_map when very large swap area used on a more limited architecture.

    Remove pre-initializations of swap_header, nr_good_pages and maxpages:
    those date from when sys_swapon was supporting other versions of header.

    Reported-by: Nitin Gupta
    Reported-by: Jarkko Lavinen
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When the parent process breaks the COW on a page, both the original page,
    which remains mapped in the child, and the new page, which is mapped in
    the parent, end up in that same anon_vma. Generally this won't be a
    problem, but for some workloads it can preserve the O(N) rmap scanning
    complexity.

    A simple fix is to ensure that, when a page that is only mapped in the
    child gets reused in do_wp_page() (because we are already its exclusive
    owner), the page is moved to the child's own exclusive anon_vma.
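
    The fix is essentially one call on the do_wp_page() reuse path (a sketch;
    surrounding locking and checks omitted):

    /* we are the exclusive owner: re-home the page to this vma's anon_vma */
    if (reuse_swap_page(old_page))
            page_move_anon_rmap(old_page, vma, address);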

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When an anonymous page is inherited from a parent process, the
    vma->anon_vma can differ from the page anon_vma. This can trip up
    __page_check_anon_rmap, which is indirectly called from do_swap_page().

    Remove that obsolete check to prevent an oops.

    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parent's anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.
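
    The linking object is essentially a small chain element tying one VMA to
    one anon_vma (shown here in simplified form):

    struct anon_vma_chain {
            struct vm_area_struct *vma;      /* the VMA this link belongs to */
            struct anon_vma *anon_vma;       /* one of the VMA's anon_vmas */
            struct list_head same_vma;       /* list of links for this VMA */
            struct list_head same_anon_vma;  /* list of links for this anon_vma */
    };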

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time; there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer

    Signed-off-by: Thiago Farina
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Farina
     
  • This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

    POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
    a 16K read will be carried out in 4 _sync_ 1-page reads.

    In other places, ra_pages==0 means
    - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
    - some IO error happened
    where multi-page read IO won't help or should be avoided.

    POSIX_FADV_RANDOM actually wants different semantics: disable the
    *heuristic* readahead algorithm and use a dumb one that faithfully submits
    read IO for whatever the application requests.

    So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

    Note that the random hint is not likely to help random reads performance
    noticeably. And it may be too permissive on huge request size (its IO
    size is not limited by read_ahead_kb).

    In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
    (NFS read) performance of the application increased by 313%!
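
    The mechanism, roughly: fadvise() sets the flag on the struct file, and
    the sync readahead entry point bypasses the heuristics when it sees it (a
    sketch, not the literal diff):

    /* mm/fadvise.c */
    case POSIX_FADV_RANDOM:
            spin_lock(&file->f_lock);
            file->f_mode |= FMODE_RANDOM;
            spin_unlock(&file->f_lock);
            break;

    /* mm/readahead.c */
    if (filp && (filp->f_mode & FMODE_RANDOM)) {
            /* no heuristics: submit exactly what the application asked for */
            force_page_cache_readahead(mapping, filp, offset, req_size);
            return;
    }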

    Tested-by: Quentin Barnes
    Signed-off-by: Wu Fengguang
    Cc: Nick Piggin
    Cc: Andi Kleen
    Cc: Steven Whitehouse
    Cc: David Howells
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Chuck Lever
    Cc: [2.6.33.x]
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • commit 01b1ae63c2 ("memcg: simple migration handling") removed
    mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local
    variable `anon' is now unused.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, do_migrate_pages() has a very long comment that is not
    indented properly. It is easy to mistake it for the function's opening
    comment and get confused by it.

    This patch fixes the indentation.

    Note: this patch doesn't break the 80-column rule. I guess the original
    author intended this indentation, but an accident corrupted it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A memmap is a directory in sysfs which includes 3 text files: start, end
    and type. For example:

    start: 0x100000
    end: 0x7e7b1cff
    type: System RAM

    The firmware_map_add() interface was not called explicitly. Remove it and
    add firmware_map_add_hotplug() as the hotplug interface of memmap.

    Each memory entry has a memmap in sysfs, but when we hot-add new memory,
    sysfs does not export a memmap entry for it. We add a call to
    firmware_map_add_hotplug() in add_memory().

    Add a new function, add_sysfs_fw_map_entry(), to create the memmap entry;
    it will be called both when initializing the memmap and when hot-adding
    memory.

    [akpm@linux-foundation.org: un-kernedoc a no longer kerneldoc comment]
    Signed-off-by: Shaohui Zheng
    Acked-by: Andi Kleen
    Acked-by: Yasunori Goto
    Reviewed-by: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Strangely, the current mbind() doesn't merge a VMA with neighboring VMAs
    even though that is possible. Unfortunately, having many VMAs can reduce
    performance...

    This patch fixes it.

    Reproducer program:
    ----------------------------------------------------------------
    /* headers inferred from the code below; the original header names were
       lost when the message was rendered */
    #include <numa.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static unsigned long pagesize;

    int main(int argc, char** argv)
    {
        void* addr;
        int ch;
        int node;
        struct bitmask *nmask = numa_allocate_nodemask();
        int err;
        int node_set = 0;
        char buf[128];

        while ((ch = getopt(argc, argv, "n:")) != -1){
            switch (ch){
            case 'n':
                node = strtol(optarg, NULL, 0);
                numa_bitmask_setbit(nmask, node);
                node_set = 1;
                break;
            default:
                ;
            }
        }
        argc -= optind;
        argv += optind;

        if (!node_set)
            numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE, 0, 0);
        if (addr == MAP_FAILED)
            perror("mmap "), exit(1);

        fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);

        /* make page populate */
        memset(addr, 0, pagesize*3);

        /* first mbind */
        err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
                    nmask->size, MPOL_MF_MOVE_ALL);
        if (err)
            perror("mbind1 "), exit(1);

        /* second mbind */
        err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
        if (err)
            perror("mbind2 "), exit(1);

        sprintf(buf, "cat /proc/%d/maps", getpid());
        system(buf);

        return 0;
    }
    ----------------------------------------------------------------

    result without this patch

    addr = 0x7fe26ef09000
    [snip]
    7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
    7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
    7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
    7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0

    => 0x7fe26ef09000-0x7fe26ef0c000 has three VMAs.

    result with this patch

    addr = 0x7fc9ebc76000
    [snip]
    7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
    7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack]

    => 0x7fc9ebc76000-0x7fc9ebc7a000 has only one VMA.

    [minchan.kim@gmail.com: fix file offset passed to vma_merge()]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member into a bit flag. But this had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and adding atomic ops to it can reduce kernel performance a bit.

    Thus, this patch partially reverts that commit; at the least,
    all_unreclaimable shouldn't share a memory word with the other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.
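
    The wrapper being removed is essentially:

    void free_hot_page(struct page *page)
    {
            free_hot_cold_page(page, 0);
    }

    so the remaining callers can call free_hot_cold_page(page, 0) directly.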

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • Move a call of trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
    as it is done in function __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • trace_mm_page_free_direct() is called in __free_pages(). But it is called
    again in free_hot_page() if order == 0, which produces duplicate records
    in the trace file for the mm_page_free_direct event, as below:

    K-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
    context annotation, but it didn't annotate zone reclaim. This patch does
    so.

    The point is that commit cf40bd16fd annotated __alloc_pages_direct_reclaim(),
    but zone reclaim doesn't go through __alloc_pages_direct_reclaim().

    The current call graph is:

    __alloc_pages_nodemask
        get_page_from_freelist
            zone_reclaim()
        __alloc_pages_slowpath
            __alloc_pages_direct_reclaim
                try_to_free_pages

    Actually, if zone_reclaim_mode=1, the VM never calls
    __alloc_pages_direct_reclaim() under usual VM pressure.
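
    The fix wraps the zone reclaim path with the same annotation used on the
    direct reclaim path, along these lines (sketch):

    static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
    {
            /* ... existing setup ... */
            lockdep_set_current_reclaim_state(gfp_mask);
            /* ... shrink_zone()/shrink_slab() as before ... */
            lockdep_clear_current_reclaim_state();
            /* ... existing teardown ... */
    }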

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() should contain all scan-ratio-related calculations.
    Thus, this patch moves some calculations into get_scan_ratio().

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Kswapd checks that zone has sufficient pages free via zone_watermark_ok().

    If any zone doesn't have enough pages, we set all_zones_ok to zero.
    !all_zones_ok makes kswapd retry rather than sleeping.

    I think the watermark check before shrink_zone() is pointless. Only after
    kswapd has tried to shrink the zone is the check meaningful.

    Move the check to after the call to shrink_zone().

    [akpm@linux-foundation.org: fix comment, layout]
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • Currently, mlock_vma_pages_range() only returns len or 0, so the current
    error handling in mmap_region() is meaninglessly complex.

    This patch simplifies it and makes it consistent with the brk() code.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, mlock_vma_pages_range() never returns a negative value, so we
    can remove some worthless error checks.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • A frequent question from users about memory management is how many swap
    entries are used by each process, and this information can give some
    hints to the OOM killer.

    Although we can count the number of swap entries per process by scanning
    /proc/<pid>/smaps, this is very slow and not suitable for the usual
    process-information tools such as 'ps' or 'top' (which are slow enough
    already).

    This patch adds a swap-entry counter to mm_counter and updates it at each
    swap event. The information is exported via the /proc/<pid>/status file as

    [kamezawa@bluextal memory]$ cat /proc/self/status
    Name: cat
    State: R (running)
    Tgid: 2910
    Pid: 2910
    PPid: 2823
    TracerPid: 0
    Uid: 500 500 500 500
    Gid: 500 500 500 500
    FDSize: 256
    Groups: 500
    VmPeak: 82696 kB
    VmSize: 82696 kB
    VmLck: 0 kB
    VmHWM: 432 kB
    VmRSS: 432 kB
    VmData: 172 kB
    VmStk: 84 kB
    VmExe: 48 kB
    VmLib: 1568 kB
    VmPTE: 40 kB
    VmSwap: 0 kB
    Reviewed-by: Minchan Kim
    Reviewed-by: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering the nature of per-mm stats, they are an object shared among
    threads and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter. RSS values are
    accumulated in a struct in task_struct and synchronized with the mm's
    counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault(): the
    per-thread values are folded into the mm every 64 calls.

    A rough estimate with a small benchmark on parallel threads (2 threads)
    shows
    [before]
    4.5 cache-misses/fault
    [after]
    4.0 cache-misses/fault
    Anyway, the most contended object is mmap_sem once the number of threads
    grows.
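
    The cache is a small per-task structure, folded back into the mm after a
    fixed number of faults (a sketch; the constant and sync helper names may
    differ from the patch):

    struct task_rss_stat {
            int events;                  /* bumped on each handle_mm_fault() */
            int count[NR_MM_COUNTERS];   /* loosely accumulated deltas */
    };

    /* in the fault path, roughly: */
    if (unlikely(++current->rss_stat.events >= 64))
            sync_mm_rss(current, mm);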

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies them to be
    - defined in mm.h as inline functions
    - backed by an array instead of macro-generated names.

    This patch is meant to reduce the size of a future patch that modifies the
    per-mm counter implementation.
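
    A sketch of the resulting form in mm.h (simplified; the real code also has
    a plain-long variant for the non-atomic configuration):

    enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
    };

    static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
    {
            return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
    }

    static inline void inc_mm_counter(struct mm_struct *mm, int member)
    {
            atomic_long_inc(&mm->rss_stat.count[member]);
    }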

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

06 Mar, 2010

1 commit

  • * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    SLUB: Fix per-cpu merge conflict
    failslab: add ability to filter slab caches
    slab: fix regression in touched logic
    dma kmalloc handling fixes
    slub: remove impossible condition
    slab: initialize unused alien cache entry as NULL at alloc_alien_cache().
    SLUB: Make slub statistics use this_cpu_inc
    SLUB: this_cpu: Remove slub kmem_cache fields
    SLUB: Get rid of dynamic DMA kmalloc cache allocation
    SLUB: Use this_cpu operations in slub

    Linus Torvalds
     

05 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    init: Open /dev/console from rootfs
    mqueue: fix typo "failues" -> "failures"
    mqueue: only set error codes if they are really necessary
    mqueue: simplify do_open() error handling
    mqueue: apply mathematics distributivity on mq_bytes calculation
    mqueue: remove unneeded info->messages initialization
    mqueue: fix mq_open() file descriptor leak on user-space processes
    fix race in d_splice_alias()
    set S_DEAD on unlink() and non-directory rename() victims
    vfs: add NOFOLLOW flag to umount(2)
    get rid of ->mnt_parent in tomoyo/realpath
    hppfs can use existing proc_mnt, no need for do_kern_mount() in there
    Mirror MS_KERNMOUNT in ->mnt_flags
    get rid of useless vfsmount_lock use in put_mnt_ns()
    Take vfsmount_lock to fs/internal.h
    get rid of insanity with namespace roots in tomoyo
    take check for new events in namespace (guts of mounts_poll()) to namespace.c
    Don't mess with generic_permission() under ->d_lock in hpfs
    sanitize const/signedness for udf
    nilfs: sanitize const/signedness in dealing with ->d_name.name
    ...

    Fix up fairly trivial (famous last words...) conflicts in
    drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c

    Linus Torvalds
     

04 Mar, 2010

1 commit

  • The slab tree adds a percpu variable usage case (commit
    9dfc6e68bfe6ee452efb1a4e9ca26a9007f2b864 "SLUB: Use this_cpu operations in
    slub"), but the percpu tree removes the prefixing of percpu variables (commit
    dd17c8f72993f9461e9c19250e3f155d6d99df22 "percpu: remove per_cpu__ prefix"),
    thus causing the following compilation error:

    CC mm/slub.o
    mm/slub.c: In function ‘alloc_kmem_cache_cpus’:
    mm/slub.c:2078: error: implicit declaration of function ‘per_cpu_var’
    mm/slub.c:2078: warning: assignment makes pointer from integer without a cast
    make[1]: *** [mm/slub.o] Error 1

    Signed-off-by: Pekka Enberg

    Stephen Rothwell