06 Feb, 2008

40 commits

  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.
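
    The pairing described above can be sketched in user space. This is only
    an analogy, not the kernel patch: C11 fences stand in for smp_wmb and
    smp_rmb, and struct demo_page and both demo_* functions are hypothetical
    names invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 64

/* Hypothetical stand-in for a page: contents plus an "uptodate" flag. */
struct demo_page {
    char data[DEMO_PAGE_SIZE];
    atomic_bool uptodate;
};

/* Writer side: fill the contents, then publish the flag. The release fence
 * plays the role the text assigns to smp_wmb before SetPageUptodate. */
static void demo_set_page_uptodate(struct demo_page *p, const char *src, size_t len)
{
    assert(len <= DEMO_PAGE_SIZE);
    memcpy(p->data, src, len);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->uptodate, true, memory_order_relaxed);
}

/* Reader side: test the flag, then read the contents. The acquire fence
 * plays the role the text assigns to smp_rmb after PageUptodate. */
static bool demo_read_page(struct demo_page *p, char *dst, size_t len)
{
    assert(len <= DEMO_PAGE_SIZE);
    if (!atomic_load_explicit(&p->uptodate, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    memcpy(dst, p->data, len);
    return true;
}
```

    Without the two fences, a reader on another CPU could observe the flag
    as set and still copy stale contents, which is the race the commit
    message describes.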

    Many places that test PageUptodate, do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • An orphaned page may still have fs-private metadata after it has been
    truncated. Because the page has no mapping, page migration refuses to
    migrate it. It appears the page is only freed in page reclaim, and if the
    zone watermark is low the page is never freed, so migration always fails.
    Free the metadata so that such a page can be freed during migration,
    which makes migration more reliable.

    [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
    Signed-off-by: Shaohua Li
    Acked-by: Nick Piggin
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Although lower_zone_protection was changed to lowmem_reserve_ratio, the
    documentation was not updated. lowmem_reserve_ratio is quite hard to
    estimate, and there is no guidance for it. This patch updates the
    documentation accordingly.

    Signed-off-by: Yasunori Goto
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     
  • I've written some test programs in the ltp project. While writing them I
    met a problem which I cannot solve in user land. So I wrote a patch for linux
    kernel. Please, include this patch if acceptable.

    The test program tests the 4th parameter of fadvise64_64:

    long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);

    My test case calls fadvise64_64 with invalid advice value and checks errno is
    set to EINVAL. About the advice parameter man page says:

    ...
    Permissible values for advice include:

    POSIX_FADV_NORMAL
    ...
    POSIX_FADV_SEQUENTIAL
    ...
    POSIX_FADV_RANDOM
    ...
    POSIX_FADV_NOREUSE
    ...
    POSIX_FADV_WILLNEED
    ...
    POSIX_FADV_DONTNEED
    ...
    ERRORS
    ...
    EINVAL An invalid value was specified for advice.

    However, I got a bug report that the system call invocations
    in my test case returned 0 unexpectedly.

    I've inspected the kernel code:

    asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
    {
        struct file *file = fget(fd);
        struct address_space *mapping;
        struct backing_dev_info *bdi;
        loff_t endbyte; /* inclusive */
        pgoff_t start_index;
        pgoff_t end_index;
        unsigned long nrpages;
        int ret = 0;

        if (!file)
            return -EBADF;

        if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
            ret = -ESPIPE;
            goto out;
        }

        mapping = file->f_mapping;
        if (!mapping || len < 0) {
            ret = -EINVAL;
            goto out;
        }

        if (mapping->a_ops->get_xip_page)
            /* no bad return value, but ignore advice */
            goto out;
        ...
    out:
        fput(file);
        return ret;
    }

    I found the advice parameter is just ignored in the case
    mapping->a_ops->get_xip_page is given. This behavior is different from
    what is written on the man page. Is this o.k.?

    get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
    Anyway I cannot find the easy way to detect get_xip_page
    field is given or CONFIG_EXT2_FS_XIP is true from the
    user space.

    I propose the following patch which checks the advice parameter
    even if get_xip_page is given.
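
    The patch itself is not reproduced in this log. As a rough user-space
    sketch of the kind of check being proposed, assuming a Linux libc that
    defines the POSIX_FADV_* constants in <fcntl.h>: demo_check_advice and
    demo_fadvise_xip are hypothetical names, and the second one only models
    the early-exit XIP path.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: 0 for the six advice values listed in the man page
 * excerpt above, -EINVAL for anything else. */
static int demo_check_advice(int advice)
{
    switch (advice) {
    case POSIX_FADV_NORMAL:
    case POSIX_FADV_SEQUENTIAL:
    case POSIX_FADV_RANDOM:
    case POSIX_FADV_NOREUSE:
    case POSIX_FADV_WILLNEED:
    case POSIX_FADV_DONTNEED:
        return 0;
    default:
        return -EINVAL;
    }
}

/* Hypothetical model of the XIP path: the advice is still not acted on,
 * but an invalid value is no longer silently accepted. */
static int demo_fadvise_xip(int advice)
{
    return demo_check_advice(advice);
}
```

    With a check like this, a test that passes a bogus advice value sees
    EINVAL whether or not the filesystem provides get_xip_page.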

    Signed-off-by: Masatake YAMATO
    Acked-by: Carsten Otte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
     
  • The show_mem() output does not include the total number of pagecache
    pages. This would be helpful when analyzing the debug information in
    the /var/log/messages file after OOM kills occur.

    This patch includes the total pagecache pages in that output.

    Signed-off-by: Larry Woodman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make
    try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
    was changed to bail out if the page was dirty.

    That in turn caused truncate_complete_page to leak massive amounts of
    memory, because the dirty bit was only cleared after the call to
    try_to_free_buffers.

    So the call to cancel_dirty_page was moved up to have the dirty bit
    cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate:
    clear page dirtiness before running try_to_free_buffers()").

    The problem with that fix is, that the page can be redirtied after
    cancel_dirty_page was called, eg. like this:

    truncate_complete_page()
      cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
      do_invalidatepage()
        ext3_invalidatepage()
          journal_invalidatepage()
            journal_unmap_buffer()
              __dispose_buffer()
                __journal_unfile_buffer()
                  __journal_temp_unlink_buffer()
                    mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

    And then we end up with dirty pages being wrongly accounted.

    As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect
    'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
    were reverted, so the original reason for the massive memory leak is
    gone, and we can also revert the move of the call to cancel_dirty_page
    from truncate_complete_page and get the accounting right again.

    I'm not sure if it matters, but as opposed to the final check in
    __remove_from_page_cache, this one also cares about the task io
    accounting, so maybe we want to use this instead, although it's not
    quite the clean fix either.

    Signed-off-by: Björn Steinbrink
    Tested-by: Krzysztof Piotr Oledzki
    Cc: Jan Kara
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Thomas Osterried
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bjorn Steinbrink
     
  • The current PageTail semantic is that a PageTail page is first a
    PageCompound page. So remove the redundant PageCompound test in
    set_page_refcounted().

    Signed-off-by: Qi Yong
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qi Yong
     
  • fastcall is always defined to be empty, remove it

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Since I_SYNC was split out from I_LOCK, the concern in commit
    4b89eed93e0fa40a63e3d7b1796ec1337ea7a3aa ("Write back inode data pages
    even when the inode itself is locked") is no longer valid.

    We should revert to the original behavior: in __writeback_single_inode(),
    when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
    skip writing entirely. Otherwise, we are double calling do_writepages().

    Signed-off-by: Qi Yong
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Joern Engel
    Cc: WU Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qi Yong
     
  • try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
    migrating), and recycles it back to the active list. But if it's an
    anonymous page, we've already allocated swap to it: just wasting swap.
    Spot locked pages in page_referenced_one and treat them as referenced.

    Signed-off-by: Hugh Dickins
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Ethan Solomita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the prefetch logic in order to avoid touching impossible per cpu
    areas.

    Signed-off-by: Christoph Lameter
    Cc: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch fixes a sles9 system hang in start_this_handle from a customer
    with some heavy workload where all tasks are waiting on kjournald to commit
    the transaction, but kjournald waits on t_updates to go down to zero (it
    never does).

    This was reported as a lowmem shortage deadlock but when checking the debug
    data I noticed the VM wasn't under pressure at all (well it was really
    under vm pressure, because lots of tasks hung in the VM prune_dcache
    methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
    mode, the holder of the journal handle should have if this was a vm issue
    in the first place).

    No task was apparently holding the leftover handle in the committing
    transaction, so I deduced t_updates was stuck to 1 because a journal_stop
    was never run by some path (this turned out to be correct). With a debug
    patch adding proper reverse links and stack trace logging in ext3 deployed
    in production, I found journal_stop is never run because
    mark_inode_dirty_sync is called inside release_task called by do_exit.
    (that was quite fun because I would have never thought about this
    subtleness, I thought a regular path in ext3 had a bug and it forgot to
    call journal_stop)

    do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
    come back to run journal_stop)

    The reason is that shrink_dcache_parent is racy by design (feature not
    a bug) and it can do blocking I/O in some case, but the point is that
    calling shrink_dcache_parent at the last stage of do_exit isn't safe
    for self-reaping tasks.

    I guess the memory pressure of the unbalanced highmem system allowed
    to trigger this more easily.

    Now mainline doesn't have this line in iput (like sles9 has):

    if (inode->i_state & I_DIRTY_DELAYED)
        mark_inode_dirty_sync(inode);

    so it will probably not crash with ext3, but for example ext2 implements an
    I/O-blocking ext2_put_inode that will lead to similar screwups with
    ext2_free_blocks never coming back and it's definitely wrong to call
    blocking-IO paths inside do_exit. So this should fix a subtle bug in
    mainline too (not verified in practice though). The equivalent fix for
    ext3 is also not verified yet to fix the problem in sles9 but I don't have
    doubt it will (it usually takes days to crash, so it'll take weeks to be
    sure).

    An alternate fix would be to offload that work to a kernel thread, but I
    don't think a reschedule for this is worth it, the vm should be able to
    collect those entries for the synchronous release_task.

    Signed-off-by: Andrea Arcangeli
    Cc: Jan Kara
    Cc: Ingo Molnar
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add vm.highmem_is_dirtyable toggle

    A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
    approximately 2Gb size which contains a hash format that is written
    randomly by the dbclean process. On 2.6.16 this process took a few
    minutes. With lowmem only accounting of dirty ratios, this takes about 12
    hours of 100% disk IO, all random writes.

    Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
    add the highmem back to the total available memory count.

    [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
    Signed-off-by: Bron Gondwana
    Cc: Ethan Solomita
    Cc: Peter Zijlstra
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bron Gondwana
     
  • We have repeatedly discussed if the cold pages still have a point. There is
    one way to join the two lists: Use a single list and put the cold pages at the
    end and the hot pages at the beginning. That way a single list can serve for
    both types of allocations.

    The discussion of the RFC for this and Mel's measurements indicate that
    there may not be too much of a point left to having separate lists for
    hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
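
    The single-list idea can be sketched as follows. This is a minimal
    illustration, not the kernel's per-cpu page list code: struct
    demo_page, struct demo_pcp_list and the demo_* functions are
    hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct demo_page {
    int id;
    struct demo_page *prev, *next;
};

/* One circular list with a sentinel: hot pages at the head, cold at the tail. */
struct demo_pcp_list {
    struct demo_page head;
    int count;
};

static void demo_list_init(struct demo_pcp_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void demo_insert_between(struct demo_page *p, struct demo_page *prev,
                                struct demo_page *next)
{
    p->prev = prev;
    p->next = next;
    prev->next = p;
    next->prev = p;
}

/* Freeing: a hot page goes to the front, a cold page to the back. */
static void demo_free_page(struct demo_pcp_list *l, struct demo_page *p, int cold)
{
    if (cold)
        demo_insert_between(p, l->head.prev, &l->head);
    else
        demo_insert_between(p, &l->head, l->head.next);
    l->count++;
}

/* Allocating: hot requests take from the front, cold requests from the back. */
static struct demo_page *demo_alloc_page(struct demo_pcp_list *l, int cold)
{
    struct demo_page *p;

    if (l->count == 0)
        return NULL;
    p = cold ? l->head.prev : l->head.next;
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = NULL;
    l->count--;
    return p;
}
```

    When only one kind of page is on the list, a request for the other kind
    still succeeds; it simply gets the least suitable page from the far end.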

    Signed-off-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
    vmlist search routine in __get_vm_area_node can mistakenly allow a driver
    to ioremap a range larger than vmalloc space.

    If at the time of the ioremap all existing vmlist areas sit below the
    determined alignment then the search routine continues past all entries and
    exits the for loop - straight into the found: label - without ever testing
    for integer wrapping or that the requested size fits.

    We were seeing a driver successfully ioremap 128M of flash even though
    there was only 120M of vmalloc space. From that point the system was left
    with the remainder of the first 16M of space to vmalloc/ioremap within.
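
    A simplified model of the search shows where the missing test belongs.
    This is a sketch under stated assumptions, not the code of
    __get_vm_area_node: struct demo_area and demo_find_hole are
    hypothetical, and the existing areas are assumed sorted by address and
    non-overlapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_area {
    uintptr_t addr;
    size_t size;
};

/* Round x up to a power-of-two alignment; fail if the result wraps. */
static bool demo_align_up(uintptr_t x, uintptr_t align, uintptr_t *out)
{
    uintptr_t r = (x + align - 1) & ~(align - 1);

    if (r < x)
        return false;
    *out = r;
    return true;
}

/* First-fit search for an aligned hole of `size` bytes in [start, end). */
static bool demo_find_hole(const struct demo_area *areas, size_t n,
                           uintptr_t start, uintptr_t end,
                           size_t size, uintptr_t align, uintptr_t *found)
{
    uintptr_t addr;

    assert(align != 0 && (align & (align - 1)) == 0);
    if (size == 0 || !demo_align_up(start, align, &addr))
        return false;

    for (size_t i = 0; i < n; i++) {
        uintptr_t area_end = areas[i].addr + areas[i].size;

        if (area_end <= addr)
            continue;               /* area lies below the candidate */
        if (addr + (uintptr_t)size < addr)
            return false;           /* candidate range wraps */
        if (addr + (uintptr_t)size <= areas[i].addr)
            break;                  /* hole before this area is big enough */
        if (!demo_align_up(area_end, align, &addr))
            return false;
    }

    /* This final test must also run when the loop falls off the end,
     * which is the case the commit message says was left unchecked. */
    if (addr + (uintptr_t)size < addr || addr + (uintptr_t)size > end)
        return false;
    *found = addr;
    return true;
}
```

    In the reported scenario every existing area sits below the aligned
    candidate address, so the loop body never rejects anything and only the
    final bounds test can refuse the 128M request.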

    Signed-off-by: Robert Bragg
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert Bragg
     
  • 1. Add comments explaining how the function can be called.

    2. Collect global diffs in a local array and only spill
    them once into the global counters when the zone scan
    is finished. This means that we only touch each global
    counter once instead of each time we fold cpu counters
    into zone counters.
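
    The second point can be sketched like this. It is a toy model, not the
    vmstat code: the struct layout, array sizes and all demo_* names are
    hypothetical, and demo_global_updates exists only to make the number of
    global counter touches visible.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_ITEMS 3

struct demo_zone {
    long cpu_diff[DEMO_NR_ITEMS];   /* pending per-cpu deltas for this zone */
    long zone_stat[DEMO_NR_ITEMS];  /* per-zone totals */
};

static atomic_long demo_global_stat[DEMO_NR_ITEMS];
static int demo_global_updates;     /* how often a global counter was touched */

/* Fold per-cpu deltas into the zone counters, collecting the global part
 * in a local array and spilling it once after all zones were scanned. */
static void demo_refresh_cpu_stats(struct demo_zone *zones, int nr_zones)
{
    long global_diff[DEMO_NR_ITEMS] = { 0 };

    for (int z = 0; z < nr_zones; z++) {
        for (int i = 0; i < DEMO_NR_ITEMS; i++) {
            long d = zones[z].cpu_diff[i];

            if (!d)
                continue;
            zones[z].cpu_diff[i] = 0;
            zones[z].zone_stat[i] += d;
            global_diff[i] += d;    /* local, no shared cacheline touched */
        }
    }

    for (int i = 0; i < DEMO_NR_ITEMS; i++) {
        if (global_diff[i]) {
            atomic_fetch_add(&demo_global_stat[i], global_diff[i]);
            demo_global_updates++;
        }
    }
}
```

    A side effect of batching is that deltas which cancel out across zones
    never reach the global counter at all.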

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • In order to change the layout of the page tables after an mmap has crossed the
    address space limit of the current page table layout an architecture hook in
    get_unmapped_area is needed. The arguments are the address of the new mapping
    and the length of it.

    Cc: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • (with Martin Schwidefsky)

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    first argument. The free functions do not get the mm_struct argument. This
    is 1) asymmetrical and 2) to do mm related page table allocations the mm
    argument is needed on the free function as well.

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Cc:
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • - Add comments explaining how drain_pages() works.

    - Eliminate useless functions

    - Rename drain_all_local_pages to drain_all_pages(). It does drain
    all pages not only those of the local processor.

    - Eliminate useless interrupt off / on sequences. drain_pages()
    disables interrupts on its own. The execution thread is
    pinned to processor by the caller. So there is no need to
    disable interrupts.

    - Put drain_all_pages() declaration in gfp.h and remove the
    declarations from suspend.h and from mm/memory_hotplug.c

    - Make software suspend call drain_all_pages(). The draining
    of processor local pages may not be the right approach if
    software suspend wants to support SMP. If they call drain_all_pages
    then we can make drain_pages() static.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Christoph Lameter
    Acked-by: Mel Gorman
    Cc: "Rafael J. Wysocki"
    Cc: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Most pagecache (and some other) radix tree insertions have the great
    opportunity to preallocate a few nodes with relaxed gfp flags. But the
    preallocation is squandered: when it comes time to allocate a node, we
    default to first attempting a GFP_ATOMIC allocation -- that doesn't
    normally fail, but it can eat into atomic memory reserves that we don't
    need to be using.

    Another upshot of this is that it removes the sometimes highly contended
    zone->lock from underneath tree_lock. Pagecache insertions are always
    performed with a radix tree preload, and after this change, such a
    situation will never fall back to kmem_cache_alloc within
    radix_tree_node_alloc.
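
    The allocation order being described can be modelled in user space.
    This is only a sketch of the "use the preloaded nodes first" idea, not
    the radix tree code: malloc stands in for both the sleeping and the
    atomic allocator, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_PRELOAD_SIZE 4

struct demo_node {
    struct demo_node *next;
};

static struct demo_node *demo_pool;   /* models the preload pool */
static int demo_pool_count;
static int demo_atomic_allocs;        /* counts fallbacks to the atomic path */

/* Called where sleeping is allowed: top up the pool with relaxed flags.
 * Returns 0 on success, -1 if memory ran out. */
static int demo_preload(void)
{
    while (demo_pool_count < DEMO_PRELOAD_SIZE) {
        struct demo_node *n = malloc(sizeof(*n));

        if (!n)
            return -1;
        n->next = demo_pool;
        demo_pool = n;
        demo_pool_count++;
    }
    return 0;
}

/* Called in atomic context: consume a preloaded node if there is one and
 * only fall back to the (reserve-eating) atomic allocation otherwise. */
static struct demo_node *demo_node_alloc(void)
{
    struct demo_node *n = demo_pool;

    if (n) {
        demo_pool = n->next;
        demo_pool_count--;
        return n;
    }
    demo_atomic_allocs++;
    return malloc(sizeof(*n));
}
```

    As long as the caller preloaded enough nodes, the insertion path never
    reaches the fallback, so it neither dips into the atomic reserves nor
    enters the page allocator while holding the tree lock.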

    David Miller reports seeing this allocation fail on a highly threaded
    sparc64 system:

    [527319.459981] dd: page allocation failure. order:0, mode:0x20
    [527319.460403] Call Trace:
    [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
    [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
    [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90
    [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260
    [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0
    [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134
    [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
    [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214
    [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428
    [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170
    [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0
    [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c
    [527319.461361] [00000000004bb920] sys_read+0x34/0x60
    [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40

    The calltrace is significant: __do_page_cache_readahead allocates a number
    of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
    memory to satisfy GFP_ATOMIC allocations. However after the list of pages
    goes to mpage_readpages, there can be significant intervals (including disk
    IO) before all the pages are inserted into the radix-tree. So the reserves
    can easily be depleted at that point. The patch is confirmed to fix the
    problem.

    Signed-off-by: Nick Piggin
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __vmalloc_area_node() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This code in mm/tiny-shmem.c is under #if 0 - remove it.

    Signed-off-by: Balbir Singh
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • task_dirty_limit() can become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Make /proc/ page monitoring configurable

    This puts the following files under an embedded config option:

    /proc/pid/clear_refs
    /proc/pid/smaps
    /proc/pid/pagemap
    /proc/kpagecount
    /proc/kpageflags

    [akpm@linux-foundation.org: Kconfig fix]
    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • This makes a subset of physical page flags available to userspace. Together
    with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors.

    Exported flags are decoupled from the kernel's internal flags. This
    allows us to reorder flag bits, and synthesize any bits that get
    redefined in terms of other bits.
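
    The decoupling can be pictured as an explicit bit-by-bit translation.
    All bit numbers and names below are hypothetical and chosen only for
    the illustration; they are not the kernel's actual flag layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical internal bit positions (free to change between versions). */
enum { DEMO_PG_locked = 0, DEMO_PG_dirty = 4, DEMO_PG_lru = 5 };

/* Hypothetical exported bit positions (meant to stay stable). */
enum { DEMO_KPF_LOCKED = 0, DEMO_KPF_DIRTY = 1, DEMO_KPF_LRU = 2 };

/* Move one bit from its internal position to its exported position. */
static uint64_t demo_export_bit(uint64_t internal, int from, int to)
{
    return ((internal >> from) & 1) << to;
}

/* Build the exported word; internal bits without a mapping are dropped. */
static uint64_t demo_export_flags(uint64_t internal)
{
    return demo_export_bit(internal, DEMO_PG_locked, DEMO_KPF_LOCKED) |
           demo_export_bit(internal, DEMO_PG_dirty, DEMO_KPF_DIRTY) |
           demo_export_bit(internal, DEMO_PG_lru, DEMO_KPF_LRU);
}
```

    Renumbering an internal flag then only requires changing one enum
    value; the word seen by userspace keeps its layout.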

    [akpm@linux-foundation.org: remove unneeded access_ok()]
    [akpm@linux-foundation.org: s/0/NULL/]
    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • This makes physical page map counts available to userspace. Together
    with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
    monitor memory usage on a per-page basis.

    [akpm@linux-foundation.org: remove unneeded access_ok()]
    [bunk@stusta.de: make struct proc_kpagemap static]
    Signed-off-by: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Cc: David Rientjes
    Signed-off-by: Adrian Bunk
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • This interface provides a mapping for each page in an address space to its
    physical page frame number, allowing precise determination of what pages are
    mapped and what pages are shared between processes.

    New in this version:

    - headers gone again (as recommended by Dave Hansen and Alan Cox)
    - 64-bit entries (as per discussion with Andi Kleen)
    - swap pte information exported (from Dave Hansen)
    - page walker callback for holes (from Dave Hansen)
    - direct put_user I/O (as suggested by Rusty Russell)

    This patch folds in cleanups and swap PTE support from Dave Hansen.
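
    With 64-bit entries and no header, locating the entry for an address is
    plain arithmetic. The sketch below assumes one entry per virtual page
    starting at file offset 0, which the text implies but does not spell
    out; the contents of an entry are not described here, so the helper
    only computes the offset. demo_pagemap_offset is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGEMAP_ENTRY_BYTES 8   /* one 64-bit entry per page */

/* File offset of the entry describing the page that contains vaddr. */
static uint64_t demo_pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0);
    return (vaddr / page_size) * DEMO_PAGEMAP_ENTRY_BYTES;
}
```

    A reader would seek to that offset in the pagemap file and read eight
    bytes for each page of interest.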

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Reorder source so that all the code and data for each interface is together.

    Signed-off-by: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Cc: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • This puts all the clear_refs code where it belongs and probably lets things
    compile on MMU-less systems as well.

    Signed-off-by: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Cc: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • This pulls the shared map display code out of show_map and puts it in
    show_smap where it belongs.

    Signed-off-by: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Use the generic pagewalker for smaps and clear_refs

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Introduce a general page table walker

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Move is_swap_pte helper function to swapops.h for use by pagemap code

    Signed-off-by: Matt Mackall
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • The following replaces the earlier patches sent. It should address
    David Rientjes's comments, and has been compile tested on all the
    architectures that it touches, save for parisc.

    For the /proc/<pid>/pagemap code[1], we need to be able to query how
    much virtual address space a particular task has. The trick is
    that we do it through /proc and can't use TASK_SIZE since it
    references "current" on some arches. The process opening the
    /proc file might be a 32-bit process opening a 64-bit process's
    pagemap file.

    x86_64 already has a TASK_SIZE_OF() macro:

    #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)

    I'd like to have that for other architectures. So, add it
    for all the architectures that actually use "current" in
    their TASK_SIZE. For the others, just add a quick #define
    in sched.h to use plain old TASK_SIZE.
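
    The difference between the two macros can be shown with a toy model.
    Everything here is hypothetical (struct demo_task, the two size
    constants, demo_current); it only mirrors the shape of the x86_64 macro
    quoted above and of the generic fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_task {
    bool is_32bit;
};

#define DEMO_TASK_SIZE32 UINT64_C(0x100000000)
#define DEMO_TASK_SIZE64 UINT64_C(0x800000000000)

/* Models "current": the task that is executing, e.g. the /proc reader. */
static struct demo_task *demo_current;

/* Like TASK_SIZE on some arches: silently depends on the running task. */
#define DEMO_TASK_SIZE \
    (demo_current->is_32bit ? DEMO_TASK_SIZE32 : DEMO_TASK_SIZE64)

/* Arch-provided variant: answers for an arbitrary task. */
#define DEMO_TASK_SIZE_OF(tsk) \
    ((tsk)->is_32bit ? DEMO_TASK_SIZE32 : DEMO_TASK_SIZE64)

/* Generic fallback for arches whose TASK_SIZE does not use "current";
 * skipped in this file because the variant above is already defined. */
#ifndef DEMO_TASK_SIZE_OF
#define DEMO_TASK_SIZE_OF(tsk) DEMO_TASK_SIZE
#endif
```

    When a 32-bit process inspects a 64-bit one, the "current"-based macro
    reports the reader's limit, while the per-task macro reports the limit
    of the task actually being examined.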

    1. http://www.linuxworld.com/news/2007/042407-kernel.html

    - MIPS portion from Ralf Baechle

    [akpm@linux-foundation.org: fix mips build]
    Signed-off-by: Dave Hansen
    Signed-off-by: Ralf Baechle
    Signed-off-by: Matt Mackall
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The "proportional set size" (PSS) of a process is the count of pages it has
    in memory, where each page is divided by the number of processes sharing
    it. So if a process has 1000 pages all to itself, and 1000 shared with one
    other process, its PSS will be 1500.

    - lwn.net: "ELC: How much memory are applications really using?"

    The PSS proposed by Matt Mackall is a very nice metric for measuring a
    process's memory footprint. So collect and export it via
    /proc/<pid>/smaps.

    Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
    job. They are comprehensive tools. But for PSS, let's do it in the simple
    way.
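
    The arithmetic from the quoted definition can be checked with a few
    lines of C. This is a sketch, not the smaps code: demo_pss_pages and
    DEMO_PSS_SHIFT are hypothetical, and fixed-point accumulation is used
    here simply so that fractional shares are not lost before the sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PSS_SHIFT 12   /* fractional bits kept while summing */

/* mapcounts[i] is the number of processes sharing page i (at least 1).
 * Returns the proportional set size in whole pages, rounded down. */
static uint64_t demo_pss_pages(const unsigned *mapcounts, size_t n)
{
    uint64_t pss = 0;

    for (size_t i = 0; i < n; i++) {
        assert(mapcounts[i] >= 1);
        pss += (UINT64_C(1) << DEMO_PSS_SHIFT) / mapcounts[i];
    }
    return pss >> DEMO_PSS_SHIFT;
}
```

    For 1000 private pages and 1000 pages shared with one other process
    this yields 1000 + 1000/2 = 1500 pages, matching the example in the
    quote.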

    Cc: John Berthels
    Cc: Bernardo Innocenti
    Cc: Padraig Brady
    Cc: Denys Vlasenko
    Cc: Balbir Singh
    Signed-off-by: Matt Mackall
    Signed-off-by: Fengguang Wu
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • vmtruncate is a twisted maze of gotos. This patch cleans it up to use a
    proper if/else for the two major cases, extending and shrinking
    truncates, and thus makes it a lot more readable while keeping exactly
    the same functionality.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Intensive swapoff testing shows shmem_unuse spinning on an entry in
    shmem_swaplist pointing to itself: how does that come about? Days pass...

    First guess is this: shmem_delete_inode tests list_empty without taking the
    global mutex (so the swapping case doesn't slow down the common case); but
    there's an instant in shmem_unuse_inode's list_move_tail when the list entry
    may appear empty (a rare case, because it's actually moving the head not
    the list member). So there's a danger of leaving the inode on the swaplist
    when it's freed, then reinitialized to point to itself when reused. Fix that
    by skipping the list_move_tail when it's a no-op, which happens to plug this.
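
    The transient "empty" state is easy to reproduce with a minimal
    circular list. This is a user-space sketch, not the shmem code: the
    demo_* helpers only imitate the usual unlink-then-reinsert behaviour of
    a list move, and demo_move_head_before is a hypothetical rendering of
    "skip the move when it's a no-op".

```c
#include <assert.h>
#include <stdbool.h>

struct demo_list {
    struct demo_list *prev, *next;
};

static void demo_init(struct demo_list *l)
{
    l->prev = l->next = l;
}

/* Unlocked emptiness test, as a racing deleter would perform it. */
static bool demo_list_empty(const struct demo_list *l)
{
    return l->next == l;
}

static void demo_unlink(struct demo_list *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Insert e immediately before pos. */
static void demo_add_tail(struct demo_list *e, struct demo_list *pos)
{
    e->prev = pos->prev;
    e->next = pos;
    pos->prev->next = e;
    pos->prev = e;
}

/* A move is an unlink followed by a re-insert: two separate steps. */
static void demo_move_tail(struct demo_list *e, struct demo_list *pos)
{
    demo_unlink(e);
    demo_add_tail(e, pos);
}

/* Move the list head so that entry becomes first, unless it already is. */
static void demo_move_head_before(struct demo_list *head, struct demo_list *entry)
{
    if (head->next != entry)
        demo_move_tail(head, entry);
}
```

    With a head and a single entry, unlinking the head leaves the entry
    pointing at itself until the re-insert completes, so an unlocked
    emptiness test in that window reports the entry as not listed.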

    But this same spinning then surfaces on another machine. Ah, I'd never
    suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
    still hold page lock, which would hold off inode deletion if the page were in
    pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
    doesn't wait on locked pages). Hmm: we could put the inode on swaplist
    earlier, but then shmem_unuse_inode could never prune unswapped inodes.

    Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
    though I am a little uneasy about the iput which has to follow - it works, and
    I see nothing wrong with it, but it is surprising that shmem inode deletion
    may now occur below shmem_writepage. Revisit this fix later?

    And while we're looking at these races: the way shmem_unuse tests swapped
    without holding info->lock looks unsafe, if we've more than one swap area: a
    racing shmem_writepage on another page of the same inode could be putting it
    in swapcache, just as we're deciding to remove the inode from swaplist -
    there's a danger of going on swap without being listed, so a later swapoff
    would hang, being unable to locate the entry. Move that test and removal down
    into shmem_unuse_inode, once info->lock is held.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
    or swap cache, without any radix tree preload: so tending to deplete emergency
    reserves of memory.

    GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
    being called under memory pressure, so must not wait for more memory to become
    available. But shmem_unuse_inode now has a window in which it can and should
    preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
    add_to_page_cache.

    shmem_getpage is not so straightforward: its filepage/swappage integrity
    relies upon exchanging between caches under spinlock, and it would need a lot
    of restructuring to place the preloads correctly. Instead, follow its pattern
    of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
    add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
    radix_tree_preload, followed immediately by radix_tree_preload_end - that
    won't guarantee success in the next add_to_page_cache, but doesn't need to.

    And we can then remove that bothersome congestion_wait: when needed, it'll
    automatically get done in the course of the radix_tree_preload.

    Signed-off-by: Hugh Dickins
    Looks-good-to: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There are a couple of reasons (patches follow) why it would be good to open a
    window for sleep in shmem_unuse_inode, between its search for a matching swap
    entry, and its handling of the entry found.

    shmem_unuse_inode must then use igrab to hold the inode against deletion in
    that window, and its corresponding iput might result in deletion: so it had
    better unlock_page before the iput, and might as well release the page too.

    Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
    leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
    into shmem_unuse_inode, in the case when it finds a match.

    Let try_to_unuse break on error in the shmem_unuse case, as it does in the
    unuse_mm case: though at this point in the series, no error to break on.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins