24 Apr, 2007

3 commits

  • NR_FILE_PAGES must be accounted for depending on the zone that the page
    belongs to. If we replace the page in the radix tree then we may have to
    shift the count to another zone.
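
    A minimal sketch of the counter shift (illustrative, not the literal
    patch; these are the standard per-zone stat accessors):

        if (page_zone(oldpage) != page_zone(newpage)) {
                __dec_zone_page_state(oldpage, NR_FILE_PAGES);
                __inc_zone_page_state(newpage, NR_FILE_PAGES);
        }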

    Suggested-by: Ethan Solomita
    Eventually-typed-in-by: Christoph Lameter
    Cc: Martin Bligh
    Cc:
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog
    to see lots of other processes killed with "No available memory
    (MPOL_BIND)". memhog is killed correctly once we initialize nodemask in
    constrained_alloc().
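
    A rough sketch of the shape of the fix (assumption: the mask should start
    from the online nodes before the policy-allowed nodes are cleared out of
    it, rather than being left uninitialized on the stack):

        nodemask_t nodes = node_online_map;     /* was: nodemask_t nodes; */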

    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Acked-by: William Irwin
    Acked-by: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • oom_kill_task() calls __oom_kill_task() to OOM kill a selected task.
    When finding other threads that share an mm with that task, we need to
    kill each of those individual threads, not the originally selected task again.

    (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584)
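
    A hedged sketch of the corrected loop (signatures illustrative):

        do_each_thread(g, q) {
                if (q->mm == p->mm && q->tgid != p->tgid)
                        __oom_kill_task(q, 1);  /* was: __oom_kill_task(p, 1) */
        } while_each_thread(g, q);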

    Acked-by: William Irwin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Apr, 2007

1 commit


05 Apr, 2007

1 commit


04 Apr, 2007

2 commits

  • Mention the slab name when listing corrupt objects. Although the function
    that released the memory is mentioned, that is frequently ambiguous as such
    functions often release several pieces of memory.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which
    added the page_test_and_clear_dirty call to page_mkclean and the
    git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes
    the "nasty and subtle race in shared mmap'ed page writeback"
    problem in clear_page_dirty_for_io cause data corruption on s390.

    The effect of the two changes is that for every call to
    clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
    the per page dirty bit is set, set_page_dirty is called. Strangely,
    clear_page_dirty_for_io is called for not-uptodate pages, e.g.
    over this call-chain:

    [] clear_page_dirty_for_io+0x12a/0x130
    [] generic_writepages+0x258/0x3e0
    [] do_writepages+0x76/0x7c
    [] __writeback_single_inode+0xba/0x3e4
    [] sync_sb_inodes+0x23e/0x398
    [] writeback_inodes+0x12e/0x140
    [] wb_kupdate+0xd2/0x178
    [] pdflush+0x162/0x23c

    The bad news now is that page_test_and_clear_dirty might claim
    that a not-uptodate page is dirty since SetPageUptodate which
    resets the per page dirty bit has not yet been called. The page
    writeback that follows clobbers the data on disk.

    The simplest solution to this problem is to move the call to
    page_test_and_clear_dirty under the "if (page_mapped(page))".
    If a file backed page is mapped it is uptodate.
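
    A hedged sketch of the resulting page_mkclean() (details illustrative):

        int page_mkclean(struct page *page)
        {
                int ret = 0;

                BUG_ON(!PageLocked(page));

                if (page_mapped(page)) {
                        struct address_space *mapping = page_mapping(page);
                        if (mapping) {
                                ret = page_mkclean_file(mapping, page);
                                /* only mapped file pages are known uptodate,
                                   so only test the storage key here */
                                if (page_test_and_clear_dirty(page))
                                        ret = 1;
                        }
                }
                return ret;
        }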

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

29 Mar, 2007

5 commits

  • Fix the bug that reading into an xip mapping from /dev/zero fills the user
    page table with ZERO_PAGE() entries. Later on, xip cannot tell which pages
    have been ZERO_PAGE() filled by access to a sparse mapping, and which ones
    originate from /dev/zero. It will unmap ZERO_PAGE from all mappings when
    filling the sparse hole with data. xip now uses its own zeroed page
    for its sparse mappings. Please apply.

    Signed-off-by: Carsten Otte
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     
  • sys_madvise has down_write of mmap_sem, then madvise_remove calls
    vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise
    deadlocks from that ordering.

    Have madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since
    madvise_remove doesn't split or merge vmas, it's easy to handle this case with
    a NULL prev, without restructuring sys_madvise. (Though sad to retake
    mmap_sem when it's unlikely to be needed, and certainly down_read is
    sufficient for MADV_REMOVE, unlike the other madvices.)
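
    A hedged sketch of the shape of the fix (details illustrative):

        static long madvise_remove(struct vm_area_struct *vma,
                                   struct vm_area_struct **prev,
                                   unsigned long start, unsigned long end)
        {
                ...
                *prev = NULL;   /* tell sys_madvise we have dropped mmap_sem */

                up_write(&current->mm->mmap_sem);
                error = vmtruncate_range(mapping->host, offset, endoff);
                down_write(&current->mm->mmap_sem);
                return error;
        }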

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • shmem_truncate_range has its own truncate_inode_pages_range, to free any pages
    racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when
    this might have happened. But holepunching gets no chance to clear that flag
    at the start of vmtruncate_range, so it's always set (unless a truncate came
    just before), so holepunch almost always does this second
    truncate_inode_pages_range.

    shmem holepunch has unlikely swapfile races hereabouts whatever we do
    (without a fuller rework than is fit for this release): I was going to skip
    the second truncate in the punch_hole case, but Miklos points out that would
    make holepunch correctness more vulnerable to swapoff. So keep the second
    truncate, but follow it by an unmap_mapping_range to eliminate the
    disconnected pages (freed from pagecache while still mapped in userspace) that
    it might have left behind.
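
    A minimal sketch of the resulting sequence in shmem's holepunch path
    (offsets illustrative):

        truncate_inode_pages_range(inode->i_mapping, start, end);
        if (punch_hole)
                /* drop any ptes still mapping pages just freed from pagecache */
                unmap_mapping_range(inode->i_mapping, start, end - start + 1, 0);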

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miklos Szeredi observes that during truncation of shmem page directories,
    info->lock is released to improve latency (after lowering i_size and
    next_index to exclude races); but this is quite wrong for holepunching, which
    receives no such protection from i_size or next_index, and is left vulnerable
    to races with shmem_unuse, shmem_getpage and shmem_writepage.

    Hold info->lock throughout when holepunching? No, any user could prevent
    rescheduling for far too long. Instead take info->lock just when needed: in
    shmem_free_swp when removing the swap entries, and whenever removing a
    directory page from the level above. But so long as we remove before
    scanning, we can safely skip taking the lock at the lower levels, except at
    misaligned start and end of the hole.

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare
    circumstances, because shmem_truncate_range() erroneously removes partially
    truncated directory pages at the end of the range: later reclaim on pages
    pointing to these removed directories triggers the BUG. Indeed, and it can
    also cause data loss beyond the hole.

    Fix this as in the patch proposed by Miklos, but distinguish between "limit"
    (how far we need to search: ignore truncation's next_index optimization in the
    holepunch case - if there are races it's more consistent to act on the whole
    range specified) and "upper_limit" (how far we can free directory pages:
    generally we must be careful to keep partially punched pages, but can relax at
    end of file - i_size being held stable by i_mutex).

    Signed-off-by: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Mar, 2007

1 commit

  • There is a small problem in handling page bounce.

    At the moment blk_max_pfn equals max_pfn, which is in fact not the maximum
    possible _number_ of a page frame, but the _count_ of page frames. For
    example, on a 32-bit x86 node with 4GB RAM, max_pfn = 0x100000, not
    0xFFFFF.

    The request_queue structure has a member q->bounce_pfn, and the queue needs
    bounce pages for pages _above_ this limit. This is handled by
    blk_queue_bounce(), where the following check is performed:

    if (q->bounce_pfn >= blk_max_pfn)
    return;

    Assume that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
    equals 0x10000. In that situation the check above fails, and for every bio
    we fall through to iterating over all the pages tied to the bio.

    Note that for quite a wide range of device drivers (ide, md, ...) this
    problem doesn't occur, because they use BLK_BOUNCE_ANY for bounce_pfn.
    BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, so the check above
    doesn't fail. But for other drivers, which obtain the required value from
    the device, it does fail. For example, sata_nv uses ATA_DMA_MASK or
    dev->dma_mask.

    I propose to use (max_pfn - 1) for blk_max_pfn, and the same for
    blk_max_low_pfn. The patch also cleans up some checks related to
    bounce_pfn.
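
    A small arithmetic illustration of the off-by-one (values from the 4GB
    example above; not the literal patch):

        /* max_pfn counts page frames, so the highest valid pfn is max_pfn - 1 */
        unsigned long max_pfn     = 0x100000;     /* 4GB / 4KB pages */
        unsigned long blk_max_pfn = max_pfn - 1;  /* 0xFFFFF, the last pfn */

        /* a queue that can reach all of memory sets bounce_pfn to the last pfn */
        if (q->bounce_pfn >= blk_max_pfn)         /* 0xFFFFF >= 0xFFFFF: true */
                return;                           /* no bouncing, skip the scan */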

    Signed-off-by: Vasily Tarasov
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Vasily Tarasov
     

23 Mar, 2007

2 commits

  • Make the SYSV SHM nattch counter work correctly by forcing multiple VMAs to
    be produced to represent MAP_SHARED segments, even if they overlap exactly.

    Using this test program:

    http://people.redhat.com/~dhowells/doshm.c

    Run as:

    doshm sysv

    I can see nattch going from one before the patch:

    # /doshm sysv
    Command: sysv
    shmid: 65536
    memory: 0xc3700000
    c0b00000-c0b04000 rw-p 00000000 00:00 0
    c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so
    c3520000-c352278c rw-p 00000000 00:0b 13763417 /doshm
    c3584000-c35865e8 r-xs 00000000 00:0b 13763417 /doshm
    c3588000-c358aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3590000-c359b6c0 rw-p 00000000 00:00 0
    c3620000-c3640000 rwxp 00000000 00:00 0
    c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted)
    c3700000-c37fa000 rw-S 00000000 00:06 1411 /SYSV00000000 (deleted)
    nattch 1

    To two after the patch:

    # /doshm sysv
    Command: sysv
    shmid: 0
    memory: 0xc3700000
    c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3180000-c31dede4 r-xs 00000000 00:0b 14582179 /lib/libuClibc-0.9.28.so
    c3320000-c3340000 rwxp 00000000 00:00 0
    c3530000-c35325e8 r-xs 00000000 00:0b 13763417 /doshm
    c3534000-c353678c rw-p 00000000 00:0b 13763417 /doshm
    c3538000-c353aa00 rw-p 00008000 00:0b 14582157 /lib/ld-uClibc-0.9.28.so
    c3590000-c359b6c0 rw-p 00000000 00:00 0
    c35a4000-c35a8000 rw-p 00000000 00:00 0
    c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted)
    c3700000-c37fa000 rw-S 00000000 00:06 1369 /SYSV00000000 (deleted)
    nattch 2

    That's +1 to nattch for each shmat() made.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.

    Signed-off-by: David Howells
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

17 Mar, 2007

4 commits

  • Looking at oom_kill.c, I found that the intention not to kill the selected
    process if any of its children/siblings has OOM_DISABLE set is not being
    met.

    Signed-off-by: Ankita Garg
    Acked-by: Nick Piggin
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ankita Garg
     
  • The current NFS client congestion logic is severely broken: it marks the
    backing device congested during each nfs_writepages() call but doesn't
    mirror this in nfs_writepage() which makes for deadlocks. Also it
    implements its own waitqueue.

    Replace this by a more regular congestion implementation that puts a cap on
    the number of active writeback pages and uses the bdi congestion waitqueue.

    Also always use an interruptible wait since it makes sense to be able to
    SIGKILL the process even for mounts without 'intr'.
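
    A hedged sketch of the cap (the counter and threshold names are
    illustrative, taken from the description above):

        if (atomic_long_inc_return(&nfss->writeback) > NFS_CONGESTION_ON_THRESH)
                set_bdi_congested(&nfss->backing_dev_info, WRITE);
        ...
        if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
                clear_bdi_congested(&nfss->backing_dev_info, WRITE);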

    Signed-off-by: Peter Zijlstra
    Acked-by: Trond Myklebust
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This patch fixes a user-triggerable oops that was reported by Leonid
    Ananiev as archived at http://lkml.org/lkml/2007/2/8/337.

    dio writes invalidate clean pages that intersect the written region so that
    subsequent buffered reads go to disk to read the new data. If this fails
    the interface tries to tell the caller that the cache is inconsistent by
    returning EIO.

    Before this patch we had the problem where this invalidation failure would
    clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.
    Both fs/aio.c and bio completion call aio_complete() and we reference freed
    memory, usually oopsing.

    This patch addresses this problem by invalidating before the write so that
    we can cleanly return -EIO before ->direct_IO() has had a chance to return
    -EIOCBQUEUED.
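
    A hedged sketch of the ordering (helpers and offsets illustrative):

        /* invalidate the written range *before* ->direct_IO() runs, so a
           failure can still be reported to the caller as a plain -EIO */
        if (rw == WRITE && mapping->nrpages) {
                retval = invalidate_inode_pages2_range(mapping,
                                offset >> PAGE_CACHE_SHIFT, end_index);
                if (retval)
                        goto out;
        }

        retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);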

    There is a compromise here. During the dio write we can fault in mmap()ed
    pages which intersect the written range with get_user_pages() if the user
    provided them for the source buffer. This is a crazy thing to do, but we
    can make it mostly work in most cases by trying the invalidation again.
    The compromise is that we won't return an error if this second invalidation
    fails if it's an AIO write and we have -EIOCBQUEUED.

    This was tested by having two processes race performing large O_DIRECT and
    buffered ordered writes. Within minutes ext3 would see a race between
    ext3_releasepage() and jbd holding a reference on ordered data buffers and
    would cause invalidation to fail, panicking the box. The test can be found
    in the 'aio_dio_bugs' test group in test.kernel.org/autotest. After this
    patch the test passes.

    Signed-off-by: Zach Brown
    Signed-off-by: Benjamin LaHaise
    Cc: Leonid Ananiev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     
  • madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the
    call covers a region from the start of a vma, and extending past that vma.

    Signed-off-by: Nick Piggin
    Cc: Badari Pulavarty
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Mar, 2007

2 commits

  • Currently we do not check for vma flags if sys_move_pages is called to move
    individual pages. If sys_migrate_pages is called to move pages then we
    check for vm_flags that indicate a non-migratable vma, but that set still
    includes VM_LOCKED, even though we can migrate mlocked pages.

    Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
    into migrate.h so that it can be used from both locations.

    Problem was spotted by Lee Schermerhorn.
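
    A hedged sketch of the helper after the move into migrate.h (the exact
    flag set is illustrative):

        static inline int vma_migratable(struct vm_area_struct *vma)
        {
                /* VM_LOCKED is deliberately absent: mlocked pages can migrate */
                if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP | VM_RESERVED))
                        return 0;
                return 1;
        }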

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • shmem's super_operations were missed from the recent const-ification;
    and simple_fill_super()'s, which can share with get_sb_pseudo()'s.

    Signed-off-by: Hugh Dickins
    Acked-by: Josef 'Jeff' Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Mar, 2007

7 commits

  • Fix invalidate_inode_pages2_range() so that it does not immediately exit
    just because a single page in the specified range could not be removed.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     
  • page_lock_anon_vma() uses spin_lock() to block RCU. This doesn't work with
    PREEMPT_RCU; we have to do rcu_read_lock() explicitly. Otherwise, it is
    theoretically possible that slab returns anon_vma's memory to the system
    before we do spin_unlock(&anon_vma->lock).

    [ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
    yet, and may never be. Regardless, this patch is conceptually the
    right thing to do, even if it doesn't matter at this point. - Linus ]
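
    A minimal sketch of the pattern (lookup spelled out for illustration):

        rcu_read_lock();                  /* explicitly block RCU reclaim */
        mapping = (unsigned long)page->mapping;
        if ((mapping & PAGE_MAPPING_ANON) && page_mapped(page)) {
                anon_vma = (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
                spin_lock(&anon_vma->lock);
        }
        ...
        spin_unlock(&anon_vma->lock);
        rcu_read_unlock();                /* only now may slab free the anon_vma */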

    Signed-off-by: Oleg Nesterov
    Cc: Paul McKenney
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • throttle_vm_writeout() is designed to wait for the dirty levels to subside.
    But if the caller holds IO or FS locks, we might be holding up that writeout.

    So change it to take a single nap to give other devices a chance to clean some
    memory, then return.
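
    A minimal sketch of the new shape (illustrative):

        void throttle_vm_writeout(void)
        {
                ...
                /* one nap to let other devices clean some memory, then return
                   instead of looping until the dirty levels subside */
                congestion_wait(WRITE, HZ/10);
        }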

    Cc: Nick Piggin
    Cc: OGAWA Hirofumi
    Cc: Kumar Gala
    Cc: Pete Zaitcev
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The code is seemingly trying to make sure that rb_next() brings us to
    successively increasing vma entries.

    But the two variables, prev and pend, used to perform these checks, are
    never advanced.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Rename PG_checked to PG_owner_priv_1 to reflect its availability as a
    private flag for use by the owner/allocator of the page. In the case of
    pagecache pages (which might be considered to be owned by the mm),
    filesystems may use the flag.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • shmem_{nopage,mmap} are no longer used in ipc/shm.c

    Signed-off-by: Adrian Bunk
    Cc: "Eric W. Biederman"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

21 Feb, 2007

3 commits

  • The alien cache is a per cpu per node array allocated for every slab on the
    system. Currently we size this array for all nodes that the kernel does
    support. For IA64 this is 1024 nodes. So we allocate an array with 1024
    objects even if we only boot a system with 4 nodes.

    This patch uses "nr_node_ids" to determine the number of possible nodes
    supported by a hardware configuration and only allocates an alien cache
    sized for possible nodes.

    The initialization of nr_node_ids occurred too late relative to the bootstrap
    of the slab allocator and so I moved the setup_nr_node_ids() into
    free_area_init_nodes().
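
    A hedged sketch of the sizing change in the alien-cache allocation
    (illustrative):

        struct array_cache **alien;

        /* size for the nodes this machine can actually have, not MAX_NUMNODES */
        alien = kmalloc(nr_node_ids * sizeof(struct array_cache *), GFP_KERNEL);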

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • highest_possible_node_id() is currently used to calculate the last possible
    node id so that the network subsystem can figure out how to size per-node
    arrays.

    I think having the ability to determine the maximum number of nodes in a
    system at runtime is useful, but then we should name this entry
    correspondingly; it should return the number of node_ids, and the value
    needs to be set up only once at bootup. The node_possible_map does not
    change after bootup.

    This patch introduces nr_node_ids and replaces the use of
    highest_possible_node_id(). nr_node_ids is calculated on bootup when the
    page allocator's pagesets are initialized.
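
    A minimal sketch of the bootup calculation (illustrative):

        int nr_node_ids __read_mostly = MAX_NUMNODES;

        static void __init setup_nr_node_ids(void)
        {
                unsigned int node;
                unsigned int highest = 0;

                for_each_node_mask(node, node_possible_map)
                        highest = node;
                nr_node_ids = highest + 1;
        }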

    [deweerdt@free.fr: fix oops]
    Signed-off-by: Christoph Lameter
    Cc: Neil Brown
    Cc: Trond Myklebust
    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • bind_zonelist() can create a zero-length zonelist if there is a
    memory-less node. This patch checks the length of the zonelist; if it is
    0, -EINVAL is returned.

    Tested on ia64/NUMA with a memory-less node.
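
    A hedged sketch of the check at the end of bind_zonelist() (error
    propagation illustrative):

        if (num == 0) {
                /* every node in the mask is memory-less: no zone was added */
                kfree(zl);
                return ERR_PTR(-EINVAL);
        }
        zl->zones[num] = NULL;
        return zl;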

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Feb, 2007

1 commit

  • When NFSD receives a write request, the data is typically in a number of
    1448 byte segments and writev is used to collect them together.

    Unfortunately, generic_file_buffered_write passes these to the filesystem
    one at a time, so an e.g. 32K over-write becomes a series of partial-page
    writes to each page, causing the filesystem to have to pre-read those pages
    - wasted effort.

    generic_file_buffered_write handles one segment of the vector at a time as
    it has to pre-fault in each segment to avoid deadlocks. When writing from
    kernel-space (as nfsd does) this is not an issue, so
    generic_file_buffered_write does not need to break an iovec from nfsd into
    little pieces.

    This patch avoids the splitting when get_fs is KERNEL_DS as it is
    from NFSd.
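
    A hedged sketch of the idea (variable names illustrative):

        size_t seglen;

        if (segment_eq(get_fs(), KERNEL_DS))
                seglen = iov_length(iov, nr_segs);  /* whole vector in one go */
        else
                seglen = iov->iov_len;              /* one pre-faulted segment */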

    This issue was introduced by commit 6527c2bdf1f833cc18e8f42bd97973d583e4aa83

    Acked-by: Nick Piggin
    Cc: Norman Weathers
    Cc: Vladimir V. Saveliev
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

16 Feb, 2007

3 commits


13 Feb, 2007

4 commits

  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.
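
    A minimal illustration of the change (the structure name is hypothetical):

        /* before */
        static struct inode_operations foo_file_inode_operations = { ... };

        /* after: lives in .rodata; accidental writes now fail to compile */
        static const struct inode_operations foo_file_inode_operations = { ... };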

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Make mincore work for anon mappings, nonlinear, and migration entries.
    Based on a patch from Linus Torvalds.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a NOPFN_REFAULT return code for vm_ops->nopfn() equivalent to
    NOPAGE_REFAULT for vm_ops->nopage(), indicating that the handler requests a
    re-execution of the faulting instruction.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
    functionality by installing their own pte and returning NULL.

    Signed-off-by: Nick Piggin
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Arnd Bergmann
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2007

1 commit