28 Feb, 2009

2 commits

  • I just got this new warning from kmemcheck:

    WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
    a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
    f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
    ^

    Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
    EIP: 0060:[] EFLAGS: 00000286 CPU: 0
    EIP is at __purge_vmap_area_lazy+0x117/0x140
    EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
    ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: 00004000 DR7: 00000000
    [] free_unmap_vmap_area_noflush+0x6e/0x70
    [] remove_vm_area+0x2a/0x70
    [] __vunmap+0x45/0xe0
    [] vunmap+0x1e/0x30
    [] text_poke+0x95/0x150
    [] alternatives_smp_unlock+0x49/0x60
    [] alternative_instructions+0x11b/0x124
    [] check_bugs+0xbd/0xdc
    [] start_kernel+0x2ed/0x360
    [] __init_begin+0x9e/0xa9
    [] 0xffffffff

    It happened here:

    $ addr2line -e vmlinux -i c1096df7
    mm/vmalloc.c:540

    Code:

    list_for_each_entry(va, &valist, purge_list)
            __free_vmap_area(va);

    It's this instruction:

    mov 0x20(%ebx),%edx

    Which corresponds to a dereference of va->purge_list.next:

    (gdb) p ((struct vmap_area *) 0)->purge_list.next
    Cannot access memory at address 0x20

    It seems that we should use "safe" list traversal here, as the element
    is freed inside the loop. Please verify that this is the right fix.
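
    A minimal sketch of the "safe" traversal being suggested (illustrative;
    the actual patch may differ in detail):

    struct vmap_area *va, *n_va;

    /* fetch the next pointer before the current entry is freed */
    list_for_each_entry_safe(va, n_va, &valist, purge_list)
            __free_vmap_area(va);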

    Acked-by: Nick Piggin
    Signed-off-by: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • The new vmap allocator can wrap the address and get confused in the case
    of large allocations or VMALLOC_END near the end of address space.

    Problem reported by Christoph Hellwig on a 32-bit XFS workload.
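
    A hedged sketch of the kind of guard this implies (names are
    illustrative, not the actual patch): with unsigned addresses,
    "addr + size" can wrap past zero or run past VMALLOC_END, so the
    allocator has to check for that explicitly.

    if (addr + size < addr || addr + size > vend)   /* wrapped, or past the end */
            goto overflow;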

    Signed-off-by: Nick Piggin
    Reported-by: Christoph Hellwig
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Feb, 2009

1 commit

  • Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
    400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
    prohibit.

    Commit fc8744adc870a8d4366908221508bb113d8b72ee "Stop playing silly
    games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
    shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
    the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
    call which is expected to pre-account a shared anonymous object.

    It's all clearer if we switch shmem.c over to use VM_NORESERVE
    throughout in place of !VM_ACCOUNT.

    But I very nearly sent in a patch which mistakenly removed the
    accounting from tmpfs files: shmem_get_inode()'s memset was good for not
    setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.

    Rather than setting that by default, then perhaps clearing it again in
    shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
    allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Feb, 2009

4 commits

  • * hibernate:
    PM: Fix suspend_console and resume_console to use only one semaphore
    PM: Wait for console in resume
    PM: Fix pm_notifiers during user mode hibernation
    swsusp: clean up shrink_all_zones()
    swsusp: dont fiddle with swappiness
    PM: fix build for CONFIG_PM unset
    PM/hibernate: fix "swap breaks after hibernation failures"
    PM/resume: wait for device probing to finish
    Consolidate driver_probe_done() loops into one place

    Linus Torvalds
     
  • Move local variables to innermost possible scopes and use local
    variables to cache calculations/reads done more than once.

    No change in functionality (intended).

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Acked-by: Pavel Machek
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • sc.swappiness is not used in the swsusp memory shrinking path, do not
    set it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • http://bugzilla.kernel.org/show_bug.cgi?id=12239

    The image writing code dropped a reference to the current swap device.
    This doesn't show up if the hibernation succeeds - because it doesn't
    affect the image which gets resumed. But it means multiple _failed_
    hibernations end up freeing the swap device while it is still in use!

    swsusp_write() finds the block device for the swap file using swap_type_of().
    It then uses blkdev_get() / blkdev_put() to open and close the block device.

    Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
    passed to it. So blkdev_put() calls iput() on the inode. This is by design
    and other callers expect this behaviour. The fix is for swap_type_of() to take
    a reference on the inode using bdget().
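
    A sketch of the idea (the surrounding code and parameter name are
    illustrative, not the literal patch):

    if (bdev_p)
            *bdev_p = bdget(sis->bdev->bd_dev);     /* take an inode reference,
                                                       balanced by blkdev_put()'s iput() */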

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Alan Jenkins
     

21 Feb, 2009

2 commits

  • Impact: proper vcache flush on unmap_kernel_range()

    flush_cache_vunmap() should be called before pages are unmapped. Add
    a call to it in unmap_kernel_range().
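
    Assumed shape of the resulting ordering (a sketch; helper names are
    from the existing vmalloc code):

    void unmap_kernel_range(unsigned long addr, unsigned long size)
    {
            unsigned long end = addr + size;

            flush_cache_vunmap(addr, end);          /* flush the virtual cache first */
            vunmap_page_range(addr, end);           /* then tear down the page tables */
            flush_tlb_kernel_range(addr, end);
    }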

    Signed-off-by: Tejun Heo
    Acked-by: Nick Piggin
    Acked-by: David S. Miller
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • kzfree() is a wrapper for kfree() that additionally zeroes the underlying
    memory before releasing it to the slab allocator.

    Currently there is code which memset()s the memory region of an object
    before releasing it back to the slab allocator to make sure
    security-sensitive data are really zeroed out after use.

    These callsites can then just use kzfree() which saves some code, makes
    users greppable and allows for a stupid destructor that isn't necessarily
    aware of the actual object size.
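
    A simplified sketch of what kzfree() does (the real implementation may
    differ in detail):

    void kzfree(const void *p)
    {
            size_t ks;
            void *mem = (void *)p;

            if (unlikely(ZERO_OR_NULL_PTR(mem)))
                    return;
            ks = ksize(mem);                /* actual allocated size */
            memset(mem, 0, ks);             /* scrub before handing back */
            kfree(mem);
    }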

    Signed-off-by: Johannes Weiner
    Reviewed-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

19 Feb, 2009

5 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
    block: fix booting from partitioned md array
    block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    cciss: PCI power management reset for kexec
    paride/pg.c: xs(): &&/|| confusion
    fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
    block: fix bad definition of BIO_RW_SYNC
    bsg: Fix sense buffer bug in SG_IO

    Linus Torvalds
     
    Now, early_pfn_in_nid(PFN, NID) may return false if the PFN is a hole,
    and then memmap initialization is not done for it. This was causing
    trouble for sparc boot.

    To fix this, the PFN should be initialized and marked as PG_reserved.
    This patch changes early_pfn_in_nid() to return true if the PFN is a hole.
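
    A conceptual sketch of the new behaviour (the helper name below is
    hypothetical, used only for illustration):

    static inline int early_pfn_in_nid(unsigned long pfn, int node)
    {
            int nid = lookup_early_nid(pfn);  /* hypothetical helper; < 0 for holes */

            if (nid >= 0 && nid != node)
                    return 0;
            return 1;       /* a hole, or our node: initialise and mark reserved */
    }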

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: David Miller
    Tested-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • What's happening is that the assertion in mm/page_alloc.c:move_freepages()
    is triggering:

    BUG_ON(page_zone(start_page) != page_zone(end_page));

    Once I knew this is what was happening, I added some annotations:

    if (unlikely(page_zone(start_page) != page_zone(end_page))) {
            printk(KERN_ERR "move_freepages: Bogus zones: "
                   "start_page[%p] end_page[%p] zone[%p]\n",
                   start_page, end_page, zone);
            printk(KERN_ERR "move_freepages: "
                   "start_zone[%p] end_zone[%p]\n",
                   page_zone(start_page), page_zone(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_pfn[0x%lx] end_pfn[0x%lx]\n",
                   page_to_pfn(start_page), page_to_pfn(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_nid[%d] end_nid[%d]\n",
                   page_to_nid(start_page), page_to_nid(end_page));
    ...

    And here's what I got:

    move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
    move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
    move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
    move_freepages: start_nid[1] end_nid[0]

    My memory layout on this box is:

    [ 0.000000] Zone PFN ranges:
    [ 0.000000] Normal 0x00000000 -> 0x0081ff5d
    [ 0.000000] Movable zone start PFN for each node
    [ 0.000000] early_node_map[8] active PFN ranges
    [ 0.000000] 0: 0x00000000 -> 0x00020000
    [ 0.000000] 1: 0x00800000 -> 0x0081f7ff
    [ 0.000000] 1: 0x0081f800 -> 0x0081fe50
    [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
    [ 0.000000] 1: 0x0081feda -> 0x0081fedb
    [ 0.000000] 1: 0x0081fedd -> 0x0081fee5
    [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
    [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d

    So it's a block move in that 0x81f600-->0x81f7ff region which triggers
    the problem.

    This patch:

    The declaration of early_pfn_to_nid() is scattered over per-arch include
    files, and it is hard to tell which declaration is actually in use.
    That makes the memmap-init fix harder than it needs to be.

    This patch moves all declaration to include/linux/mm.h

    After this,
      if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
         -> Use static definition in include/linux/mm.h
      else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
         -> Use generic definition in mm/page_alloc.c
      else
         -> per-arch back end function will be called.

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: KOSAKI Motohiro
    Reported-by: David Miller
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.
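
    Illustrative placement (a sketch; the exact set of statistics updated
    alongside it may differ):

    if (mapping_cap_account_dirty(mapping)) {
            __inc_zone_page_state(page, NR_FILE_DIRTY);
            task_dirty_inc(current);        /* moved next to the dirty stats */
            task_io_account_write(PAGE_CACHE_SIZE);
    }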

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
    We have get_vm_area_caller() and __get_vm_area(), but not
    __get_vm_area_caller().

    On powerpc, I use __get_vm_area() to separate the ranges of addresses
    given to vmalloc vs. ioremap (various good reasons for that) so in order
    to be able to implement the new caller tracking in /proc/vmallocinfo, I
    need a "_caller" variant of it.

    (akpm: needed for ongoing powerpc development, so merge it early)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

18 Feb, 2009

2 commits

  • We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
    and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
    213d9417fec62ef4c3675621b9364a667954d4dd.
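
    Why OR-ing shift values is broken, in miniature (the bit values below
    are made up for illustration):

    #define BIO_RW_SYNCIO   4       /* illustrative bit numbers */
    #define BIO_RW_UNPLUG   5

    /* wrong: 1 << (BIO_RW_SYNCIO | BIO_RW_UNPLUG) == 1 << 5, only one bit set */
    /* right: (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG) sets both bits       */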

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, vm86: fix preemption bug
    x86, olpc: fix model detection without OFW
    x86, hpet: fix for LS21 + HPET = boot hang
    x86: CPA avoid repeated lazy mmu flush
    x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
    x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
    x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
    x86/cpa: make sure cpa is safe to call in lazy mmu mode
    x86, ptrace, mm: fix double-free on race

    Linus Torvalds
     

13 Feb, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    mm: Export symbol ksize()

    Linus Torvalds
     
  • A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix"). The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.

    But the !done condition was dropped from the test. This means that any
    time the page writeout loop breaks (eg. due to nr_to_write == 0), we
    will set index to 0, then goto again. This will set done_index to
    index, then find done is set, so will proceed to the end of the
    function. When updating mapping->writeback_index for cyclic writeout,
    we now use done_index == 0, so we're always cycling back to 0.
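
    A conceptual rendering of the dropped check (illustrative, not the
    literal kernel code; the label name is made up):

    if (wbc->range_cyclic && !done) {   /* the "!done" part is what got dropped */
            /* no more dirty pages ahead: wrap to the start of the file */
            index = 0;
            goto retry;
    }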

    This seemed to be causing random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU; writeout would slow down, and
    this caused bugzilla entry

    http://bugzilla.kernel.org/show_bug.cgi?id=12604

    about Berkeley DB slowing down dramatically.

    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).

    Signed-off-by: Nick Piggin
    Reported-and-tested-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2009

8 commits

  • Commit 7b2cd92adc5430b0c1adeb120971852b4ea1ab08 ("crypto: api - Fix
    zeroing on free") added modular user of ksize(). Export that to fix
    crypto.ko compilation.
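
    The change is essentially a one-liner next to ksize()'s definition in
    the slab allocators:

    EXPORT_SYMBOL(ksize);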

    Cc: Herbert Xu
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Pekka Enberg

    Kirill A. Shutemov
     
  • Christophe Saout reported [in precursor to:
    http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:

    > Note that I also saw a different issue with CONFIG_UNEVICTABLE_LRU.
    > Seems like Xen tears down current->mm early on process termination, so
    > that __get_user_pages in exit_mmap causes nasty messages when the
    > process had any mlocked pages. (in fact, it somehow manages to get into
    > the swapping code and produces a null pointer dereference trying to get
    > a swap token)

    Jeremy explained:

    Yes. In the normal case under Xen, an in-use pagetable is "pinned",
    meaning that it is RO to the kernel, and all updates must go via hypercall
    (or writes are trapped and emulated, which is much the same thing). An
    unpinned pagetable is not currently in use by any process, and can be
    directly accessed as normal RW pages.

    As an optimisation at process exit time, we unpin the pagetable as early
    as possible (switching the process to init_mm), so that all the normal
    pagetable teardown can happen with direct memory accesses.

    This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
    a few lines below. The obvious thing to do would be to move
    arch_exit_mmap() to below the munlock code, but I think we'd want to
    call it even if mm->mmap is NULL, just to be on the safe side.

    Thus, this patch:

    exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
    as the latter may switch the current mm to init_mm, which would cause the
    former to fail.
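
    A sketch of the reordering (simplified from the description above):

    if (mm->locked_vm) {
            struct vm_area_struct *vma = mm->mmap;

            while (vma) {
                    if (vma->vm_flags & VM_LOCKED)
                            munlock_vma_pages_all(vma);
                    vma = vma->vm_next;
            }
    }
    arch_exit_mmap(mm);     /* may switch to init_mm; the vmas are unlocked now */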

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Lee Schermerhorn
    Cc: Christophe Saout
    Cc: Keir Fraser
    Cc: Christophe Saout
    Cc: Alex Williamson
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
    Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix
    nr_to_write counter") fixed the nr_to_write counter but didn't set the
    break condition properly.

    If nr_to_write == 0 after being decremented it will loop one more time
    before setting done = 1 and breaking the loop.
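
    Illustrative shape of the corrected loop exit (a sketch, not the exact
    diff):

    if (wbc->sync_mode == WB_SYNC_NONE) {
            wbc->nr_to_write--;
            if (wbc->nr_to_write <= 0) {
                    done = 1;
                    break;          /* stop now rather than one page later */
            }
    }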

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Artem Bityutskiy
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Federico Cuello
     
    page_cgroup's page allocation at init/memory hotplug uses kmalloc() and
    vmalloc(). If kmalloc() fails, vmalloc() is used.

    This is because vmalloc() is a very limited resource on 32-bit systems.
    We want to use kmalloc() first.

    But in this kind of call, __GFP_NOWARN should be specified.
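
    The intended call pattern, roughly (a sketch; "base" and "table_size"
    are illustrative names):

    base = kmalloc(table_size, GFP_KERNEL | __GFP_NOWARN);  /* try slab quietly */
    if (!base)
            base = vmalloc(table_size);                      /* expected fallback */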

    Reported-by: Heiko Carstens
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    When I tested the following program, I found that the mlocked counter
    is strange: it cannot free some mlocked pages.

    This is because try_to_unmap_file() doesn't check the real
    page mappings in the vmas.

    That is because the goal of an address_space for a file is to find all
    processes into which the file's specific interval is mapped. It is
    related to the file's interval, not to pages.

    Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
    since the vma has VM_LOCKED, then calls try_to_mlock_page. After this the
    mlocked counter is increased again.

    A COWed anon page in a file-backed vma could be such a case. This patch
    resolves it.

    -- my test program --

    #include <sys/mman.h>

    int main(void)
    {
            mlockall(MCL_CURRENT);
            return 0;
    }

    -- before --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 0 kB
    Mlocked: 0 kB

    -- after --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 8 kB
    Mlocked: 8 kB

    Signed-off-by: MinChan Kim
    Acked-by: Lee Schermerhorn
    Acked-by: KOSAKI Motohiro
    Tested-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
    We need to pass an unsigned long as the minimum, because it gets cast
    to an unsigned long in the sysctl handler. If we pass an int, we'll
    access four more bytes on 64bit arches, resulting in a random minimum
    value.
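
    Illustrative fix (a sketch; the variable name is an assumption, the
    point is the type): the minimum handed to the sysctl table's .extra1
    must have the same width as the value the handler reads.

    static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;   /* unsigned long, not int */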

    [rientjes@google.com: fix type of `old_bytes']
    Signed-off-by: Sven Wegener
    Cc: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sven Wegener
     
    migrate_vmas() should check "vma", not "vma->vm_next", for the for-loop condition.
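
    The corrected loop condition, roughly (a sketch):

    for (vma = mm->mmap; vma; vma = vma->vm_next)   /* test vma, not vma->vm_next */
            ...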

    Signed-off-by: Daisuke Nishimura
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and to account for filesystem quota. If either fails, the fault
    fails. The impact is that quota is getting accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To
    help prevent this mistake happening again, it improves the documentation
    of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Feb, 2009

2 commits

  • Ptrace_detach() races with __ptrace_unlink() if the traced task is
    reaped while detaching. This might cause a double-free of the BTS
    buffer.

    Change the ptrace_detach() path to only do the memory accounting in
    ptrace_bts_detach() and leave the buffer freeing to ptrace_bts_untrace(),
    which will be called from __ptrace_unlink().

    The fix follows a proposal from Oleg Nesterov.

    Reported-by: Oleg Nesterov
    Signed-off-by: Markus Metzger
    Signed-off-by: Ingo Molnar

    Markus Metzger
     
  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it would be too easy for applications
    to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs: it can trigger an
    OOM storm when hugepage pools are too small, and otherwise leads to lockups
    and corrupted counters. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE, but prevents VM_ACCOUNT from being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Feb, 2009

1 commit

  • Commit 27421e211a39784694b597dbf35848b88363c248, Manually revert
    "mlock: downgrade mmap sem while populating mlocked regions", has
    introduced its own regression: __mlock_vma_pages_range() may report
    an error (for example, -EFAULT from trying to lock down pages from
    beyond EOF), but mlock_vma_pages_range() must hide that from its
    callers as before.
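
    Conceptually (a sketch of the error-hiding, not the exact diff):

    ret = __mlock_vma_pages_range(vma, start, end);
    if (ret < 0)
            ret = 0;        /* hide e.g. -EFAULT from callers, as before */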

    Reported-by: Sami Farin
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Feb, 2009

1 commit

  • Fix do_wp_page for VM_MIXEDMAP mappings.

    In the case where pfn_valid returns 0 for a pfn at the beginning of
    do_wp_page and the mapping is not shared writable, the code branches to
    label `gotten:' with old_page == NULL.

    In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
    clear_page_mlock, and unlock_page try to access the old_page.

    This patch checks whether old_page is valid before it is dereferenced.

    The regression was introduced by "mlock: mlocked pages are unevictable"
    (commit b291f000393f5a0b679012b39d79fbc85c018233).
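
    A sketch of the guard described (illustrative):

    if ((vma->vm_flags & VM_LOCKED) && old_page) {  /* old_page may be NULL here */
            lock_page(old_page);
            clear_page_mlock(old_page);
            unlock_page(old_page);
    }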

    Signed-off-by: Carsten Otte
    Cc: Nick Piggin
    Cc: Heiko Carstens
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carsten Otte
     

04 Feb, 2009

1 commit

  • Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some
    @wbc->nr_to_write breakage.

    It made the following changes:
    1. Decrement wbc->nr_to_write instead of nr_to_write
    2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
    3. If synced nr_to_write pages, stop only if wbc->sync_mode ==
    WB_SYNC_NONE, otherwise keep going.

    However, according to the commit message, the intention was to only make
    change 3. Change 1 is a bug. Change 2 does not seem to be necessary,
    and it breaks UBIFS expectations, so if needed, it should be done
    separately later. And change 2 does not seem to be documented in the
    commit message.

    This patch does the following:
    1. Undo changes 1 and 2
    2. Add a comment explaining change 3 (it is very useful to have comments
    in _code_, not only in the commit).

    Signed-off-by: Artem Bityutskiy
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     

03 Feb, 2009

1 commit


02 Feb, 2009

1 commit

  • This essentially reverts commit 8edb08caf68184fb170f4f69c7445929e199eaea.

    It downgraded our mmap semaphore to a read-lock while mlocking pages, in
    order to allow other threads (and external accesses like "ps" et al) to
    walk the vma lists and take page faults etc. Which is a nice idea, but
    the implementation does not work.

    Because we cannot upgrade the lock back to a write lock without
    releasing the mmap semaphore, the code had to release the lock entirely
    and then re-take it as a writelock. However, that meant that the caller
    possibly lost the vma chain that it was following, since now another
    thread could come in and mmap/munmap the range.

    The code tried to work around that by just looking up the vma again and
    erroring out if that happened, but quite frankly, that was just a buggy
    hack that doesn't actually protect against anything (the other thread
    could just have replaced the vma with another one instead of totally
    unmapping it).

    The only way to downgrade to a read map _reliably_ is to do it at the
    end, which is likely the right thing to do: do all the 'vma' operations
    with the write-lock held, then downgrade to a read after completing them
    all, and then do the "populate the newly mlocked regions" while holding
    just the read lock. And then just drop the read-lock and return to user
    space.

    The (perhaps somewhat simpler) alternative is to just make all the
    callers of mlock_vma_pages_range() know that the mmap lock got dropped,
    and just re-grab the mmap semaphore if it needs to mlock more than one
    vma region.

    So we can do this "downgrade mmap sem while populating mlocked regions"
    thing right, but the way it was done here was absolutely not correct.
    Thus the revert, in the expectation that we will do it all correctly
    some day.

    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2009

1 commit

  • The mmap_region() code would temporarily set the VM_ACCOUNT flag for
    anonymous shared mappings just to inform shmem_zero_setup() that it
    should enable accounting for the resulting shm object. It would then
    clear the flag after calling ->mmap (for the /dev/zero case) or doing
    shmem_zero_setup() (for the MAP_ANON case).

    This just resulted in vma merge issues, but also made for
    unnecessary confusion. Use the already-existing VM_NORESERVE flag for
    this instead, and let shmem_{zero|file}_setup() just figure it out from
    that.

    This also happens to make it obvious that the new DRI2 GEM layer uses a
    non-reserving backing store for its object allocation - which is quite
    possibly not intentional. But since I didn't want to change semantics
    in this patch, I left it alone, and just updated the caller to use the
    new flag semantics.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

31 Jan, 2009

1 commit

  • Commit de33c8db5910cda599899dd431cc30d7c1018cbf ("Fix OOPS in
    mmap_region() when merging adjacent VM_LOCKED file segments") unified
    the vma merging of anonymous and file maps to just one place, which
    simplified the code and fixed a use-after-free bug that could cause an
    oops.

    But by doing the merge opportunistically before even having called
    ->mmap() on the file method, it now compares two different 'vm_flags'
    values: the pre-mmap() value of the new not-yet-formed vma, and previous
    mappings of the same file around it.

    And in doing so, it refused to merge the common file case, which adds a
    marker to say "I can be made non-linear".

    This fixes it by just adding a set of flags that don't have to match,
    because we know they are ok to merge. Currently it's only that single
    VM_CAN_NONLINEAR flag, but at least conceptually there could be others
    in the future.
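
    Conceptually, the mergeability test masks out the flags that are
    allowed to differ (a sketch; today that is only VM_CAN_NONLINEAR):

    /* ignore flag bits that ->mmap() may legitimately set later */
    if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
            return 0;       /* flags differ in a way that prevents merging */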

    Reported-and-acked-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Jan, 2009

4 commits

    N_POSSIBLE doesn't mean there is memory... and force_empty can
    visit an invalid node which has no pgdat.

    To visit all valid nodes, N_HIGH_MEMORY should be used.
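
    Illustrative iteration over nodes that actually have memory (a sketch):

    int node;

    for_each_node_state(node, N_HIGH_MEMORY) {
            /* node is guaranteed to have memory (and thus a pgdat) here */
    }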

    Reported-by: Li Zefan
    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
    Now, at swapoff, even when try_charge() fails, the commit is executed. This
    is a bug which turns the refcnt of cgroup_subsys_state negative.

    Reported-by: Li Zefan
    Tested-by: Li Zefan
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The lifetime of struct cgroup and struct mem_cgroup is different and
    mem_cgroup has its own reference count for handling references from
    swap_cgroup.

    This causes the strange problem that the parent mem_cgroup can die while a
    child mem_cgroup is still alive, and this causes a bug in the
    use_hierarchy==1 case because res_counter_uncharge climbs up the tree.

    This patch avoids it by getting the parent at creation and putting it
    at freeing.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    As of commit ba470de43188cdbff795b5da43a1474523c6c2fb ("mmap: handle
    mlocked pages during map, remap, unmap") we now use the 'vma' variable
    at the end of mmap_region() to handle the page-in of newly mapped
    mlocked pages.

    However, if we merged adjacent vma's together, the vma we're using may
    be stale. We historically consciously avoided using it after the merge
    operation, but that got overlooked when redoing the locked page
    handling.

    This commit simplifies mmap_region() by doing any vma merges early,
    avoiding the issue entirely, and 'vma' will always be valid. As pointed
    out by Hugh Dickins, this depends on any drivers that change the page
    offset or flags to have set one of the VM_SPECIAL bits (so that they
    cannot trigger the early merge logic), but that's true in general.

    Reported-and-tested-by: Maksim Yevmenkin
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Jan, 2009

1 commit

    The per cpu array of kmem_cache_cpu structures accommodates
    NR_KMEM_CACHE_CPU such structs.

    When this array overflows and a struct is allocated by kmalloc(), it may
    have an address at the upper bound of this array. If this happens, it
    does not get freed and the per cpu kmem_cache_cpu_free pointer will be out
    of bounds after kmem_cache_destroy() or cpu offlining.
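
    Sketch of the bounds test the message implies (illustrative, not the
    exact diff):

    /* Free only structs lying outside the static per cpu array; note the
     * exclusive upper bound so an address exactly at the end of the array
     * is treated as outside it (i.e. it was kmalloc()ed). */
    if (c < per_cpu(kmem_cache_cpu, cpu) ||
        c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU)
            kfree(c);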

    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes