16 Nov, 2007
2 commits
-
This code harks back to the days when we didn't count dirty mapped
pages, which led us to try to balance the number of dirty unmapped pages
by how much unmapped memory there was in the system.

That makes no sense any more, since now the dirty counts include the
mapped pages. Not to mention that the math doesn't work with HIGHMEM
machines anyway, and causes the unmapped_ratio to potentially turn
negative (which we do catch thanks to clamping it at a minimum value,
but I mention that as an indication of how broken the code is).

The code also was written at a time when the default dirty ratio was
much larger, and the unmapped_ratio logic effectively capped that large
dirty ratio a bit. Again, we've since lowered the dirty ratio rather
aggressively, further lessening the point of that code.

Acked-by: Peter Zijlstra
Signed-off-by: Linus Torvalds
-
Previously, it would be possible for prev->next to point to
&free_slob_pages, and thus we would try to move a list onto itself, and
bad things would happen.

It seems a bit hairy to be doing list operations with the list marker as
an entry, rather than a head, but... this resolves the following crash:
http://bugzilla.kernel.org/show_bug.cgi?id=9379
Signed-off-by: Nick Piggin
Signed-off-by: Ingo Molnar
Acked-by: Matt Mackall
Signed-off-by: Linus Torvalds
15 Nov, 2007
16 commits
-
The delay incurred in lock_page() should also be accounted in swap delay
accounting.

Reported-by: Nick Piggin
Signed-off-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
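
A minimal sketch of the idea (not the verbatim patch; surrounding code
elided):

    /* in do_swap_page(), sketch only: */
    delayacct_set_flag(DELAYACCT_PF_SWAPIN);
    page = lookup_swap_cache(entry);
    /* ... read the page in from swap if it was not cached ... */
    lock_page(page);    /* this wait is now charged to swap delay too */
    delayacct_clear_flag(DELAYACCT_PF_SWAPIN);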
-
Mark start_cpu_timer() as __cpuinit instead of __devinit.
Fixes this section warning:

WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show')
Signed-off-by: Randy Dunlap
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
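
The change itself is just the section annotation on the definition,
roughly:

    static void __cpuinit start_cpu_timer(int cpu)

so the function lands in the same section class as its CPU-hotplug
caller instead of a discarded init section.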
Commit ef8b4520bd9f8294ffce9abd6158085bde5dc902 added one NULL check for
"p" in krealloc(), but that doesn't seem to be enough since there
doesn't seem to be any guarantee that memcpy(ret, NULL, 0) works
(spotted by the Coverity checker).

To make it clearer what happens, this patch also removes the pointless
min().

Signed-off-by: Adrian Bunk
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
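
A minimal sketch of the resulting logic (not the verbatim patch;
2.6.24-era slab interfaces assumed):

    void *krealloc(const void *p, size_t new_size, gfp_t flags)
    {
            void *ret;
            size_t ks = 0;

            if (unlikely(!new_size)) {
                    kfree(p);
                    return ZERO_SIZE_PTR;
            }

            if (p)
                    ks = ksize(p);
            if (ks >= new_size)
                    return (void *)p;

            ret = kmalloc(new_size, flags);
            if (ret && p) {
                    /* ks < new_size here, so no min() is needed,
                     * and a NULL p never reaches memcpy() */
                    memcpy(ret, p, ks);
                    kfree(p);
            }
            return ret;
    }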
-
For administrative purposes, we want to query the actual block usage of a
hugetlbfs file via fstat. Currently, hugetlbfs always returns 0. Fix that
up, since the kernel already has all the information to track it properly.

Signed-off-by: Ken Chen
Acked-by: Adam Litke
Cc: Badari Pulavarty
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
return_unused_surplus_pages() can become static.
Signed-off-by: Adrian Bunk
Acked-by: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
to guarantee no problems will occur later when instantiating pages. If quotas
are in force, page instantiation could fail due to a race with another process
or an oversized (but approved) shared mapping.

To prevent these scenarios, debit the quota for the full reservation amount up
front and credit the unused quota when the reservation is released.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
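
A rough sketch of the reservation-time debit (helper names as in that
era's mm/hugetlb.c; simplified):

    int hugetlb_reserve_pages(struct inode *inode, long from, long to)
    {
            long ret, chg;

            chg = region_chg(&inode->i_mapping->private_list, from, to);
            if (chg < 0)
                    return chg;
            /* debit quota for the whole reservation up front */
            if (hugetlb_get_quota(inode->i_mapping, chg))
                    return -ENOSPC;
            ret = hugetlb_acct_memory(chg);
            if (ret < 0) {
                    hugetlb_put_quota(inode->i_mapping, chg);
                    return ret;
            }
            region_add(&inode->i_mapping->private_list, from, to);
            return 0;
    }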
-
Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
allow bulk updating of the sbinfo->free_blocks counter. This will be used by
the next patch in the series.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
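
The new interface looks roughly like this (a sketch based on the
description; the sbinfo bookkeeping is simplified):

    int hugetlb_get_quota(struct address_space *mapping, long delta)
    {
            int ret = 0;
            struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

            if (sbinfo->free_blocks > -1) {
                    spin_lock(&sbinfo->stat_lock);
                    if (sbinfo->free_blocks - delta >= 0)
                            sbinfo->free_blocks -= delta;
                    else
                            ret = -ENOMEM;
                    spin_unlock(&sbinfo->stat_lock);
            }
            return ret;
    }

    void hugetlb_put_quota(struct address_space *mapping, long delta)
    {
            struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

            if (sbinfo->free_blocks > -1) {
                    spin_lock(&sbinfo->stat_lock);
                    sbinfo->free_blocks += delta;
                    spin_unlock(&sbinfo->stat_lock);
            }
    }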
-
Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
seem out of place. The alloc/free API is unbalanced because we handle the
hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
Move the get inside alloc_huge_page to clean up this disparity.

This patch has been kept apart from the previous patch because of the somewhat
dodgy ERR_PTR() use herein. Moving the quota logic means that
alloc_huge_page() has two failure modes. Quota failure must result in a
SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR()
doesn't like the small positive errnos we have in VM_FAULT_* so they must be
negated before they are used.

Does anyone take issue with the way I am using PTR_ERR? If so, what are your
thoughts on how to clean this up (without needing an if, else if, else block at
each alloc_huge_page() callsite)?

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
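
The pattern being described, roughly (a sketch, not the verbatim patch):

    /* in alloc_huge_page(), the two failure modes: */
    if (hugetlb_get_quota(inode->i_mapping, 1))
            return ERR_PTR(-VM_FAULT_SIGBUS);
    page = dequeue_huge_page(vma, addr);    /* normal path, simplified */
    if (!page) {
            hugetlb_put_quota(inode->i_mapping, 1);
            return ERR_PTR(-VM_FAULT_OOM);
    }
    return page;

    /* at a callsite, recover the positive VM_FAULT_* value: */
    page = alloc_huge_page(vma, address);
    if (IS_ERR(page))
            return -PTR_ERR(page);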
-
The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
mappings when that support was added. Currently, quota is debited at page
instantiation and credited at file truncation. This approach works correctly
for shared pages but is incomplete for private pages. In addition to
hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
this function does not respect quotas.

Private huge pages are treated very much like normal, anonymous pages. They
are not "backed" by the hugetlbfs file and are not stored in the mapping's
radix tree. This means that private pages are invisible to
truncate_hugepages() so that function will not credit the quota.

This patch (based on a prototype provided by Ken Chen) moves quota crediting
for all pages into free_huge_page(). page->private is used to store a pointer
to the mapping to which this page belongs. This is used to credit quota on
the appropriate hugetlbfs instance.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
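
A sketch of the crediting path (simplified; the pool-management details
are elided):

    static void free_huge_page(struct page *page)
    {
            struct address_space *mapping;

            mapping = (struct address_space *) page_private(page);
            set_page_private(page, 0);
            /* ... return the page to the huge page pool ... */
            if (mapping)
                    hugetlb_put_quota(mapping, 1);
    }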
-
Hugetlbfs implements a quota system which can limit the amount of memory that
can be used by the filesystem. Before allocating a new huge page for a file,
the quota is checked and debited. The quota is then credited when truncating
the file. I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
mappings. Before detailing the problems and my proposed solutions, we should
agree on a definition of quotas that properly addresses both private and
shared pages. Since the purpose of quotas is to limit total memory
consumption on a per-filesystem basis, I argue that all pages allocated by the
fs (private and shared) should be charged against quota.

Private Mappings
================

The current code will debit quota for private pages sometimes, but will never
credit it. At a minimum, this causes a leak in the quota accounting which
renders the accounting essentially useless as it is. Shared pages have a one
to one mapping with a hugetlbfs file and are easy to account by debiting on
allocation and crediting on truncate. Private pages are anonymous in nature
and have a many to one relationship with their hugetlbfs files (due to copy on
write). Because private pages are not indexed by the mapping's radix tree,
their quota cannot be credited at file truncation time. Crediting must be
done when the page is unmapped and freed.

Shared Pages
============

I discovered an issue concerning the interaction between the MAP_SHARED
reservation system and quotas. Since quota is not checked until page
instantiation, an over-quota mmap/reservation will initially succeed. When
instantiating the first over-quota page, the program will receive SIGBUS.
This is inconsistent since the reservation is supposed to be a guarantee. The
solution is to debit the full amount of quota at reservation time and credit
the unused portion when the reservation is released.

This patch series brings quotas back in line by making the following
modifications:
* Private pages
- Debit quota in alloc_huge_page()
- Credit quota in free_huge_page()
* Shared pages
- Debit quota for entire reservation at mmap time
- Credit quota for instantiated pages in free_huge_page()
- Credit quota for unused reservation at munmap time

This patch:
The shared page reservation and dynamic pool resizing features have made the
allocation of private vs. shared huge pages quite different. By splitting
out the private/shared-specific portions of the process into their own
functions, readability is greatly improved. alloc_huge_page now calls the
proper helper and performs common operations.

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
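
The shape of the split, stripped of the common bookkeeping (sketch only):

    static struct page *alloc_huge_page(struct vm_area_struct *vma,
                                        unsigned long addr)
    {
            if (vma->vm_flags & VM_MAYSHARE)
                    return alloc_huge_page_shared(vma, addr);
            else
                    return alloc_huge_page_private(vma, addr);
    }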
-
When calling get_user_pages(), a write flag is passed in by the caller to
indicate if write access is required on the faulted-in pages. Currently,
follow_hugetlb_page() ignores this flag and always faults pages for
read-only access. This can cause data corruption because a device driver
that calls get_user_pages() with write set will not expect COW faults to
occur on the returned pages.

This patch passes the write flag down to follow_hugetlb_page() and makes
sure hugetlb_fault() is called with the right write_access parameter.

[ezk@cs.sunysb.edu: build fix]
Signed-off-by: Adam Litke
Reviewed-by: Ken Chen
Cc: David Gibson
Cc: William Lee Irwin III
Cc: Badari Pulavarty
Signed-off-by: Erez Zadok
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
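
The crux of the fix, roughly (a sketch, not the verbatim patch):

    /* in follow_hugetlb_page(), which now takes an 'int write' argument:
     * fault with the caller's write intent so any COW happens up front */
    if (!pte || pte_none(*pte) || (write && !pte_write(*pte)))
            ret = hugetlb_fault(mm, vma, vaddr, write);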
-
i386 and x86-64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.
But ia64 registers it as IORESOURCE_MEM only.
In addition, the memory hotplug code registers new memory as IORESOURCE_MEM too.

This difference causes memory unplug to fail on x86-64. This patch
fixes it.

This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by a PCI
device.

Signed-off-by: Yasunori Goto
Signed-off-by: Badari Pulavarty
Cc: Luck, Tony"
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We allow violation of bdi limits if there is a lot of room on the system.
Once we hit half the total limit we start enforcing bdi limits and bdi
ramp-up should happen. Doing it this way avoids many small writeouts on an
otherwise idle system and should also speed up the ramp-up.

Signed-off-by: Peter Zijlstra
Reviewed-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
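
The heart of the change, roughly (a sketch of the balance_dirty_pages()
hunk; treat it as illustrative):

    /* bail out of throttling while the system as a whole is still
     * under half of its total dirty threshold */
    if (nr_reclaimable + global_page_state(NR_WRITEBACK) <
                    (background_thresh + dirty_thresh) / 2)
            break;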
-
We should unset the migrate type "ISOLATE" when we have successfully removed
memory, but the current code has a bug and cannot work well.

This patch also includes a bugfix to change get_pageblock_flags to
get_pageblock_migratetype().

Thanks to Badari Pulavarty for finding this.
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Badari Pulavarty
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For
anon-regions, we just fail to migrate any pages beyond the 1st vma in the
range.

This occurs because do_mbind() collects a list of pages to migrate by
calling check_range(). check_range() walks the task's mm, spanning vmas as
necessary, to collect the migratable pages into a list. Then, do_mbind()
calls migrate_pages() passing the list of pages, a function to allocate new
pages based on vma policy [new_vma_page()], and a pointer to the first vma
of the range.

For each page in the list, new_vma_page() calls page_address_in_vma()
passing the page and the vma [first in range] to obtain the address to pass
to alloc_page_vma(). The page address is needed to get interleaving
policy correct. If the pages in the list come from multiple vmas,
eventually, new_vma_page() will pass that page to page_address_in_vma()
with the incorrect vma. For !PageAnon pages, this will result in a bug
check in rmap.c:vma_address(). For anon pages, vma_address() will just
return EFAULT and fail the migration.

This patch modifies new_vma_page() to check the return value from
page_address_in_vma(). If the return value is EFAULT, new_vma_page()
searches forward via vm_next for the vma that maps the page--i.e., that does
not return EFAULT. This assumes that the pages in the list handed to
migrate_pages() are in address order. This is currently the case. The patch
documents this assumption in a new comment block for new_vma_page().

If new_vma_page() cannot locate the vma mapping the page in a forward
search in the mm, it will pass a NULL vma to alloc_page_vma(). This will
result in the allocation using the task policy, if any, else system default
policy. This situation is unlikely, but the patch documents this behavior
with a comment.

Note, this patch results in restarting from the first vma in a multi-vma
range each time new_vma_page() is called. If this is not acceptable, we
can make the vma argument a pointer, both in new_vma_page() and its caller
unmap_and_move() so that the value held by the loop in migrate_pages()
always passes down the last vma in which a page was found. This will
require changes to all new_page_t functions passed to migrate_pages(). Is
this necessary?

For this patch to work, we can't bug check in vma_address() for pages
outside the argument vma. This patch removes the BUG_ON(). All other
callers [besides new_vma_page()] already check the return status.

Tested on x86_64, 4 node NUMA platform.
Signed-off-by: Lee Schermerhorn
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
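
A sketch of the resulting helper (simplified; gfp details may differ):

    static struct page *new_vma_page(struct page *page, unsigned long private,
                                     int **result)
    {
            struct vm_area_struct *vma = (struct vm_area_struct *)private;
            unsigned long address = 0;

            while (vma) {
                    address = page_address_in_vma(page, vma);
                    if (address != -EFAULT)
                            break;
                    vma = vma->vm_next;     /* assumes address-ordered list */
            }
            /* vma == NULL falls back to task or system default policy */
            return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    }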
-
This patch fixes a wrong array index in allocation failure handling.
Cc: Pekka Enberg
Acked-by: Christoph Lameter
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2007
2 commits
-
This reverts commit 5adc5be7cd1bcef6bb64f5255d2a33f20a3cf5be.
Alexey Dobriyan reports that it causes huge slowdowns under some loads,
in his case a "mkfs.ext2" on a 30G partition. With the placement bias,
the mkfs took over four minutes, with it reverted it's back to about ten
seconds for Alexey.

Reported-and-tested-by: Alexey Dobriyan
Cc: Mel Gorman
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since the macro "for_each_object" was introduced, the "end" variable has become unused.
Signed-off-by: Denis Cheng
Signed-off-by: Linus Torvalds
06 Nov, 2007
2 commits
-
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest:
lguest: tidy up documentation
kernel/futex.c: make 3 functions static
unexport access_process_vm
lguest: make async_hcall() static
-
Fix the memory leak that may occur when we attempt to reuse a cpu_slab
that was allocated while we reenabled interrupts in order to be able to
grow a slab cache.

The per cpu freelist may contain objects and in that situation we may
overwrite the per cpu freelist pointer, losing objects. This only
occurs if we find that the concurrently allocated slab fits our
allocation needs.

If we simply always deactivate the slab then the freelist will be
properly reintegrated and the memory leak will go away.

Signed-off-by: Christoph Lameter
Acked-by: Hugh Dickins
Signed-off-by: Linus Torvalds
05 Nov, 2007
1 commit
-
This patch removes the no longer used EXPORT_SYMBOL_GPL(access_process_vm).
Signed-off-by: Adrian Bunk
Signed-off-by: Rusty Russell
01 Nov, 2007
1 commit
-
The kernel has for random historical reasons allowed ptrace() accesses
to access (and insert) pages into the page cache above the size of the
file.

However, Nick broke that by mistake when doing the new fault handling in
commit 54cb8821de07f2ffcd28c380ce9b93d5784b40d7 ("mm: merge populate and
nopage into fault (fixes nonlinear)". The breakage caused a hang with
gdb when trying to access the invalid page.

The ptrace "feature" really isn't worth resurrecting, since it really is
wrong both from a portability _and_ from an internal page cache validity
standpoint. So this removes those old broken remnants, and fixes the
ptrace() hang in the process.

Noticed and bisected by Duane Griffin, who also supplied a test-case
(quoth Nick: "Well that's probably the best bug report I've ever had,
thanks Duane!").Cc: Duane Griffin
Acked-by: Nick Piggin
Signed-off-by: Linus Torvalds
31 Oct, 2007
1 commit
-
Commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate
clean pages before dio write") introduced a bug which stopped dio from
ever invalidating the page cache after writes. It still invalidated it
before writes so most users were fine.

Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
this bug when he had a buffered reader immediately reading file data
after an O_DIRECT writer had written the data. The kernel issued
read-ahead beyond the position of the reader which overlapped with the
O_DIRECT writer. The failure to invalidate after writes caused the
reader to see stale data from the read-ahead.

The following patch is originally from Karl. The following commentary
is his:

The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.

I added a test to the aio-dio-regress repository which mimics Karl's IO
pattern. It verified the bad behaviour and that the patch fixed it. I
agree with Karl, this still doesn't help the case where a buffered
reader follows an AIO O_DIRECT writer. That will require a bit more
work.

This gives up on the idea of returning EIO to indicate to userspace that
stale data remains if the invalidation failed.

Signed-off-by: Zach Brown
Cc: Karl Schendel
Cc: Benjamin LaHaise
Cc: Andrew Morton
Cc: Nick Piggin
Cc: Leonid Ananiev
Cc: Chris Mason
Signed-off-by: Linus Torvalds
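
The essence of the fix, roughly (a sketch of the generic_file_direct_write()
hunk; 'end' is the last page of the write range, computed earlier):

    written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);

    /*
     * Try again to invalidate clean pages no matter what direct_IO
     * returned, so read-ahead that raced with the write cannot leave
     * stale data behind.
     */
    if (mapping->nrpages)
            invalidate_inode_pages2_range(mapping,
                            pos >> PAGE_CACHE_SHIFT, end);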
30 Oct, 2007
3 commits
-
It's possible to provoke unionfs (not yet in mainline, though in mm and
some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect
it's possible to provoke the 2.6.23 ecryptfs in the same way (but the
2.6.24 ecryptfs no longer calls lower level's ->writepage).

This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could
leak from tmpfs via write_cache_pages and unionfs to userspace. There's
already a fix (e423003028183df54f039dfda8b58c49e78c89d7 - writeback: don't
propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so
far as it goes; but insufficient because it doesn't address the underlying
issue, that shmem_writepage expects to be called only by vmscan (relying on
backing_dev_info capabilities to prevent the normal writeback path from
ever approaching it).

That's an increasingly fragile assumption, and ramdisk_writepage (the other
source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check
wbc->for_reclaim before returning it. Make the same check in
shmem_writepage, thereby sidestepping the page_mapped BUG also.

Signed-off-by: Hugh Dickins
Cc: Erez Zadok
Reviewed-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
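
One plausible shape of the check (a sketch only; the real patch may
differ in detail):

    static int shmem_writepage(struct page *page, struct writeback_control *wbc)
    {
            BUG_ON(!PageLocked(page));
            if (!wbc->for_reclaim) {
                    /* not vmscan: redirty and bail out, sidestepping
                     * the BUG_ON(page_mapped(page)) below */
                    set_page_dirty(page);
                    unlock_page(page);
                    return 0;
            }
            BUG_ON(page_mapped(page));
            /* ... normal swap-out path ... */
    }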
-
mm/sparse-vmemmap.c uses init_mm in some places. However, it is not
present in any of the headers currently included in the file.

init_mm is defined as extern in sched.h, so we add that to the header list.

Up to now, this problem was masked by the fact that functions like
set_pte_at() and pmd_populate_kernel() are usually macros that expand to
simpler variants that do not use the first parameter at all.

Signed-off-by: Glauber de Oliveira Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
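
The fix is just the missing declaration; init_mm comes from sched.h:

    #include <linux/sched.h>    /* for init_mm */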
-
This reverts commit 2e1c49db4c640b35df13889b86b9d62215ade4b6.
First off, testing in Fedora has shown it to cause boot failures,
bisected down by Martin Ebourne, and reported by Dave Jobes. So the
commit will likely be reverted in the 2.6.23 stable kernels.

Secondly, in the 2.6.24 model, x86-64 has now grown support for
SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the
bug is not visible any more, it's become invisible due to the code just
being irrelevant and no longer enabled on the only architecture that
this ever affected.

Reported-by: Dave Jones
Tested-by: Martin Ebourne
Cc: Zou Nan hai
Cc: Suresh Siddha
Cc: Andrew Morton
Acked-by: Andy Whitcroft
Signed-off-by: Linus Torvalds
29 Oct, 2007
4 commits
-
mm/nommu.c needs to #include linux/module.h for it to understand EXPORT_*()
macros.

Signed-off-by: David Howells
Signed-off-by: Linus Torvalds
-
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
compat_ioctl: fix block device compat ioctl regression
[BLOCK] Fix bad sharing of tag busy list on queues with shared tag maps
Fix a build error when BLOCK=n
block: use lock bitops for the tag map.
cciss: update copyright notices
cfq_get_queue: fix possible NULL pointer access
blk_sync_queue() should cancel request_queue->unplug_work
cfq_exit_queue() should cancel cfq_data->unplug_work
block layer: remove an unused argument of drive_stat_acct()
-
nr_slabs is atomic_long_t, not atomic_t
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds
-
mm/filemap.c: In function '__filemap_fdatawrite_range':
mm/filemap.c:200: error: implicit declaration of function
'mapping_cap_writeback_dirty'

This happens when we don't use/have any block devices and an NFS root
filesystem is used.

mapping_cap_writeback_dirty() is defined in linux/backing-dev.h which
used to be provided in mm/filemap.c by linux/blkdev.h until commit
f5ff8422bbdd59f8c1f699df248e1b7a11073027 (Fix warnings with
!CONFIG_BLOCK).

Signed-off-by: Emil Medve
Signed-off-by: Jens Axboe
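
The fix is presumably just the missing include in mm/filemap.c:

    #include <linux/backing-dev.h>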
23 Oct, 2007
1 commit
-
Fix mprotect bug in recent commit 3ed75eb8f1cd89565966599c4f77d2edb086d5b0
(setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify
case was setting the same prot as when not.

Nothing wrong with the use of protection_map[] in mmap_region(),
but use vm_get_page_prot() there too in the same ~VM_SHARED way.

Signed-off-by: Hugh Dickins
Cc: Coly Li
Cc: Tony Luck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
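
The mmap_region() side, roughly (a sketch; believed close to the actual
hunk):

    if (vma_wants_writenotify(vma))
            vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);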
22 Oct, 2007
4 commits
-
Now that nfsd has stopped writing to the find_exported_dentry member we can
mark the export_operations const.

Signed-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Cc: Dave Kleikamp
Cc: Anton Altaparmakov
Cc: David Chinner
Cc: Timothy Shimmin
Cc: OGAWA Hirofumi
Cc: Hugh Dickins
Cc: Chris Mason
Cc: Jeff Mahoney
Cc: "Vladimir V. Saveliev"
Cc: Steven Whitehouse
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I'm not sure what people were thinking when adding support to export tmpfs,
but here's the conversion anyway:

Signed-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fix a panic due to accessing a NULL pointer of kmem_cache_node at
discard_slab() after memory online.

When memory online is called, kmem_cache_nodes are created for all SLUBs
for the new node whose memory is available.

slab_mem_going_online_callback() is called to make the kmem_cache_node in
the callback of the memory online event. If it (or another callback) fails,
then slab_mem_offline_callback() is called for rollback.

In memory offline, slab_mem_going_offline_callback() is called to shrink
all slub caches, then slab_mem_offline_callback() is called later.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: locking fix]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
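
The dispatch described above, as a sketch (event names per the memory
notifier rework in the next entry below):

    static int slab_memory_callback(struct notifier_block *self,
                                    unsigned long action, void *arg)
    {
            int ret = 0;

            switch (action) {
            case MEM_GOING_ONLINE:
                    ret = slab_mem_going_online_callback(arg);
                    break;
            case MEM_GOING_OFFLINE:
                    ret = slab_mem_going_offline_callback(arg);
                    break;
            case MEM_OFFLINE:
            case MEM_CANCEL_ONLINE:
                    slab_mem_offline_callback(arg);
                    break;
            }
            return ret;
    }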
-
The current memory notifier still has some defects. (Fortunately, nothing
uses it yet.) This patch fixes and rearranges them:

- Add information of start_pfn, nr_pages, and node id if the node status
changes from/to a memoryless node, for callback functions.
Callbacks can't do anything without that information.
- Add notification of going-online status.
It is necessary for creating per node structure before the node's
pages are available.
- Move GOING_OFFLINE status notification after page isolation.
It is a good place for callbacks to return memory such as caches,
because returned pages are not used again.
- Make CANCEL events for rolling back when an error occurs.
- Delete MEM_MAPPING_INVALID notification. It will not be used.
- Fix compile error of (un)register_memory_notifier().

Signed-off-by: Yasunori Goto
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Oct, 2007
1 commit
-
Wrong order of arguments
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds
20 Oct, 2007
2 commits
-
Signed-off-by: Adrian Bunk
-
Typo fixes retrun -> return
Signed-off-by: Gabriel Craciunescu
Signed-off-by: Adrian Bunk