07 Jan, 2009

31 commits

  • cgroup_mm_owner_callbacks() was brought in to support the memrlimit
    controller, but sneaked into mainline ahead of it. That controller has
    now been shelved, and the mm_owner_changed() args were inadequate for it
    anyway (they needed an mm pointer instead of a task pointer).

    Remove the dead code, and restore mm_update_next_owner() locking to how it
    was before: taking mmap_sem there does nothing for memcontrol.c, now the
    only user of mm->owner.

    Signed-off-by: Hugh Dickins
    Cc: Paul Menage
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It is known that buffer_mapped() is false in this code path.

    Signed-off-by: Franck Bui-Huu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Franck Bui-Huu
     
  • Make the pte-level function in apply_to_range be called in lazy mmu mode,
    so that any pagetable modifications can be batched.
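
    A minimal sketch of the idea, simplified from mm/memory.c of this era
    (the locking and the pte_fn_t details are paraphrased):

        static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
                                      unsigned long addr, unsigned long end,
                                      pte_fn_t fn, void *data)
        {
                spinlock_t *ptl;
                pte_t *pte;
                int err = 0;

                pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
                if (!pte)
                        return -ENOMEM;

                arch_enter_lazy_mmu_mode();     /* start batching PTE updates */
                do {
                        err = fn(pte++, pmd_pgtable(*pmd), addr, data);
                        if (err)
                                break;
                } while (addr += PAGE_SIZE, addr != end);
                arch_leave_lazy_mmu_mode();     /* issue the batched updates */

                pte_unmap_unlock(pte - 1, ptl);
                return err;
        }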

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Venkatesh Pallipadi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Lazy unmapping in the vmalloc code has now opened the possibility for
    use-after-free bugs to go undetected. We can catch those by forcing an
    unmap and flush (which is going to be slow, but that's what happens).
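
    The debug hook, paraphrased from mm/vmalloc.c (it only does anything
    when CONFIG_DEBUG_PAGEALLOC is set):

        static void vmap_debug_free_range(unsigned long start, unsigned long end)
        {
        #ifdef CONFIG_DEBUG_PAGEALLOC
                /*
                 * Unmap the page tables and flush the TLB right away rather
                 * than lazily, so a use-after-free access faults immediately
                 * instead of silently hitting the stale mapping.
                 */
                vunmap_page_range(start, end);
                flush_tlb_kernel_range(start, end);
        #endif
        }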

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The vmalloc purge lock can be a mutex so we can sleep while a purge is
    going on (purge involves a global kernel TLB invalidate, so it can take
    quite a while).
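
    A sketch of the change (paraphrased; in the real code the lock lives
    inside __purge_vmap_area_lazy()):

        static DEFINE_MUTEX(purge_lock);        /* was a spinlock */

        static void __purge_vmap_area_lazy(unsigned long *start,
                                           unsigned long *end,
                                           int sync, int force_flush)
        {
                if (sync)
                        mutex_lock(&purge_lock);   /* sleep until the purge is done */
                else if (!mutex_trylock(&purge_lock))
                        return;                    /* someone else is purging */

                /* ... free the lazy areas, then flush_tlb_kernel_range() ... */

                mutex_unlock(&purge_lock);
        }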

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If the vmalloc convenience wrappers simply forward to __vmalloc, output
    of files like /proc/vmallocinfo will show things like "vmalloc_32",
    "vmalloc_user", or whichever wrapper was the caller, as the caller.
    This info is not as useful as the real caller of the allocation.

    So, the proposal is to call __vmalloc_node directly, with matching
    parameters, to preserve the caller information.
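
    For example (matching mm/vmalloc.c of this era, lightly trimmed):

        void *vmalloc_32(unsigned long size)
        {
                /*
                 * Call __vmalloc_node directly so /proc/vmallocinfo records
                 * the wrapper's caller rather than "vmalloc_32" itself.
                 */
                return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL,
                                      -1, __builtin_return_address(0));
        }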

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • If we can't service a vmalloc allocation, show size of the allocation that
    actually failed. Useful for debugging.
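
    Something along these lines (a hedged sketch, not the exact hunk):

        if (!area) {
                if (printk_ratelimit())
                        printk(KERN_WARNING
                               "vmalloc: allocation failure: %lu bytes\n",
                               size);
                return NULL;
        }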

    Signed-off-by: Glauber Costa
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • File pages mapped only in sequentially read mappings are perfect reclaim
    candidates.

    This patch makes these mappings behave like weak references, their pages
    will be reclaimed unless they have a strong reference from a normal
    mapping as well.

    It changes the reclaim and the unmap path where they check if the page has
    been referenced. In both cases, accesses through sequentially read
    mappings will be ignored.
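
    The reclaim-side check, paraphrased from page_referenced_one() in
    mm/rmap.c:

        if (ptep_clear_flush_young_notify(vma, address, pte)) {
                /*
                 * Don't count accesses through VM_SEQ_READ mappings; pages
                 * only referenced this way stay easy reclaim targets.
                 */
                if (likely(!VM_SequentialReadHint(vma)))
                        referenced++;
        }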

    Benchmark results from KOSAKI Motohiro:

    http://marc.info/?l=linux-mm&m=122485301925098&w=2

    Signed-off-by: Johannes Weiner
    Signed-off-by: Rik van Riel
    Acked-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • #ifdefs in *.c files decrease source readability a bit, so removing
    them is better.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The speculative page references patch (commit
    e286781d5f2e9c846e012a39653a166e9d31777d) removed the last
    pagevec_release_nonlru() caller.

    So this function can be removed now.

    This patch doesn't have any functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Don't print the size of the zone's memmap array if it does not have one.

    Impact: cleanup

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Show node to memory section relationship with symlinks in sysfs

    Add /sys/devices/system/node/nodeX/memoryY symlinks for all
    the memory sections located on nodeX. For example:
    /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
    indicates that memory section 135 resides on node1.
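
    A sketch of how the link is created (the function name is the one added
    by this change; the body is paraphrased and only the node-to-section
    link is shown):

        static int register_mem_sect_under_node(struct memory_block *mem_blk,
                                                int nid)
        {
                /* node_devices[] is the array of registered node sysdevs */
                return sysfs_create_link_nowarn(
                                &node_devices[nid].sysdev.kobj,
                                &mem_blk->sysdev.kobj,
                                kobject_name(&mem_blk->sysdev.kobj));
        }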

    Also revises documentation to cover this change as well as updating
    Documentation/ABI/testing/sysfs-devices-memory to include descriptions
    of memory hotremove files 'phys_device', 'phys_index', and 'state'
    that were previously not described there.

    In addition to it always being a good policy to provide users with
    the maximum possible amount of physical location information for
    resources that can be hot-added and/or hot-removed, the following
    are some (but likely not all) of the user benefits provided by
    this change.

    Immediate:
    - Provides information needed to determine the specific node on which a
      defective DIMM is located. This will reduce system downtime when the
      node or defective DIMM is swapped out.
    - Prevents unintended onlining of a memory section that was previously
      offlined due to a defective DIMM. This could happen during node
      hot-add when the user or node hot-add assist script onlines _all_
      offlined sections due to user or script inability to identify the
      specific memory sections located on the hot-added node. The
      consequences of reintroducing the defective memory could be ugly.
    - Provides information needed to vary the amount and distribution of
      memory on specific nodes for testing or debugging purposes.

    Future:
    - Will provide information needed to identify the memory sections that
      need to be offlined prior to physical removal of a specific node.

    Symlink creation during boot was tested on 2-node x86_64, 2-node
    ppc64, and 2-node ia64 systems. Symlink creation during physical
    memory hot-add tested on a 2-node x86_64 system.

    Signed-off-by: Gary Hade
    Signed-off-by: Badari Pulavarty
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary Hade
     
  • Chris Mason noticed that do_sync_mapping_range didn't actually ask for
    data integrity writeout. Unfortunately, it is advertised as being usable
    for data integrity operations.

    This is a data integrity bug.
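
    The fix is essentially one flag, paraphrased from do_sync_mapping_range()
    in fs/sync.c:

        if (flags & SYNC_FILE_RANGE_WRITE) {
                ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                 WB_SYNC_ALL);  /* was WB_SYNC_NONE */
                if (ret < 0)
                        goto out;
        }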

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Now that we have the early-termination logic in place, it makes sense to
    bail out early in all other cases where done is set to 1.
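
    In outline, the scan now checks done at every level (paraphrased):

        while (!done && (index <= end)) {
                int i, nr_pages;

                nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
                                              PAGECACHE_TAG_DIRTY,
                                              PAGEVEC_SIZE);
                if (nr_pages == 0)
                        break;

                for (i = 0; i < nr_pages; i++) {
                        /* ... lock, recheck and write the page; errors and
                           nr_to_write exhaustion set done = 1 ... */
                        if (done)
                                break;  /* bail out immediately */
                }
                pagevec_release(&pvec);
                cond_resched();
        }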

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Terminate the write_cache_pages loop upon encountering the first page
    past end, without locking the page. Pages cannot have their index change
    while we have a reference on them (e.g. truncate_inode_pages_range
    performs the same check without the page lock).
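
    Paraphrased from the loop body:

        struct page *page = pvec.pages[i];

        /*
         * The index is stable while we hold a reference on the page,
         * so this test needs no page lock.
         */
        if (page->index > end) {
                done = 1;
                break;
        }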

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if we get stuck behind another process that is
    cleaning pages, we will be forced to wait for them to finish, then perform
    our own writeout (if it was redirtied during the long wait), then wait for
    that.

    If a page under writeout is still clean, we can skip waiting for it (if
    we're part of a data integrity sync, we'll be waiting for all writeout
    pages afterwards, so we'll still be waiting for the other guy's write
    that's cleaned the page).
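
    Roughly (paraphrased):

        lock_page(page);

        if (PageWriteback(page)) {
                if (wbc->sync_mode != WB_SYNC_NONE)
                        wait_on_page_writeback(page);  /* integrity: must wait */
                else
                        goto continue_unlock;          /* still clean: skip it */
        }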

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Get rid of some complex expressions from flow control statements, add a
    comment, remove some duplicate code.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
    so the function will return success after writing out nr_to_write pages,
    even if that was not sufficient to guarantee data integrity.

    The callers tend to set it to values that could easily break data
    integrity semantics in practice. For example, nr_to_write can be set to
    mapping->nrpages * 2; however, if a file has a single dirty page and
    fsync is called, subsequent pages might be concurrently added and
    dirtied, and write_cache_pages might write out two of these newly
    dirtied pages while not writing out the old page that should have been
    written out.

    Fix this by ignoring nr_to_write if it is a data integrity sync.

    This is a data integrity bug.

    The reason this has been done in the past is to avoid stalling sync
    operations behind page dirtiers.

    "If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow."

    What we do today is return success after an arbitrary number of pages
    are written, whether or not we have provided the data-integrity
    semantics that the caller has asked for. Even this doesn't actually fix
    all stall cases completely: in the above situation, if the file has a
    huge number of pages in pagecache (but not dirty), then mapping->nrpages
    is going to be huge, even if pages are being dirtied.

    This change does indeed make the possibility of long stalls larger, and
    that's not a good thing, but lying about data integrity is even worse.
    We have to either perform the sync, or return -ELINUXISLAME so at least
    the caller knows what has happened.

    There are subsequent competing approaches in the works to solve the stall
    problems properly, without compromising data integrity.
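
    The check becomes conditional on the sync mode (paraphrased):

        if (nr_to_write > 0) {
                nr_to_write--;
                if (nr_to_write == 0 && wbc->sync_mode == WB_SYNC_NONE) {
                        /*
                         * Only honour nr_to_write for non-integrity
                         * writeback; an integrity sync must keep going.
                         */
                        done = 1;
                        break;
                }
        }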

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, if ret signals a real error but we still have some
    pages left in the pagevec, done would be set to 1, yet the remaining
    pages would continue to be processed and ret would be overwritten in the
    process.

    It could easily be overwritten with success, and thus success will be
    returned even if there is an error. Thus the caller is told all writes
    succeeded, whereas in reality some did not.

    Fix this by bailing immediately if there is an error, and retaining the
    first error code.

    This is a data integrity bug.
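
    Paraphrased:

        ret = (*writepage)(page, wbc, data);
        if (unlikely(ret)) {
                if (ret == AOP_WRITEPAGE_ACTIVATE) {
                        unlock_page(page);
                        ret = 0;
                } else {
                        /* a real error: stop here and preserve this ret */
                        done = 1;
                        break;
                }
        }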

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • We'd like to break out of the loop early in many situations, however the
    existing code has been setting mapping->writeback_index past the final
    page in the pagevec lookup for cyclic writeback. This is a problem if we
    don't process all pages up to the final page.

    Currently the code mostly keeps writeback_index reasonable and works
    around this by not breaking out of the loop or writing pages outside the
    range in these cases. Keep track of a real "done index" that enables us
    to terminate the loop in a much more flexible manner.

    Needed by the subsequent patch to preserve writepage errors, and then
    further patches to break out of the loop early for other reasons. However
    there are no functional changes with this patch alone.
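
    In outline (paraphrased):

        /* inside the loop, as each page is taken off the pagevec: */
        done_index = page->index + 1;   /* where a later scan should resume */

        /* after the loop, for cyclic writeback: */
        if (wbc->range_cyclic)
                mapping->writeback_index = done_index;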

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In write_cache_pages, scanned == 1 is supposed to mean that cyclic
    writeback has circled through zero, thus we should not circle again.
    However it gets set to 1 after the first successful pagevec lookup. This
    leads to cases where not enough data gets written.

    Counterexample: file with first 10 pages dirty, writeback_index == 5,
    nr_to_write == 10. Then the 5 last pages will be found, and scanned will
    be set to 1, after writing those out, we will not cycle back to get the
    first 5.

    Rework this logic: now we'll always cycle unless we started off from
    index 0. When cycling, only write out as far as one page before the
    start page from the first cycle (so we don't write parts of the file
    twice).
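
    The reworked control flow, in outline (paraphrased):

        if (wbc->range_cyclic) {
                writeback_index = mapping->writeback_index;
                index = writeback_index;
                cycled = (index == 0);  /* starting at 0: no wrap needed */
                end = -1;
        } else {
                index = wbc->range_start >> PAGE_CACHE_SHIFT;
                end = wbc->range_end >> PAGE_CACHE_SHIFT;
                cycled = 1;             /* ignore the cyclic tests */
        }
        retry:
        /* ... the main scan loop runs from index to end ... */

        if (wbc->range_cyclic && !cycled && !done) {
                /* wrap around once, stopping just short of the first start */
                cycled = 1;
                index = 0;
                end = writeback_index - 1;
                goto retry;
        }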

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • While tracing I/O patterns with blktrace (a great tool) a few weeks ago I
    identified a minor issue in fs/mpage.c

    As the comment above mpage_readpages() says, a filesystem's get_block
    function will set BH_Boundary when it maps a block just before a block
    for which extra I/O is required.

    Since get_block() can map a range of pages, for all these pages the
    BH_Boundary flag will be set. But we only need to push what I/O we have
    accumulated at the last block of this range.

    This makes do_mpage_readpage() send out the largest possible bio instead
    of a bunch of page-sized ones in the BH_Boundary case.
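
    The gist, heavily paraphrased from do_mpage_readpage():

        /*
         * get_block() may have mapped several blocks at once; the
         * BH_Boundary hint only matters for the last block of that
         * extent, so remember where the extent ends and submit the
         * accumulated bio only once we actually reach that block.
         */
        if (buffer_boundary(map_bh))
                boundary_block = map_bh->b_blocknr +
                                 (map_bh->b_size >> blkbits) - 1;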

    Signed-off-by: Miquel van Smoorenburg
    Cc: Nick Piggin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miquel van Smoorenburg
     
  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.
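
    A simplified sketch of the helper this adds (the real version formats
    the nodemask into a static buffer under a spinlock; nodemask_to_str is
    a hypothetical stand-in here):

        void cpuset_print_task_mems_allowed(struct task_struct *tsk)
        {
                task_lock(tsk);         /* pin tsk's cpuset before dereferencing */
                printk(KERN_INFO "%s cpuset=%s mems_allowed=%s\n",
                       tsk->comm,
                       task_cs(tsk)->css.cgroup->dentry->d_name.name,
                       nodemask_to_str(tsk->mems_allowed)); /* hypothetical */
                task_unlock(tsk);
        }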

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.
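
    A simplified sketch of the new hook (paraphrased from mm/oom_kill.c of
    this era):

        void pagefault_out_of_memory(void)
        {
                unsigned long freed = 0;

                blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
                if (freed > 0)
                        return;         /* a notifier freed some memory */

                if (sysctl_panic_on_oom)
                        panic("out of memory from page fault\n");

                read_lock(&tasklist_lock);
                __out_of_memory(0, 0);  /* gfp_mask and order are unknown here */
                read_unlock(&tasklist_lock);

                /* give the victim a chance to exit before the fault retries */
                if (!test_thread_flag(TIF_MEMDIE))
                        schedule_timeout_uninterruptible(1);
        }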

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • pp->page is never used when not set to the right page, so there is no need
    to set it to ZERO_PAGE(0) by default.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Rework do_pages_move() to work by page-sized chunks of struct
    page_to_node that are passed to do_move_page_to_node_array(). We now
    only have to allocate a single page instead of a possibly very large
    vmalloc area to store all page_to_node entries.

    As a result, new_page_node() will now have a very small lookup, hiding
    much of the overall sys_move_pages() overhead.
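
    In outline (paraphrased; DO_PAGES_PER_CHUNK stands in for the real
    chunk-size computation):

        struct page_to_node *pm;
        unsigned long chunk_start, chunk_nr;
        int err = -ENOMEM;

        pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
        if (!pm)
                goto out;

        for (chunk_start = 0;
             chunk_start < nr_pages;
             chunk_start += chunk_nr) {
                chunk_nr = min(nr_pages - chunk_start,
                               (unsigned long)DO_PAGES_PER_CHUNK);

                /* copy this chunk of user requests into pm[], resolve
                   the target nodes, then migrate just this chunk */
                err = do_move_page_to_node_array(mm, pm,
                                                 flags & MPOL_MF_MOVE_ALL);
                if (err)
                        break;
        }
        free_page((unsigned long)pm);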

    Signed-off-by: Brice Goglin
    Signed-off-by: Nathalie Furmento
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • Following "mm: don't mark_page_accessed in fault path", which now
    places a mark_page_accessed() in zap_pte_range(), we should remove
    the mark_page_accessed() from shmem_fault().

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at
    unmap-time if the pte is young has a number of problems.

    mark_page_accessed is supposed to be roughly the equivalent of a young pte
    for unmapped references. Unfortunately it doesn't come with any context:
    after being called, reclaim doesn't know who or why the page was touched.

    So calling mark_page_accessed not only adds extra lru or PG_referenced
    manipulations for pages that are already going to have pte_young ptes
    anyway, but it also adds these references which are difficult to work
    with from the context of vma-specific references (e.g. MADV_SEQUENTIAL
    pte_young may not wish to contribute to the page being referenced).

    Then again, simply doing SetPageReferenced when zapping a pte and
    finding it is young is not a really good solution either.
    SetPageReferenced does not correctly promote the page to the active
    list, for example. So after removing mark_page_accessed from the fault
    path, several mmap()+touch+munmap() sequences would have a very
    different result from several read(2) calls, which is not really
    desirable.
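
    The replacement in the unmap path looks like this (paraphrased from
    zap_pte_range() in mm/memory.c):

        if (pte_young(ptent) && likely(!VM_SequentialReadHint(vma)))
                mark_page_accessed(page);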

    Signed-off-by: Nick Piggin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    whereby a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.
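
    Together with the KernelPageSize entry introduced below, the output is
    produced roughly like so (paraphrased from fs/proc/task_mmu.c;
    vma_kernel_pagesize() and vma_mmu_pagesize() are the helpers these
    patches add):

        seq_printf(m, "KernelPageSize: %8lu kB\n",
                   vma_kernel_pagesize(vma) >> 10);
        seq_printf(m, "MMUPageSize:    %8lu kB\n",
                   vma_mmu_pagesize(vma) >> 10);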

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Jan, 2009

9 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
    dm snapshot: extend exception store functions
    dm snapshot: split out exception store implementations
    dm snapshot: rename struct exception_store
    dm snapshot: separate out exception store interface
    dm mpath: move trigger_event to system workqueue
    dm: add name and uuid to sysfs
    dm table: rework reference counting
    dm: support barriers on simple devices
    dm request: extend target interface
    dm request: add caches
    dm ioctl: allow dm_copy_name_and_uuid to return only one field
    dm log: ensure log bitmap fits on log device
    dm log: move region_size validation
    dm log: avoid reinitialising io_req on every operation
    dm: consolidate target deregistration error handling
    dm raid1: fix error count
    dm log: fix dm_io_client leak on error paths
    dm snapshot: change yield to msleep
    dm table: drop reference at unbind

    Linus Torvalds
     
  • Supply dm_add_exception as a callback to the read_metadata function.
    Add a status function ready for a later patch and name the functions
    consistently.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Move the existing snapshot exception store implementations out into
    separate files. Later patches will place these behind a new
    interface in preparation for alternative implementations.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Rename struct exception_store to dm_exception_store.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Pull structures that bridge the gap between snapshot and
    exception store out of dm-snap.h and put them in a new
    .h file - dm-exception-store.h. This file will define the
    API for new exception stores.

    Ultimately, dm-snap.h is unnecessary, since only dm-snap.c
    should be using it.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • The same workqueue is used both for sending uevents and processing
    queued I/O. Deadlock has been reported in RHEL5 when sending a uevent
    was blocked waiting for the queued I/O to be processed. Use
    schedule_work() for the asynchronous uevents instead.
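
    The change itself is tiny (paraphrased from drivers/md/dm-mpath.c):

        /* was: queue_work(kmultipathd, &m->trigger_event); */
        schedule_work(&m->trigger_event);       /* system workqueue instead */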

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Implement simple read-only sysfs entry for device-mapper block device.

    This patch adds a simple sysfs directory named "dm" under block device
    properties and implements
    - name attribute (string containing mapped device name)
    - uuid attribute (string containing UUID, or empty string if not set)

    The kobject is embedded in mapped_device struct, so no additional
    memory allocation is needed for initializing sysfs entry.

    During the processing of a sysfs attribute we need to lock the mapped
    device, which is done by a new function, dm_get_from_kobj, which returns
    the md associated with the kobject and increases the usage count.

    Each 'show attribute' function is responsible for its own locking.
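
    For example, the name attribute can be implemented like this
    (paraphrased from drivers/md/dm-sysfs.c):

        static ssize_t dm_attr_name_show(struct mapped_device *md, char *buf)
        {
                if (dm_copy_name_and_uuid(md, buf, NULL))
                        return -EIO;

                strcat(buf, "\n");
                return strlen(buf);
        }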

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Rework table reference counting.

    The existing code uses a reference counter. When the last reference is
    dropped and the counter reaches zero, the table destructor is called.
    Table reference counters are acquired/released from upcalls from other
    kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
    If the reference counter reaches zero in one of the upcalls, the table
    destructor is called from almost random kernel code.

    This leads to various problems:
    * dm_any_congested being called under a spinlock, which calls the
    destructor, which calls some sleeping function.
    * the destructor attempting to take a lock that is already taken by the
    same process.
    * stale reference from some other kernel code keeps the table
    constructed, which keeps some devices open, even after successful
    return from "dmsetup remove". This can confuse lvm and prevent closing
    of underlying devices or reusing device minor numbers.

    The patch changes reference counting so that the table destructor can be
    called only at predetermined places.

    The table has always exactly one reference from either mapped_device->map
    or hash_cell->new_map. After this patch, this reference is not counted
    in table->holders. A pair of dm_create_table/dm_destroy_table functions
    is used for table creation/destruction.

    Temporary references from the other code increase table->holders. A pair
    of dm_table_get/dm_table_put functions is used to manipulate it.

    When the table is about to be destroyed, we wait for table->holders to
    reach 0. Then, we call the table destructor. We use active waiting with
    msleep(1), because the situation happens rarely (to one user in 5 years)
    and removing the device isn't a performance-critical task: the user
    doesn't care if it takes one tick more or not.

    This way, the destructor is called only at specific points
    (dm_table_destroy function) and the above problems associated with lazy
    destruction can't happen.

    Finally remove the temporary protection added to dm_any_congested().
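
    The destruction path, in outline (paraphrased; the freeing logic itself
    is elided):

        void dm_table_put(struct dm_table *t)
        {
                if (t)
                        atomic_dec(&t->holders);
        }

        void dm_table_destroy(struct dm_table *t)
        {
                /* wait out transient holders; rare, so polling is fine */
                while (atomic_read(&t->holders))
                        msleep(1);
                smp_mb();

                /* ... free targets, release devices, kfree(t) ... */
        }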

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Implement barrier support for single device DM devices

    This patch implements barrier support in DM for the common case of dm linear
    just remapping a single underlying device. In this case we can safely
    pass the barrier through because there can be no reordering between
    devices.

    NB. Any DM device might cease to support barriers if it gets
    reconfigured so code must continue to allow for a possible
    -EOPNOTSUPP on every barrier bio submitted. - agk
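
    In outline, the table-level check might look like this (a hedged
    sketch: the DM_TARGET_SUPPORTS_BARRIERS feature flag on target_type is
    assumed here; the real patch may gate on the specific target type
    instead):

        static bool dm_table_barrier_ok(struct dm_table *t)
        {
                unsigned i, num = dm_table_get_num_targets(t);

                for (i = 0; i < num; i++) {
                        struct dm_target *ti = dm_table_get_target(t, i);

                        if (!(ti->type->features & DM_TARGET_SUPPORTS_BARRIERS))
                                return false;
                }
                return true;
        }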

    Signed-off-by: Andi Kleen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Andi Kleen