08 Dec, 2006
40 commits
-
Make swsusp use block device offsets instead of swap offsets to identify swap
locations and make it use the same code paths for writing as well as for
reading data.

This allows us to use the same code for handling swap files and swap
partitions and to simplify the code, e.g. by dropping rw_swap_page_sync().

Signed-off-by: Rafael J. Wysocki
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Rearrange the code in kernel/power/swap.c so that the next patch is more
readable.

[This patch only moves the existing code.]
Signed-off-by: Rafael J. Wysocki
Acked-by: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:

(1) swap files need not be contiguous,
(2) the header of a swap file is not in the first block of the partition
that holds it.

From swsusp's point of view (1) is not a problem, because it is already
taken care of by the swap-handling code, but (2) has to be taken into
consideration.

In principle the location of a swap file's header may be determined with the
help of the appropriate filesystem driver. Unfortunately, however, it requires
the filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during a resume from disk. For this reason we
need some other means by which swap areas can be identified.

For example, to identify a swap area we can use the partition that holds the
area and the offset from the beginning of this partition at which the swap
header is located.

The following patch allows swsusp to identify swap areas this way. It changes
swap_type_of() so that it takes an additional argument representing an offset
of the swap header within the partition represented by its first argument.

Signed-off-by: Rafael J. Wysocki
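
As a rough illustration of the description above, the sketch below identifies
a swap area by the pair (partition, header offset). The struct and the
function find_swap_area() are hypothetical user-space stand-ins, not the
kernel's swap_type_of(); treating a swap partition as "offset 0" is an
assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a swap area descriptor. */
struct swap_area {
	uint32_t device;        /* device number of the partition holding the area */
	uint64_t header_offset; /* offset of the swap header within that partition */
};

/*
 * Return the index of the swap area matching (device, offset), or -1 if
 * none does. In this sketch a swap partition is modelled as offset 0.
 */
int find_swap_area(const struct swap_area *areas, size_t n,
		   uint32_t device, uint64_t offset)
{
	for (size_t i = 0; i < n; i++) {
		if (areas[i].device == device &&
		    areas[i].header_offset == offset)
			return (int)i;
	}
	return -1;
}
```

The point of the extra argument is that the device alone is ambiguous once
swap files are allowed: two areas can live on the same partition and differ
only in where their headers start.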
Acked-by: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add an ioctl to the userspace swsusp code that enables the usage of the
pmops->prepare, pmops->enter and pmops->finish methods (the in-kernel
suspend knows these as "platform method"). These are needed on many
machines to (among others) speed up resuming by letting the BIOS skip some
steps or let my hp nx5000 recognise the correct ac_adapter state after
resume again.

It also ensures, on many machines, that changed hardware (unplugged AC
adapters) gets correctly detected and that kacpid does not run wild after
resume.

Signed-off-by: Stefan Seyfried
Cc: "Rafael J. Wysocki"
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Now that we have pci_get_bus_and_slot we can do the job correctly. Note that
some of these calls intentionally leak a device - this is because the device
in question is always needed from boot to reboot.

Signed-off-by: Alan Cox
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: Mariusz Kozlowski
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
While working on SH kprobes, I noticed that avr32 got the preemption
handling wrong in the no probe case. The idea is that upon entry of
kprobe_handler() preemption is disabled outright across the life of the
kprobe, only to be re-enabled in post_kprobe_handler().

However, in the event that the probe is never activated, there's never any
chance of hitting the post probe handler, which allows for the current
avr32 implementation to disable preemption indefinitely, as it's currently
missing a re-enable when no probe is activated.

Signed-off-by: Paul Mundt
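
The imbalance described above can be modelled in user space with a plain
counter. This is only a sketch: preempt_count and the helpers below are
simplified stand-ins that borrow the kernel's names, and the probe_found
argument is a hypothetical replacement for the real probe lookup.

```c
/* Hypothetical model: a counter standing in for the preemption count. */
static int preempt_count;

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* Called once an activated probe has run to completion. */
void post_kprobe_handler(void)
{
	preempt_enable();
}

/*
 * Returns 1 if a probe was activated (post_kprobe_handler() runs later and
 * re-enables preemption), 0 if no probe matched.
 */
int kprobe_handler(int probe_found)
{
	preempt_disable();
	if (probe_found)
		return 1;
	/* No probe: nothing will call the post handler, so undo it here. */
	preempt_enable();
	return 0;
}

int current_preempt_count(void)
{
	return preempt_count;
}
```

Without the preempt_enable() on the no-probe path, every miss would leave the
counter one higher than before, which is the indefinite disable the message
describes.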
Cc: Haavard Skinnemoen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch fixes the following compile error with
-Werror-implicit-function-declaration
(without -Werror-implicit-function-declaration it's a link error):

...
CC arch/frv/kernel/futex.o
/home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:
In function 'futex_atomic_op_inuser':
/home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:203:
error: implicit declaration of function 'pagefault_disable'
/home/bunk/linux/kernel-2.6/linux-2.6.19-rc6-mm2/arch/frv/kernel/futex.c:226:
error: implicit declaration of function 'pagefault_enable'
make[2]: *** [arch/frv/kernel/futex.o] Error 1
...

Signed-off-by: Adrian Bunk
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: Eric Sesterhenn
Signed-off-by: Alexey Dobriyan
Acked-By: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Make radix tree lookups safe to be performed without locks. Readers are
protected against nodes being deleted by using RCU based freeing. Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.

Each radix tree node keeps a record of its height (above leaf nodes).
This height does not change after insertion -- when the radix tree is
extended, higher nodes are only inserted in the top. So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.

"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.

When a reader wants to traverse the next branch, they will take a copy of
the pointer. This pointer will be either NULL (and the branch is empty) or
non-NULL (and will point to a valid node).

[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
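
The low-bit trick for "direct" pointers mentioned above can be sketched as
follows. The macro and helper names are made up for this example and are not
taken from the patch; the sketch also assumes that data items are at least
2-byte aligned, so that bit 0 of a real pointer is always clear.

```c
#include <stdint.h>

#define DIRECT_PTR 1UL	/* hypothetical name for the tag bit */

/* Tag a data item pointer so it can be stored directly in the root slot. */
static inline void *make_direct(void *item)
{
	return (void *)((uintptr_t)item | DIRECT_PTR);
}

/* Non-zero if the slot holds a tagged data item rather than a tree node. */
static inline int is_direct(const void *slot)
{
	return ((uintptr_t)slot & DIRECT_PTR) != 0;
}

/* Strip the tag bit to recover the original data item pointer. */
static inline void *direct_to_item(void *slot)
{
	return (void *)((uintptr_t)slot & ~(uintptr_t)DIRECT_PTR);
}
```

A reader that has copied the root pointer once can then decide from that copy
alone whether to return the item or to descend into a node.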
Signed-off-by: Nick Piggin
Cc: "Paul E. McKenney"
Signed-off-by: Lee Schermerhorn
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Before:
[acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
/* include2/asm/processor.h:542 */
struct mm_struct {
struct vm_area_struct * mmap; /* 0 4 */
struct rb_root mm_rb; /* 4 4 */
struct vm_area_struct * mmap_cache; /* 8 4 */
long unsigned int (*get_unmapped_area)(); /* 12 4 */
void (*unmap_area)(); /* 16 4 */
long unsigned int mmap_base; /* 20 4 */
long unsigned int task_size; /* 24 4 */
long unsigned int cached_hole_size; /* 28 4 */
/* ---------- cacheline 1 boundary ---------- */
long unsigned int free_area_cache; /* 32 4 */
pgd_t * pgd; /* 36 4 */
atomic_t mm_users; /* 40 4 */
atomic_t mm_count; /* 44 4 */
int map_count; /* 48 4 */
struct rw_semaphore mmap_sem; /* 52 64 */
spinlock_t page_table_lock; /* 116 40 */
struct list_head mmlist; /* 156 8 */
mm_counter_t _file_rss; /* 164 4 */
mm_counter_t _anon_rss; /* 168 4 */
long unsigned int hiwater_rss; /* 172 4 */
long unsigned int hiwater_vm; /* 176 4 */
long unsigned int total_vm; /* 180 4 */
long unsigned int locked_vm; /* 184 4 */
long unsigned int shared_vm; /* 188 4 */
/* ---------- cacheline 6 boundary ---------- */
long unsigned int exec_vm; /* 192 4 */
long unsigned int stack_vm; /* 196 4 */
long unsigned int reserved_vm; /* 200 4 */
long unsigned int def_flags; /* 204 4 */
long unsigned int nr_ptes; /* 208 4 */
long unsigned int start_code; /* 212 4 */
long unsigned int end_code; /* 216 4 */
long unsigned int start_data; /* 220 4 */
/* ---------- cacheline 7 boundary ---------- */
long unsigned int end_data; /* 224 4 */
long unsigned int start_brk; /* 228 4 */
long unsigned int brk; /* 232 4 */
long unsigned int start_stack; /* 236 4 */
long unsigned int arg_start; /* 240 4 */
long unsigned int arg_end; /* 244 4 */
long unsigned int env_start; /* 248 4 */
long unsigned int env_end; /* 252 4 */
/* ---------- cacheline 8 boundary ---------- */
long unsigned int saved_auxv[44]; /* 256 176 */
unsigned int dumpable:2; /* 432 4 */
cpumask_t cpu_vm_mask; /* 436 4 */
mm_context_t context; /* 440 68 */
long unsigned int swap_token_time; /* 508 4 */
/* ---------- cacheline 16 boundary ---------- */
char recent_pagein; /* 512 1 */
/* XXX 3 bytes hole, try to pack */
int core_waiters; /* 516 4 */
struct completion * core_startup_done; /* 520 4 */
struct completion core_done; /* 524 52 */
rwlock_t ioctx_list_lock; /* 576 36 */
struct kioctx * ioctx_list; /* 612 4 */
}; /* size: 616, sum members: 613, holes: 1, sum holes: 3, cachelines: 20,
last cacheline: 8 bytes */

After:
[acme@newtoy net-2.6.20]$ pahole --cacheline 32 kernel/sched.o mm_struct
/* include2/asm/processor.h:542 */
struct mm_struct {
struct vm_area_struct * mmap; /* 0 4 */
struct rb_root mm_rb; /* 4 4 */
struct vm_area_struct * mmap_cache; /* 8 4 */
long unsigned int (*get_unmapped_area)(); /* 12 4 */
void (*unmap_area)(); /* 16 4 */
long unsigned int mmap_base; /* 20 4 */
long unsigned int task_size; /* 24 4 */
long unsigned int cached_hole_size; /* 28 4 */
/* ---------- cacheline 1 boundary ---------- */
long unsigned int free_area_cache; /* 32 4 */
pgd_t * pgd; /* 36 4 */
atomic_t mm_users; /* 40 4 */
atomic_t mm_count; /* 44 4 */
int map_count; /* 48 4 */
struct rw_semaphore mmap_sem; /* 52 64 */
spinlock_t page_table_lock; /* 116 40 */
struct list_head mmlist; /* 156 8 */
mm_counter_t _file_rss; /* 164 4 */
mm_counter_t _anon_rss; /* 168 4 */
long unsigned int hiwater_rss; /* 172 4 */
long unsigned int hiwater_vm; /* 176 4 */
long unsigned int total_vm; /* 180 4 */
long unsigned int locked_vm; /* 184 4 */
long unsigned int shared_vm; /* 188 4 */
/* ---------- cacheline 6 boundary ---------- */
long unsigned int exec_vm; /* 192 4 */
long unsigned int stack_vm; /* 196 4 */
long unsigned int reserved_vm; /* 200 4 */
long unsigned int def_flags; /* 204 4 */
long unsigned int nr_ptes; /* 208 4 */
long unsigned int start_code; /* 212 4 */
long unsigned int end_code; /* 216 4 */
long unsigned int start_data; /* 220 4 */
/* ---------- cacheline 7 boundary ---------- */
long unsigned int end_data; /* 224 4 */
long unsigned int start_brk; /* 228 4 */
long unsigned int brk; /* 232 4 */
long unsigned int start_stack; /* 236 4 */
long unsigned int arg_start; /* 240 4 */
long unsigned int arg_end; /* 244 4 */
long unsigned int env_start; /* 248 4 */
long unsigned int env_end; /* 252 4 */
/* ---------- cacheline 8 boundary ---------- */
long unsigned int saved_auxv[44]; /* 256 176 */
cpumask_t cpu_vm_mask; /* 432 4 */
mm_context_t context; /* 436 68 */
long unsigned int swap_token_time; /* 504 4 */
char recent_pagein; /* 508 1 */
unsigned char dumpable:2; /* 509 1 */
/* XXX 2 bytes hole, try to pack */
int core_waiters; /* 512 4 */
struct completion * core_startup_done; /* 516 4 */
struct completion core_done; /* 520 52 */
rwlock_t ioctx_list_lock; /* 572 36 */
struct kioctx * ioctx_list; /* 608 4 */
}; /* size: 612, sum members: 610, holes: 1, sum holes: 2, cachelines: 20,
last cacheline: 4 bytes */

[acme@newtoy net-2.6.20]$ codiff -V /tmp/sched.o.before kernel/sched.o
/pub/scm/linux/kernel/git/acme/net-2.6.20/kernel/sched.c:
struct mm_struct | -4
dumpable:2;
from: unsigned int /* 432(30) 4(2) */
to: unsigned char /* 509(6) 1(2) */
< SNIP other offset changes >
1 struct changed
[acme@newtoy net-2.6.20]$

I'm not aware of any problem about using 2 byte wide bitfields where
previously a 4 byte wide one was, holler if there is any, I wouldn't be
surprised, bitfields are things from hell.

For the curious, 432(30) means: at offset 432 from the struct start, at
offset 30 in the bitfield (yeah, it comes backwards, hellish, huh?) ditto
for 509(6), while 4(2) and 1(2) means "struct field size(bitfield size)".

Now we have a 2 bytes hole and are using only 4 bytes of the last 32
bytes cacheline, any takers? :-)

Signed-off-by: Arnaldo Carvalho de Melo
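
The effect shown in the pahole output above can be reproduced on a much
smaller, made-up pair of structs. This is only a sketch: the structs keep a
few member names from mm_struct but are otherwise hypothetical. Bitfield
layout is implementation-defined and a bitfield of type unsigned char is a
compiler extension (accepted by GCC and Clang), so no concrete sizes are
claimed here.

```c
#include <stddef.h>

/* Simplified "before": the 2-bit field is an unsigned int that sits far
 * away from the lone char member. */
struct mm_before {
	unsigned int dumpable:2;
	unsigned long cpu_vm_mask;
	char recent_pagein;
	int core_waiters;
};

/* Simplified "after": the bitfield is an unsigned char placed right next
 * to the char, so it can use bytes that were padding before. */
struct mm_after {
	unsigned long cpu_vm_mask;
	char recent_pagein;
	unsigned char dumpable:2;	/* non-int bitfield: compiler extension */
	int core_waiters;
};

size_t mm_before_size(void) { return sizeof(struct mm_before); }
size_t mm_after_size(void)  { return sizeof(struct mm_after); }
```

On common ABIs mm_after_size() comes out smaller than mm_before_size(),
which mirrors the 616 to 612 byte change reported above.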
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently we use the lru head link of the second page of a compound page
to hold its destructor. This was ok when it was purely an internal
implementation detail. However, hugetlbfs overrides this destructor,
violating the layering. Abstract this out as explicit calls, and also
introduce a type for the callback function allowing them to be type
checked. For each callback we pre-declare the function, causing a type
error on definition rather than on use elsewhere.

[akpm@osdl.org: cleanups]
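
A minimal user-space sketch of the typed-callback idea described above. The
struct page here is a hypothetical two-field stand-in, and the accessor and
typedef names are assumptions chosen for the example rather than quotations
from the patch.

```c
#include <stddef.h>

struct page;

/* One type for every destructor, so a mismatched callback fails to compile. */
typedef void compound_page_dtor(struct page *);

/* Hypothetical, simplified page: dtor stands in for the borrowed lru link. */
struct page {
	compound_page_dtor *dtor;
	int freed;
};

/* Explicit accessors replace open-coded pokes at the second page. */
static inline void set_compound_page_dtor(struct page *page,
					  compound_page_dtor *dtor)
{
	page[1].dtor = dtor;
}

static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
{
	return page[1].dtor;
}

/* Pre-declaring with the typedef turns a wrong definition into a type error. */
compound_page_dtor free_huge_page;

void free_huge_page(struct page *page)
{
	page->freed = 1;
}
```

If free_huge_page() were later defined with a different signature, the
compiler would reject the definition itself because it conflicts with the
typedef-based declaration.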
Signed-off-by: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently we simply attempt to allocate from all allowed nodes using
GFP_THISNODE. However, GFP_THISNODE does not do reclaim (it won't do any at
all if the recent GFP_THISNODE patch is accepted). If we truly run out of
memory in the whole system then fallback_alloc() may return NULL although
memory may still be available if we would perform more thorough reclaim.

This patch changes fallback_alloc() so that we first only inspect all the
per node queues for available slabs. If we find any then we allocate from
those. This avoids slab fragmentation by first getting rid of all partially
allocated slabs on every node before allocating new memory.

If we cannot satisfy the allocation from any per node queue then we extend
a slab. We now call into the page allocator without specifying
GFP_THISNODE. The page allocator will then implement its own fallback (in
the given cpuset context), perform necessary reclaim (again considering not
a single node but the whole set of allowed nodes) and then return pages for
a new slab.

We identify from which node the pages were allocated and then insert the
pages into the corresponding per node structure. In order to do so we need
to modify cache_grow() to take a parameter that specifies the new slab.
kmem_getpages() can no longer set the GFP_THISNODE flag since we need to be
able to use kmem_getpages() to allocate from an arbitrary node. GFP_THISNODE
needs to be specified when calling cache_grow().

One key advantage is that the decision from which node to allocate new
memory is removed from slab fallback processing. The patch allows us to go
back to using the page allocator's fallback/reclaim logic.

Signed-off-by: Christoph Lameter
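
The two-pass strategy described above can be sketched as a toy model. All
types, fields and helper names below are hypothetical; only the name
fallback_alloc() is borrowed from the message, and the "page allocator" is
reduced to a loop that picks the first allowed node with free pages.

```c
#define MAX_NODES 4	/* hypothetical */

struct node_cache {
	int allowed;	/* node is in the allowed set */
	int partial;	/* objects available on existing slabs */
	int free_pages;	/* pages the page allocator could still hand out */
};

/* Pass 1: take an object from any allowed node's existing queues. */
static int alloc_from_queues(struct node_cache *nodes)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (nodes[nid].allowed && nodes[nid].partial > 0) {
			nodes[nid].partial--;
			return nid;
		}
	}
	return -1;
}

/* Stand-in for the page allocator choosing a node by its own fallback. */
static int page_alloc_any_node(struct node_cache *nodes)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (nodes[nid].allowed && nodes[nid].free_pages > 0) {
			nodes[nid].free_pages--;
			return nid;
		}
	}
	return -1;
}

/* Returns the node the object came from, or -1 on failure. */
int fallback_alloc(struct node_cache *nodes, int objs_per_slab)
{
	int nid = alloc_from_queues(nodes);

	if (nid >= 0)
		return nid;

	/* Pass 2: grow a slab on whichever node the pages came from. */
	nid = page_alloc_any_node(nodes);
	if (nid < 0)
		return -1;
	nodes[nid].partial += objs_per_slab - 1;	/* one object is handed out */
	return nid;
}
```

The slab code no longer decides which node to grow on; it only records where
the pages ended up and files the new slab under that node.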
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node. If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set). If we have over allocated a node
then we will currently do some reclaim before returning NULL. The caller
may want memory from other nodes before reclaim should be triggered. (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set. If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter
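
The "all bits set" comparison described above is a plain mask test. In the
sketch below the flag values are illustrative only (the real values live in
the kernel headers) and is_gfp_thisnode() is a hypothetical helper name.

```c
/* Illustrative flag values only; not the kernel's actual bit assignments. */
#define __GFP_NORETRY	0x1u
#define __GFP_NOWARN	0x2u
#define __GFP_THISNODE	0x4u
#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

/* True only when every bit of the composite GFP_THISNODE mask is set. */
int is_gfp_thisnode(unsigned int gfp_mask)
{
	return (gfp_mask & GFP_THISNODE) == GFP_THISNODE;
}
```

Comparing against the whole mask, rather than testing for any one bit, is
what lets a caller that passes only __GFP_THISNODE still get reclaim.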
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This addresses two issues:

1. kmalloc_node() may intermittently return NULL if we are allocating
from the current node and are unable to obtain memory for the current
node from the page allocator. This is because we call ____cache_alloc()
if nodeid == numa_node_id() and ____cache_alloc() is not able to fall back
to other nodes.

This was introduced in the 2.6.19 development cycle.
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_DMA is an alias of GFP_DMA. This is the last one so we
remove the leftover comment too.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_KERNEL is an alias of GFP_KERNEL.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_ATOMIC is an alias of GFP_ATOMIC
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_USER is an alias of GFP_USER
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_NOFS is an alias of GFP_NOFS.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_NOIO is an alias of GFP_NOIO with a single instance of use.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
SLAB_LEVEL_MASK is only used internally to the slab and is
an alias of GFP_LEVEL_MASK.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
It is only used internally in the slab.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
David Binderman and his Intel C compiler rightly observe that
install_file_pte no longer has any use for its pte_val.

Signed-off-by: Hugh Dickins
Cc: d binderman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
These patches introduced new switch statements which are indented contrary
to the consensus in mm/*.c. Fix them up to match that consensus.

[PATCH] node local per-cpu-pages
[PATCH] ZVC: Scale thresholds depending on the size of the system
commit e7c8d5c9955a4d2e88e36b640563f5d6d5aba48a
commit df9ecaba3f152d1ea79f2a5e0b87505e03f47590

Signed-off-by: Andy Whitcroft
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The fsfuzzer found this; with a corrupt small swapfile that claims to have
many pages:

[root]# file swap.741.img
swap.741.img: Linux/i386 swap file (new style) 1 (4K pages) size 1040191487 pages
[root]# ls -l swap.741.img
-rw-r--r-- 1 root root 16777216 Nov 22 05:18 swap.741.img

sys_swapon() will try to vmalloc all those pages, and -then- check to see if
the file is actually that large:

	if (!(p->swap_map = vmalloc(maxpages * sizeof(short)))) {
	if (swapfilesize && maxpages > swapfilesize) {
		printk(KERN_WARNING
			"Swap area shorter than signature indicates\n");

It seems to me that it would make more sense to move this test up before
the vmalloc, with the other checks, to avoid the OOM-killer in this
situation...

Signed-off-by: Eric Sandeen
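
A user-space sketch of the reordering proposed above: validate the size the
header claims against the real file size before allocating anything. The
function alloc_swap_map() and its parameters are hypothetical, and calloc()
stands in for vmalloc().

```c
#include <stdlib.h>

/*
 * Hypothetical helper: maxpages is what the swap header claims,
 * swapfilesize is how many pages the file really holds (0 = unknown).
 */
unsigned short *alloc_swap_map(unsigned long maxpages,
			       unsigned long swapfilesize)
{
	/* Cheap sanity check first, so a bogus header never reaches the
	 * allocator. */
	if (swapfilesize && maxpages > swapfilesize)
		return NULL;
	return calloc(maxpages, sizeof(unsigned short));
}
```

With the numbers from the report, a 16777216-byte file holds 4096 pages of
4K, so a claim of 1040191487 pages is rejected before any memory is
requested.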
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
x86 NUMA systems only define bootmem for node 0. alloc_bootmem_node() and
friends therefore ignore the passed pgdat and use NODE_DATA(0) in all
cases. This leads to the following warnings as we are not using the passed
parameter:

.../mm/page_alloc.c: In function 'zone_wait_table_init':
.../mm/page_alloc.c:2259: warning: unused variable 'pgdat'

One option would be to define all variables used with these macros
__attribute__ ((unused)), but this would leave us exposed should these
become genuinely unused.

The key here is that we _are_ using the value, we ignore it but that is a
deliberate action. This patch adds a nested local variable within the
alloc_bootmem_node helper to which the pgdat parameter is assigned making
it 'used'. The nested local is marked __attribute__ ((unused)) to silence
this same warning for it.

Signed-off-by: Andy Whitcroft
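
The nested-local technique described above might look roughly like this in a
stand-alone program. Everything except the macro name alloc_bootmem_node and
the pgdat parameter is a hypothetical stand-in, and the ({ ... }) form is a
GNU C statement expression, so the sketch assumes GCC or Clang.

```c
/* Hypothetical stand-ins for the kernel types and helpers. */
struct pglist_data { int node_id; };

static struct pglist_data node0 = { 0 };

static int bootmem_alloc_on(struct pglist_data *pgdat, int size)
{
	return pgdat->node_id * 1000 + size;	/* encodes node and size */
}

/*
 * The caller's pgdat is assigned to a nested local, which counts as a use.
 * The local itself is marked unused so that it does not warn in turn.
 * The allocation always goes to node 0, as on x86 NUMA in the message.
 */
#define alloc_bootmem_node(pgdat, size)					\
({									\
	struct pglist_data __attribute__((unused)) *ignored = (pgdat);	\
	bootmem_alloc_on(&node0, (size));				\
})

int zone_table_alloc(int size)
{
	struct pglist_data other = { 3 };
	struct pglist_data *pgdat = &other;	/* no "unused variable" warning */

	return alloc_bootmem_node(pgdat, size);
}
```

If a caller's variable later becomes unused for real, for example because the
macro call is removed, the compiler still reports it, which is the exposure
the blanket attribute would have hidden.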
Cc: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
NUMA node ids are passed as either int or unsigned int almost exclusively,
yet page_to_nid and zone_to_nid both return unsigned long. This is a
throwback to when page_to_nid was a #define and was thus exposing the real
type of the page flags field.

In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:

1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
uses bit_set which in generic code takes an int.

Signed-off-by: Andy Whitcroft
Cc: Christoph Lameter
Cc: "Luck, Tony"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
drain_node_pages() currently drains the complete pageset of all pages. If
there are a large number of pages in the queues then we may hold off
interrupts for too long.

Duplicate the method used in free_hot_cold_page. Only drain pcp->batch
pages at one time.

Signed-off-by: Christoph Lameter
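
The batching rule described above amounts to capping each drain step at
pcp->batch. The struct and the function drain_some_pages() below are
hypothetical simplifications; the real code frees pages with interrupts
disabled, which this sketch only mentions in a comment.

```c
/* Hypothetical, simplified per-cpu page list. */
struct per_cpu_pages {
	int count;	/* pages currently queued */
	int batch;	/* most pages to free in one go */
};

/*
 * Free at most pcp->batch pages and return how many were freed. In the
 * kernel the freeing would run with interrupts off; capping the amount of
 * work per call bounds that window.
 */
int drain_some_pages(struct per_cpu_pages *pcp)
{
	int to_drain = pcp->count < pcp->batch ? pcp->count : pcp->batch;

	pcp->count -= to_drain;
	return to_drain;
}
```

A caller that wants the queue empty simply calls this repeatedly until it
returns 0, letting interrupts be serviced between calls.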
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Remove all uses of kmem_cache_t (most were left in slab.h). The
typedef for kmem_cache_t is then only necessary for other kernel
subsystems. Add a comment to that effect.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The names_cachep is used for getname() and putname(). So let's put it into
fs.h near those two definitions.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
fs_cachep is only used in kernel/exit.c and in kernel/fork.c.
It is used to store fs_struct items so it should be placed in linux/fs_struct.h.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
filp_cachep is only used in fs/file_table.c and in fs/dcache.c where
it is defined.

Move it to related definitions in linux/file.h.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The proper place is in file.h since files_cachep uses are related to file I/O.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
vm_area_cachep is used to store vm_area_structs. So move to mm.h.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Move sighand_cachep definition to linux/signal.h
The sighand cache is only used in fs/exec.c and kernel/fork.c. It is defined
in kernel/fork.c but only used in fs/exec.c.

The sighand_cachep is related to signal processing. So add the definition to
signal.h.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Remove bio_cachep from slab.h - it no longer exists.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch makes the needlessly global "global_faults" static.
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds