20 Oct, 2008

40 commits

  • Add temperature sensor support for Macbook Pro 3.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • Adds temperature sensor support for the Macbook Pro 4.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • dmi_system_id.driver_data is already void*.

    Cc: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch adds accelerometer, backlight and temperature sensor support
    for the Macbook Air.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • On some recent Macbooks, the package length for the light sensors ALV0 and
    ALV1 has changed from 6 to 10. This patch allows for a variable package
    length encompassing both variants.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • The time to wait for a status change while reading or writing to the SMC
    ports is a balance between read reliability and system performance. The
    current setting yields roughly three errors in a thousand when
    simultaneously reading three different temperature values on a Macbook
    Air. This patch increases the setting to a value yielding roughly one
    error in ten thousand, with no noticeable system performance degradation.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • On many Macbooks since mid 2007, the Pro, C2D and Air models, applesmc
    fails to read some or all SMC ports. This problem has various effects,
    such as flooded logfiles, malfunctioning temperature sensors,
    accelerometers failing to initialize, and difficulties getting backlight
    functionality to work properly.

    The root of the problem seems to be the command protocol. The current
    code sends out a command byte, then repeatedly polls for an ack before
    continuing to send or receive data. From experiments leading to this
    patch, it seems the command protocol never quite worked or changed so that
    one now sends a command byte, waits a little bit, polls for an ack, and if
    it fails, repeats the whole thing by sending the command byte again.

    This patch implements a send_command function according to the new
    interpretation of the protocol, and should work also for earlier models.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • At one single place in the code, the specified number of bytes to read and
    the actual number of bytes read differ by one. This one-liner patch fixes
    that inconsistency.

    Signed-off-by: Henrik Rydberg
    Cc: Nicolas Boichat
    Cc: Riki Oktarianto
    Cc: Mark M. Hoffman
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Rydberg
     
  • Adds therm-min/max/crit-alarm callbacks, sensor-device-attribute
    declarations, and refs to those new decls in the macro used to initialize
    the therm_group (of sysfs files).

    The thermistors use voltage channels to measure; so they don't have a
    fault-alarm, but unlike the other voltages, they do have an overtemp,
    which we call crit (by convention).

    [akpm@linux-foundation.org: cleanup]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
    Temp and vin status register values may be set by chip specifications,
    set again by the BIOS, or by a previous load of this driver. The debug
    output nicely displays the modprobe init=\d actions.

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • The driver handles 3 logical devices in a fixed-length array. Give this
    a define-d constant.

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Adds temp-min/max/crit/fault-alarm callbacks, sensor-device-attribute
    declarations, and refs to those new decls in the macro used to initialize
    the temp_group (of sysfs files).

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Adds vin-min/max-alarm callbacks, sensor-device-attribute declarations,
    and refs to those new decls in the macro used to initialize the vin_group
    (of sysfs files).

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Bring hwmon/pc87360 into agreement with
    Documentation/hwmon/sysfs-interface.

    The patchset adds separate limit alarms for voltages and temps; it also
    adds temp[123]_fault files. On my Soekris, temps 1 and 2 are
    unused/unconnected, so temp[123]_fault reads 1,1,0 respectively. This
    agrees with /usr/bin/sensors, which has always shown them as OPEN.
    Temps 4,5,6 are thermistor based and don't have a fault bit in their
    status register.

    This patch:

    2 different kinds of constants added:
    - CHAN_ALM_* constants for (later) vin, temp alarm callbacks.
    - CHAN_* conversion constants, used in _init_device, partly for RW1C bits

    Signed-off-by: Jim Cromie
    Cc: Jean Delvare
    Cc: "Mark M. Hoffman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Cromie
     
  • Fix a typo and replace an incorrect word in the comment.

    Signed-off-by: Ameya Palande
    Cc: "Ashok Raj"
    Cc: "Shaohua Li"
    Cc: "Anil S Keshavamurthy"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ameya Palande
     
  • These comments are useless, remove them.

    Signed-off-by: WANG Cong
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • On my HP 2510, pressing the (i) button generates an unknown keycode:
    0x213b. So here is a patch adding support for it. However, as it seems
    there is already support for a similar button connected to 0x231b as
    keycode, I wonder if it could be a typo in the driver?

    Signed-off-by: Eric Piel
    Cc: Matthew Garrett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Piel
     
  • I recently fell into the trap of assuming this dumps all timers, when
    in fact it only dumps hrtimers. Fix the documentation.

    Signed-off-by: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Fix

    arch/um/sys-i386/signal.c: In function 'copy_sc_from_user':
    arch/um/sys-i386/signal.c:182: warning: dereferencing 'void *' pointer
    arch/um/sys-i386/signal.c:182: error: request for member '_fxsr_env' in something not a structure or union

    Signed-off-by: WANG Cong
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Cong
     
  • Remove a duplicated include file in arch/m68k/bvme6000/rtc.c.

    Signed-off-by: Huang Weiyi
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Weiyi
     
  • Describe why we need the freezer subsystem and how to use it in a
    documentation file. Since the cgroups.txt file is focused on the
    subsystem-agnostic portions of cgroups make a directory and move the old
    cgroups.txt file at the same time.

    Signed-off-by: Matt Helsley
    Cc: Paul Menage
    Cc: containers@lists.linux-foundation.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • check_if_frozen() sounds like it should return something when in fact it's
    just updating the freezer state.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Rename cgroup freezer states to be less generic to avoid any name
    collisions while also better describing what each state is.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Don't let frozen tasks or cgroups change. This means frozen tasks can't
    leave their current cgroup for another cgroup. It also means that tasks
    cannot be added to or removed from a cgroup in the FROZEN state. We
    enforce these rules by checking for frozen tasks and cgroups in the
    can_attach() function.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • When a system is resumed after a suspend, it will also unfreeze frozen
    cgroups.

    This patch modifies the resume sequence to skip tasks which are part
    of a frozen control group.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • This patch implements a new freezer subsystem in the control groups
    framework. It provides a way to stop and resume execution of all tasks in
    a cgroup by writing in the cgroup filesystem.

    The freezer subsystem in the container filesystem defines a file named
    freezer.state. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup. Reading will return the current state.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This is the basic mechanism, which should do the right thing for
    user-space tasks in a simple scenario.

    It's important to note that freezing can be incomplete. In that case we
    return EBUSY. This means that some tasks in the cgroup are busy doing
    something that prevents us from completely freezing the cgroup at this
    time. After EBUSY, the cgroup will remain partially frozen -- reflected
    by freezer.state reporting "FREEZING" when read. The state will remain
    "FREEZING" until one of these things happens:

    1) Userspace cancels the freezing operation by writing "RUNNING" to
       the freezer.state file
    2) Userspace retries the freezing operation by writing "FROZEN" to
       the freezer.state file (writing "FREEZING" is not legal and
       returns EIO)
    3) The tasks that blocked the cgroup from entering the "FROZEN"
       state disappear from the cgroup's set of tasks.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: export thaw_process]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Now that the TIF_FREEZE flag is available in all architectures, extract
    the refrigerator() and freeze_task() from kernel/power/process.c and make
    it available to all.

    The refrigerator() can now be used in a control group subsystem
    implementing a control group freezer.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Serge E. Hallyn
    Tested-by: Matt Helsley
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • This patch series introduces a cgroup subsystem that utilizes the swsusp
    freezer to freeze a group of tasks. It's immediately useful for batch job
    management scripts. It should also be useful in the future for
    implementing container checkpoint/restart.

    The freezer subsystem in the container filesystem defines a cgroup file
    named freezer.state. Reading freezer.state will return the current state
    of the cgroup. Writing "FROZEN" to the state file will freeze all tasks
    in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
    the cgroup.

    * Examples of usage :

    # mkdir /containers/freezer
    # mount -t cgroup -ofreezer freezer /containers
    # mkdir /containers/0
    # echo $some_pid > /containers/0/tasks

    to get status of the freezer subsystem :

    # cat /containers/0/freezer.state
    RUNNING

    to freeze all tasks in the container :

    # echo FROZEN > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    FREEZING
    # cat /containers/0/freezer.state
    FROZEN

    to unfreeze all tasks in the container :

    # echo RUNNING > /containers/0/freezer.state
    # cat /containers/0/freezer.state
    RUNNING

    This patch:

    The first step in making the refrigerator() available to all
    architectures, even for those without power management.

    The purpose of such a change is to be able to use the refrigerator() in a
    new control group subsystem which will implement a control group freezer.

    [akpm@linux-foundation.org: fix sparc]
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Matt Helsley
    Acked-by: Pavel Machek
    Acked-by: Serge E. Hallyn
    Acked-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Tested-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • To prepare the chunking, move the sys_move_pages() code that is used when
    nodes!=NULL into do_pages_move(). And rename do_move_pages() into
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not need any page_to_node entry for real. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers since the lookup
    in new_page_node() is now limited to a single chunk, causing the quadratic
    complexity to have a much slower impact. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9 GHz),
    migrating between nodes #2 and #3:

    length    move_pages (us)    move_pages+patch (us)
      4kB                 126                       98
     40kB                 198                      168
    400kB                 963                      937
      4MB               12503                    11930
     40MB              246867                    11848

    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page_sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only 1 page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.

    It will make the upcoming chunked-move_pages() support much easier.

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • During hotplug memory remove, memory regions should be released in
    PAGES_PER_SECTION-sized chunks. This mirrors the code in add_memory,
    where resources are requested in PAGES_PER_SECTION-sized chunks.

    Attempting to release the entire memory region fails because there is
    not a single resource covering the total number of pages being removed.
    Instead, the resources for the pages are split into
    PAGES_PER_SECTION-sized chunks, as requested during memory add.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • The current documentation of dirty_ratio and dirty_background_ratio is a
    bit misleading.

    In the documentation we say that they are "a percentage of total system
    memory", but the current page writeback policy, instead, is to apply the
    percentages to the dirtyable memory, that means free pages + reclaimable
    pages.

    Better to be more explicit to clarify this concept.

    Signed-off-by: Andrea Righi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • This attribute just has a write operation.

    [akpm@linux-foundation.org: use S_IWUSR as suggested by Randy]
    Signed-off-by: Shaohua Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
    There seems to be no need for the lru_lock anymore, but there is a need for
    zone->lock instead, because that function may call move_freepages() via
    setup_zone_migrate_reserve().

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Presently hugepages don't use the zero page at all, because the zero
    page is only used for coredumping and hugepages can't be core dumped.

    However, we have now implemented hugepage coredumping, so we should
    implement the zero page for hugepages too.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?

      Normal pages are checked as follows:

      static inline int use_zero_page(struct vm_area_struct *vma)
      {
              if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                      return 0;

              return !vma->vm_ops || !vma->vm_ops->fault;
      }

      First, hugepages are never mlock()ed, so we aren't concerned with
      VM_LOCKED.

      Second, hugetlbfs is a pseudo filesystem, not a real filesystem, and
      it doesn't have any file backing, so the ops->fault check is
      meaningless.

    o Why don't we use the zero page if !pte?

      !pte indicates that the {pud, pmd} doesn't exist or that some error
      happened, so we shouldn't return the zero page if any error occurred.

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently a hugepage vma has the VM_RESERVED flag set so that it is not
    swapped. But a VM_RESERVED vma isn't core dumped, because this flag is
    often used for kernel vmas (e.g. vmalloc, sound related).

    Thus hugepages are never dumped and can't be debugged easily. Many
    developers want hugepages to be included in core dumps.

    However, we can't read a generic VM_RESERVED area, because such an area
    is often an IO mapping area; reading it may change device state, which
    is a definitely undesirable side effect.

    So adding a hugepage-specific bit to the coredump filter is better. It
    enables hugepage core dumping without causing side effects on any i/o
    devices.

    In addition, libhugetlbfs uses hugetlb private mapping pages as
    anonymous pages, so hugepage private mapping pages should be core
    dumped by default.

    Thus /proc/[pid]/coredump_filter gets two new bits:

    - bit 5 means hugetlb private mapping pages are dumped or not (default: yes)
    - bit 6 means hugetlb shared mapping pages are dumped or not (default: no)

    I tested with the following method.

    % ulimit -c unlimited
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core
    %
    % echo 0x43 > /proc/self/coredump_filter
    % ./crash_hugepage 50
    % ./crash_hugepage 50 -p
    % ls -lh
    % gdb ./crash_hugepage core

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #include "hugetlbfs.h"

    int main(int argc, char **argv)
    {
        char *p;
        int ch;
        int mmap_flags = MAP_SHARED;
        int fd;
        int nr_pages;

        while ((ch = getopt(argc, argv, "p")) != -1) {
            switch (ch) {
            case 'p':
                mmap_flags &= ~MAP_SHARED;
                mmap_flags |= MAP_PRIVATE;
                break;
            default:
                /* nothing */
                break;
            }
        }
        argc -= optind;
        argv += optind;

        if (argc == 0) {
            printf("need # of pages\n");
            exit(1);
        }

        nr_pages = atoi(argv[0]);
        if (nr_pages < 2) {
            printf("nr_pages must >2\n");
            exit(1);
        }

        fd = hugetlbfs_unlinked_fd();
        p = mmap(NULL, nr_pages * gethugepagesize(),
                 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);

        sleep(2);

        *(p + gethugepagesize()) = 1;   /* COW */
        sleep(2);

        /* crash! */
        *(int *)0 = 1;

        return 0;
    }

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Kawai Hidehiro
    Cc: Hugh Dickins
    Cc: William Irwin
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Improve debuggability of memory setup problems.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping then touching then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
          1      14700            2900
          2      33600            3000
          4      49500            2800
          8      70631            2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin