Eric Lee / smarc-fsl-linux-kernel

15 Nov, 2007

40 commits

91ad997a3 file capabilities: allow sigcont within session

Fix http://bugzilla.kernel.org/show_bug.cgi?id=9247

Allow sigcont to be sent to a process with greater capabilities if it is in
the same session. Otherwise, a shell from which I've started a root shell
and done 'suspend' can't be restarted by the parent shell.

Also don't do file-capabilities signaling checks when uids for the
processes don't match, since the standard check_kill_permission will have
done those checks.

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Serge E. Hallyn
Acked-by: Andrew Morgan
Cc: Chris Wright
Tested-by: "Theodore Ts'o"
Cc: Stephen Smalley
Cc: "Rafael J. Wysocki"
Cc: Chris Wright
Cc: James Morris
Cc: Stephen Smalley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Serge E. Hallyn
2007-11-15 10:45:44 +0800
20a1022d4 Swap delay accounting, include lock_page() delays

The delay incurred in lock_page() should also be accounted in swap delay
accounting.

Reported-by: Nick Piggin
Signed-off-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Balbir Singh
2007-11-15 10:45:44 +0800
9c8d6381d uml: fix build for !CONFIG_PRINTK

Handle the case of CONFIG_PRINTK being disabled. This requires a do-nothing
stub to be present in arch/um/include/user.h so that we don't get references
to printk from libc code.

Signed-off-by: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Dike
2007-11-15 10:45:43 +0800
32f862c31 uml: fix build for !CONFIG_TCP

Make UML build in the absence of CONFIG_INET by making the inetaddr_notifier
registration depend on it.

Signed-off-by: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Dike
2007-11-15 10:45:43 +0800
ee1eca5d2 uml: remove last include of libc asm/page.h

asm/page.h is disappearing from the libc headers and we don't need it anyway.

Signed-off-by: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Dike
2007-11-15 10:45:43 +0800
9ac625a39 uml: fix spurious IRQ testing

The spurious IRQ testing in request_irq is mishandled in um_request_irq, which
sets the incoming file descriptors non-blocking only after request_irq
succeeds. This results in the spurious irq calling read on a blocking
descriptor, and a hang.

Fixed by reversing the O_NONBLOCK setting and the request_irq call.
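
The ordering described above can be sketched in userspace as follows. This
is only an illustration, not the UML code: register_nonblocking() and
spurious_handler() are hypothetical names, and the "registration" simply
calls the handler once, the way a spurious-IRQ test might.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical handler that may be invoked spuriously. It reads one byte
 * and reports 0 if no data was available, 1 if a byte was read, -1 on error.
 */
static int spurious_handler(int fd)
{
	char c;
	ssize_t n = read(fd, &c, 1);

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;	/* nothing to read, but we returned immediately */
	return n < 0 ? -1 : 1;
}

/* Set O_NONBLOCK first, and only then "register" the handler. */
static int register_nonblocking(int fd, int (*handler)(int))
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;
	/* Stand-in for registration: the handler may run right away. */
	return handler(fd);
}
```

With the two steps in the opposite order, the handler's read() on an empty
blocking descriptor would not return, which matches the hang described above.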

Signed-off-by: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Dike
2007-11-15 10:45:43 +0800
7c06a8dc6 Fix 64KB blocksize in ext3 directories

With 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk. The patch also
converts some places to use ext3_next_entry() when we are changing them anyway.
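
A minimal sketch of such a conversion, in plain userspace C. The names
MAX_REC_LEN, rec_len_to_disk() and rec_len_from_disk() are hypothetical, the
byte-order handling a real on-disk field needs is left out, and the sketch
assumes 0xffff never occurs as a genuine entry length.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REC_LEN 0xffffu	/* hypothetical name for the on-disk sentinel */

/* Encode an in-memory entry length (at most 64KB) into a 16-bit field. */
static uint16_t rec_len_to_disk(uint32_t len)
{
	assert(len <= (1u << 16));	/* larger lengths cannot be represented */
	if (len == (1u << 16))
		return MAX_REC_LEN;	/* 64KB is stored as 0xffff */
	return (uint16_t)len;
}

/* Decode the 16-bit field back into the in-memory length. */
static uint32_t rec_len_from_disk(uint16_t dlen)
{
	if (dlen == MAX_REC_LEN)
		return 1u << 16;	/* 0xffff means a full 64KB entry */
	return dlen;
}
```

Every other value passes through unchanged, so only the single 64KB case
needs special handling on read and write.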

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2007-11-15 10:45:43 +0800
dbaf4c024 smbfs: fix debug builds

Fix some warnings with SMBFS_DEBUG_* builds. This patch makes it so that
builds with -Werror don't fail.

Signed-off-by: Jeff Layton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jeff Layton
2007-11-15 10:45:43 +0800
60a0d2338 hibernate: fix lockdep report

Lockdep reports a circular locking dependency in the hibernate code
because
 - during system boot hibernate code (from an initcall) locks pm_mutex
   and then a sysfs buffer mutex via name_to_dev_t
 - during regular operation hibernate code locks pm_mutex under a
   sysfs buffer mutex because it's called from sysfs methods.

The deadlock can never happen because during initcall invocation nothing
can write to sysfs yet. This removes the lockdep report by marking the
initcall locking as being in a different class.

Signed-off-by: Johannes Berg
Cc: "Rafael J. Wysocki"
Cc: Alan Stern
Acked-by: Peter Zijlstra
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Berg
2007-11-15 10:45:43 +0800
c642b8391 __do_IRQ does not check IRQ_DISABLED when IRQ_PER_CPU is set

In __do_IRQ(), the normal case is that IRQ_DISABLED is checked and if set
the handler (handle_IRQ_event()) is not called.

Earlier in __do_IRQ(), if IRQ_PER_CPU is set the code does not check
IRQ_DISABLED and calls the handler even though IRQ_DISABLED is set. This
behavior seems unintentional.

One user encountering this behavior is the CPE handler (in
arch/ia64/kernel/mca.c). When the CPE handler encounters too many CPEs
(such as a solid single bit error), it sets up a polling timer and disables
the CPE interrupt (to avoid excessive overhead logging the stream of single
bit errors). disable_irq_nosync() is called which sets IRQ_DISABLED. The
IRQ_PER_CPU flag was previously set (in ia64_mca_late_init()). The net
result is the CPE handler gets called even though it is marked disabled.

If the behavior of not checking IRQ_DISABLED when IRQ_PER_CPU is set is
intentional, it would be worthy of a comment describing the intended
behavior. disable_irq_nosync() does call chip->disable() to provide a
chipset-specific interface for disabling the interrupt, which avoids this
issue when used.
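
As a rough model of the behavior discussed above, the sketch below shows a
dispatch routine in which the per-CPU path also honors the disabled flag.
It is not the kernel's __do_IRQ(): the structure, the flag values and the
dispatch() function are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical flag values; only the names echo the kernel's flags. */
#define IRQ_DISABLED 0x1u
#define IRQ_PER_CPU  0x2u

struct irq_desc_sketch {
	unsigned int status;
	int handled;		/* counts how often the handler ran */
};

/* Returns true if the handler was called. */
static bool dispatch(struct irq_desc_sketch *d)
{
	if (d->status & IRQ_PER_CPU) {
		/* The check argued for above: skip a disabled per-CPU IRQ. */
		if (d->status & IRQ_DISABLED)
			return false;
		d->handled++;
		return true;
	}
	if (d->status & IRQ_DISABLED)
		return false;
	d->handled++;
	return true;
}
```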

Signed-off-by: Russ Anderson
Cc: "Luck, Tony"
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Bjorn Helgaas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Russ Anderson
2007-11-15 10:45:43 +0800
57d5f66b8 pidns: Place under CONFIG_EXPERIMENTAL

This is my trivial patch to swat innumerable little bugs with a single
blow.

After some intensive review (my apologies for not having gotten to this
sooner) what we have looks like a good base to build on with the current
pid namespace code but it is not complete, and it is still much too simple
to find issues where the kernel does the wrong thing outside of the initial
pid namespace.

Until the dust settles and we are certain we have the ABI and the
implementation is as correct as humanly possible let's keep process ID
namespaces behind CONFIG_EXPERIMENTAL.

This gives us the option of fixing any ABI or other bugs we find as long as
they are minor.

It also lets users of the kernel avoid those bugs simply by ensuring their
kernel does not have support for multiple pid namespaces.

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Eric W. Biederman
Cc: Cedric Le Goater
Cc: Adrian Bunk
Cc: Jeremy Fitzhardinge
Cc: Kir Kolyshkin
Cc: Kirill Korotaev
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-11-15 10:45:43 +0800
42614fcde vmstat: fix section mismatch warning

Mark start_cpu_timer() as __cpuinit instead of __devinit.
Fixes this section warning:

WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show')

Signed-off-by: Randy Dunlap
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2007-11-15 10:45:42 +0800
579d6d93c gbefb: fix section mismatch warnings

Make 'default_mode' and 'default_var' be __initdata.
Fixes these section warnings:

WARNING: vmlinux.o(.data+0x128e0): Section mismatch: reference to .init.data:default_mode_CRT (between 'default_mode' and 'default_var')
WARNING: vmlinux.o(.data+0x128e4): Section mismatch: reference to .init.data:default_var_CRT (between 'default_var' and 'dev_attr_size')

Signed-off-by: Randy Dunlap
Cc: "Antonino A. Daplas"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2007-11-15 10:45:42 +0800
cb51f973b mark sys_open/sys_read exports unused

sys_open / sys_read were used in the early 1.2 days to load firmware from
disk inside drivers. Since 2.0 or so this has been deprecated behavior, but
several drivers were still using it. For a few years now we have had a
request_firmware() API that implements this in a nice, consistent way.
Only some old ISA sound drivers (pre-ALSA) still straggled along for some
time.... however with commit c2b1239a9f22f19c53543b460b24507d0e21ea0c the
last user is now gone.

This is a good thing, since using sys_open / sys_read etc for firmware is
buggy at best and dangerous at worst; these operations put an fd in the
process file descriptor table.... which then can be tampered with from
other threads for example. For those who don't want the firmware loader,
filp_open()/vfs_read are the better APIs to use, without this security
issue.

The patch below marks sys_open and sys_read unused now that they're
really not used anymore, and for deletion in the 2.6.25 timeframe.

Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arjan van de Ven
2007-11-15 10:45:42 +0800
22800a283 fix param_sysfs_builtin name length check

Commit faf8c714f4508207a9c81cc94dafc76ed6680b44 caused a regression:
parameter names longer than MAX_KBUILD_MODNAME will now be rejected,
although we just need to keep the module name part that short. This patch
restores the old behaviour while still avoiding that memchr is called with
its length parameter larger than the total string length.
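
One way to express such a bounded search is sketched below. It is not the
kernel code: MAX_MODNAME is a small hypothetical limit standing in for
MAX_KBUILD_MODNAME, and modname_len() is an illustrative helper.

```c
#include <stddef.h>
#include <string.h>

#define MAX_MODNAME 8	/* hypothetical limit, small for illustration */

/*
 * Return the length of the module-name prefix before the first '.',
 * or 0 if no dot is found within the first MAX_MODNAME characters.
 * The full parameter name may be longer than MAX_MODNAME.
 */
static size_t modname_len(const char *param)
{
	size_t len = strlen(param);
	/* Never let memchr() scan past the end of the string. */
	size_t scan = len < MAX_MODNAME ? len : MAX_MODNAME;
	const char *dot = memchr(param, '.', scan);

	return dot ? (size_t)(dot - param) : 0;
}
```

Only the prefix before the dot is held to the limit, so a long parameter
name after a short module name is still accepted.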

Signed-off-by: Jan Kiszka
Cc: Dave Young
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kiszka
2007-11-15 10:45:42 +0800
9fcc2d15b proc: simplify and correct proc_flush_task

Currently we special case when we have only the initial pid namespace.
Unfortunately in doing so the copied case for the other namespaces was
broken so we don't properly flush the thread directories :(

So this patch removes the unnecessary special case (removing a usage of
proc_mnt) and corrects the flushing of the thread directories.

Signed-off-by: Eric W. Biederman
Cc: Al Viro
Cc: Pavel Emelyanov
Cc: Sukadev Bhattiprolu
Cc: Kirill Korotaev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-11-15 10:45:42 +0800
c0f2a9d75 mips: undo locking on error path returns

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Roel Kluin
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2007-11-15 10:45:42 +0800
5c6ff79d0 cris gpio: undo locks before returning

Signed-off-by: Roel Kluin
Cc: Mikael Starvik
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2007-11-15 10:45:42 +0800
5d0360ee9 rd: fix data corruption on memory pressure

We have seen ramdisk based install systems, where some pages of mapped
libraries and programs were suddenly zeroed under memory pressure. This
should not happen, as the ramdisk avoids freeing its pages by keeping them
dirty all the time.

It turns out that there is a case, where the VM makes a ramdisk page clean,
without telling the ramdisk driver. On memory pressure shrink_zone runs
and it starts to run shrink_active_list. There is a check for
buffer_heads_over_limit, and if true, pagevec_strip is called.
pagevec_strip calls try_to_release_page. If the mapping has no releasepage
callback, try_to_free_buffers is called. try_to_free_buffers now has
special logic for some file systems to make a dirty page clean, if all
buffers are clean. That's what happened in our test case.

The simplest solution is to provide a noop-releasepage callback for the
ramdisk driver. This avoids try_to_free_buffers for ramdisk pages.

Signed-off-by: Christian Borntraeger
Acked-by: Nick Piggin
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christian Borntraeger
2007-11-15 10:45:42 +0800
822bd5aa2 tle62x0 driver stops ignoring read errors

The tle62x0 driver was ignoring all read errors. This patch makes it
pass such errors up the stack, instead of returning bogus data.

Signed-off-by: David Brownell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Brownell
2007-11-15 10:45:42 +0800
8744969a8 fuse_file_alloc(): fix NULL dereferences

Fix obvious NULL dereferences spotted by the Coverity checker.

Signed-off-by: Adrian Bunk
Acked-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2007-11-15 10:45:42 +0800
be21f0ab0 fix mm/util.c:krealloc()

Commit ef8b4520bd9f8294ffce9abd6158085bde5dc902 added one NULL check for
"p" in krealloc(), but that doesn't seem to be enough since there
doesn't seem to be any guarantee that memcpy(ret, NULL, 0) works
(spotted by the Coverity checker).

For making it clearer what happens this patch also removes the pointless
min().
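
The shape of such a function can be sketched in userspace as below. This is
an illustration under assumptions, not the kernel's krealloc(): grow_buffer()
is a hypothetical name, malloc()/free() stand in for the kernel allocator,
and the old size is passed in explicitly because userspace has no ksize().

```c
#include <stdlib.h>
#include <string.h>

/* Resize a buffer; old_size is ignored when p is NULL. */
static void *grow_buffer(void *p, size_t old_size, size_t new_size)
{
	void *ret;

	if (new_size == 0) {
		free(p);
		return NULL;
	}
	if (!p)
		old_size = 0;
	if (old_size >= new_size)
		return p;	/* existing buffer is already large enough */

	ret = malloc(new_size);
	if (ret && p) {
		/* p is non-NULL here, and old_size < new_size, so no min(). */
		memcpy(ret, p, old_size);
		free(p);
	}
	return ret;
}
```

Guarding the copy with a check on p means memcpy() is never handed a NULL
source, even for a zero-length copy.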

Signed-off-by: Adrian Bunk
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2007-11-15 10:45:41 +0800
d5cd97872 sunrpc/xprtrdma/transport.c: fix use-after-free

Fix an obvious use-after-free spotted by the Coverity checker.

Signed-off-by: Adrian Bunk
Cc: Trond Myklebust
Cc: "J. Bruce Fields"
Cc: Neil Brown
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2007-11-15 10:45:41 +0800
e02f5f52c serial: only use PNP IRQ if it's valid

"Luming Yu" says:

There is a "ttyS1 irq is -1" problem observed on tiger4 which cause the
serial port broken.

It is because that there is __no__ ACPI IRQ resource assigned for the
serial port. So the value of the IRQ for the port is never changed since it
got initialized to -1.

If PNP supplies a valid IRQ, use it. Otherwise, leave port.irq == 0, which
means "no IRQ" to the serial core.

Signed-off-by: Bjorn Helgaas
Cc: Yu Luming
Acked-by: Matthew Wilcox
Cc: Alan Cox
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bjorn Helgaas
2007-11-15 10:45:41 +0800
57510c2f9 i5000_edac: no need to __stringify() KBUILD_BASENAME

The i5000_edac driver's PCI registration structure has the name
""i5000_edac"" (with extra set of double-quotes) which is probably not
intentional. Get rid of __stringify.
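
The effect is easy to reproduce with the C preprocessor alone. In the sketch
below, STRINGIFY() mirrors the usual two-level stringify macro and
BASENAME_SKETCH is a hypothetical stand-in for a base name that, by
assumption, is already defined as a quoted string literal.

```c
#include <string.h>

#define STRINGIFY_1(x) #x
#define STRINGIFY(x) STRINGIFY_1(x)

/* Assumption: the base name is already a string literal. */
#define BASENAME_SKETCH "i5000_edac"

/* Using the macro directly yields the plain name. */
static const char *plain_name(void)
{
	return BASENAME_SKETCH;
}

/* Stringifying a string literal keeps its quotes as characters. */
static const char *stringified_name(void)
{
	return STRINGIFY(BASENAME_SKETCH);
}
```

Here plain_name() returns the 10-character name, while stringified_name()
returns a 12-character string that begins and ends with a double-quote
character.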

Signed-off-by: Darrick J. Wong
Cc: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Darrick J. Wong
2007-11-15 10:45:41 +0800
9626f1f11 rtc: fall back to requesting only the ports we actually use

Firmware like PNPBIOS or ACPI can report the address space consumed by the
RTC. The actual space consumed may be less than the size (RTC_IO_EXTENT)
assumed by the RTC driver.

The PNP core doesn't request resources yet, but I'd like to make it do so.
If/when it does, the RTC_IO_EXTENT request may fail, which prevents the RTC
driver from loading.

Since we only use the RTC index and data registers at RTC_PORT(0) and
RTC_PORT(1), we can fall back to requesting just enough space for those.

If the PNP core requests resources, this results in typical I/O port usage
like this:

0070-0073 : 00:06

Cc: Alessandro Zummo
Cc: David Brownell
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bjorn Helgaas
2007-11-15 10:45:41 +0800
4c06be10c rtc: release correct region in error path

The misc_register() error path always released an I/O port region,
even if the region was memory-mapped (only mips uses memory-mapped RTC,
as far as I can see).

Signed-off-by: Bjorn Helgaas
Cc: Alessandro Zummo
Cc: David Brownell
Acked-by: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bjorn Helgaas
2007-11-15 10:45:41 +0800
c06a018fa reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file

This is not a new problem in 2.6.23-git17. 2.6.22/2.6.23 is buggy in the
same way.

Reiserfs could accumulate dirty sub-page-size files until umount time.
They cannot be synced to disk by pdflush routines or explicit `sync'
commands. Only `umount' can do the trick.

The direct cause is: the dirty page's PG_dirty is wrongly _cleared_.
Call trace:
cancel_dirty_page+0xd0/0xf0
:reiserfs:reiserfs_cut_from_item+0x660/0x710
:reiserfs:reiserfs_do_truncate+0x271/0x530
:reiserfs:reiserfs_truncate_file+0xfd/0x3b0
:reiserfs:reiserfs_file_release+0x1e0/0x340
__fput+0xcc/0x1b0
fput+0x16/0x20
filp_close+0x56/0x90
sys_close+0xad/0x110
system_call+0x7e/0x83

Fix the bug by removing the cancel_dirty_page() call. Tests show that
it causes no bad behaviors on various write sizes.

=== for the patient ===
Here are more detailed demonstrations of the problem.

1) the page has both PG_dirty(D)/PAGECACHE_TAG_DIRTY(d) after being written to;
and then only PAGECACHE_TAG_DIRTY(d) remains after the file is closed.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# cat > /test/tiny
[T1] hi
[T2] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /home/wfg# echo /test/tiny > /proc/filecache
[T1] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___UD__Bd_ 2
[T2] root /home/wfg# cat /proc/filecache
# file /test/tiny
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback O:owner B:buffer d:dirty w:writeback
# idx len state refcnt
0 1 ___U___Bd_ 2

2) note the non-zero 'cancelled_write_bytes' after /tmp/hi is copied.

------------------------------ screen 0 ------------------------------
[T0] root /home/wfg# echo hi > /tmp/hi
[T1] root /home/wfg# cp /tmp/hi /dev/stdin /test
[T2] hi
[T3] root /home/wfg#

------------------------------ screen 1 ------------------------------
[T1] root /proc/4397# cd /proc/`pidof cp`
[T1] root /proc/4713# cat io
rchar: 8396
wchar: 3
syscr: 20
syscw: 1
read_bytes: 0
write_bytes: 20480
cancelled_write_bytes: 4096
[T2] root /proc/4713# cat io
rchar: 8399
wchar: 6
syscr: 21
syscw: 2
read_bytes: 0
write_bytes: 24576
cancelled_write_bytes: 4096

//Question: the 'write_bytes' is a bit more than expected ;-)

Tested-by: Maxim Levitsky
Cc: Peter Zijlstra
Cc: Jeff Mahoney
Signed-off-by: Fengguang Wu
Reviewed-by: Chris Mason
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fengguang Wu
2007-11-15 10:45:41 +0800
7bb67c14f I/OAT: Add support for version 2 of ioatdma device

Add support for version 2 of the ioatdma device. This device handles
the descriptor chain and DCA services slightly differently:
 - Instead of moving the dma descriptors between a busy and an idle chain,
   this new version uses a single circular chain so that we don't have
   to rewrite the next_descriptor pointers as we add new requests, and the
   device doesn't need to re-read the last descriptor.
 - The new device has the DCA tags defined internally instead of needing
   them defined statically.

Signed-off-by: Shannon Nelson
Cc: "Williams, Dan J"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shannon Nelson
2007-11-15 10:45:41 +0800
cc9f2f8f6 Linux Kernel Markers: fix samples to follow format string standard

Add the field names to marker example format string.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2007-11-15 10:45:40 +0800
5f9468ceb Linux Kernel Markers: document format string

Describes the format string standard further: use of field names before the
type specifiers.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2007-11-15 10:45:40 +0800
314de8a9e Linux Kernel Markers: fix marker mutex not taken upon module load

Upon module load, we must take the markers mutex. It implies that the marker
mutex must be nested inside the module mutex.

It implies changing the nesting order : now the marker mutex nests inside the
module mutex. Make the necessary changes to reverse the order in which the
mutexes are taken.

Includes some cleanup from Dave Hansen.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2007-11-15 10:45:40 +0800
f433dc563 Fixes to the BFS filesystem driver

I found a few bugs in the BFS driver. Detailed description of the bugs as
well as the steps to reproduce the errors are given in the kernel bugzilla.
Please follow these links for more information:

http://bugzilla.kernel.org/show_bug.cgi?id=9363
http://bugzilla.kernel.org/show_bug.cgi?id=9364
http://bugzilla.kernel.org/show_bug.cgi?id=9365
http://bugzilla.kernel.org/show_bug.cgi?id=9366

This patch fixes the bugs described above. Besides, the patch introduces
coding style changes to make the BFS driver conform to the requirements
specified for Linux kernel code. Finally, I made a few cosmetic changes
such as removal of trivial debug output.

Also, the patch removes the fields `si_lf_ioff' and `si_lf_sblk' of the
in-core superblock structure. These fields are initialized but never
actually used.

If you are wondering why I need BFS, here is the answer: I am using this
driver in the context of Linux kernel classes I am teaching in the Moscow
State University and in the International Institute of Information
Technology in Pune, India.

Signed-off-by: Dmitri Vorobiev
Cc: Tigran Aivazian
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dmitri Vorobiev
2007-11-15 10:45:40 +0800
cfb528566 revert "Task Control Groups: example CPU accounting subsystem"

Revert 62d0df64065e7c135d0002f069444fbdfc64768f.

This was originally intended as a simple initial example of how to create a
control groups subsystem; it wasn't intended for mainline, but I didn't make
this clear enough to Andrew.

The CFS cgroup subsystem now has better functionality for the per-cgroup usage
accounting (based directly on CFS stats) than the "usage" status file in this
patch, and the "load" status file is rather simplistic - although having a
per-cgroup load average report would be a useful feature, I don't believe this
patch actually provides it. If it gets into the final 2.6.24 we'd probably
have to support this interface for ever.

Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2007-11-15 10:45:40 +0800
45c682a68 hugetlb: fix i_blocks accounting

For administrative purposes, we want to query the actual block usage of a
hugetlbfs file via fstat. Currently, hugetlbfs always returns 0. Fix that
up, since the kernel already has all the information to track it properly.

Signed-off-by: Ken Chen
Acked-by: Adam Litke
Cc: Badari Pulavarty
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ken Chen
2007-11-15 10:45:40 +0800
8cde045c7 mm/hugetlb.c: make a function static

return_unused_surplus_pages() can become static.

Signed-off-by: Adrian Bunk
Acked-by: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2007-11-15 10:45:40 +0800
90d8b7e61 hugetlb: enforce quotas during reservation for shared mappings

When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
to guarantee no problems will occur later when instantiating pages. If quotas
are in force, page instantiation could fail due to a race with another process
or an oversized (but approved) shared mapping.

To prevent these scenarios, debit the quota for the full reservation amount up
front and credit the unused quota when the reservation is released.
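
The bookkeeping can be pictured with a small userspace model. Everything in
it is hypothetical (the structures, reserve_pages(), instantiate_page() and
release_reservation()); it ignores locking and only illustrates the
debit-up-front, credit-the-remainder idea.

```c
/* Hypothetical quota pool and per-mapping reservation record. */
struct quota_sketch {
	long free_blocks;
};

struct reservation {
	long reserved;	/* pages debited from the quota at mmap time */
	long used;	/* pages actually instantiated so far */
};

/* Debit the whole reservation up front; fail if the quota is too small. */
static int reserve_pages(struct quota_sketch *q, struct reservation *r,
			 long pages)
{
	if (q->free_blocks < pages)
		return -1;
	q->free_blocks -= pages;
	r->reserved = pages;
	r->used = 0;
	return 0;
}

/* Instantiating a reserved page needs no further quota check. */
static void instantiate_page(struct reservation *r)
{
	if (r->used < r->reserved)
		r->used++;
}

/* On release, credit only the part of the reservation never used. */
static void release_reservation(struct quota_sketch *q, struct reservation *r)
{
	q->free_blocks += r->reserved - r->used;
	r->reserved = r->used;
}
```

Because the full amount is debited when the reservation is made, a later
instantiation within it cannot fail on quota grounds in this model.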

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800
9a119c056 hugetlb: allow bulk updating in hugetlb_*_quota()

Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
allow bulk updating of the sbinfo->free_blocks counter. This will be used by
the next patch in the series.
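
A sketch of what a delta-taking get/put pair could look like follows. The
structure and function names are hypothetical, locking is omitted, and
treating a negative free_blocks as "no limit" is an assumption of this
sketch.

```c
#include <stdbool.h>

struct sbinfo_sketch {
	long free_blocks;	/* negative means "no limit" in this sketch */
};

/* Try to debit 'delta' blocks at once; returns false if over quota. */
static bool get_quota(struct sbinfo_sketch *s, long delta)
{
	if (s->free_blocks < 0)
		return true;		/* unlimited: nothing to account */
	if (s->free_blocks - delta < 0)
		return false;		/* not enough quota left */
	s->free_blocks -= delta;
	return true;
}

/* Credit 'delta' blocks back in one step. */
static void put_quota(struct sbinfo_sketch *s, long delta)
{
	if (s->free_blocks >= 0)
		s->free_blocks += delta;
}
```

A caller that needs several blocks makes one call with the total instead of
looping over single-block updates.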

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800
2fc39cec6 hugetlb: debit quota in alloc_huge_page

Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
seem out of place. The alloc/free API is unbalanced because we handle the
hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
Move the get inside alloc_huge_page to clean up this disparity.

This patch has been kept apart from the previous patch because of the somewhat
dodgy ERR_PTR() use herein. Moving the quota logic means that
alloc_huge_page() has two failure modes. Quota failure must result in a
SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR()
doesn't like the small positive errnos we have in VM_FAULT_* so they must be
negated before they are used.
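
The negation issue can be shown with a userspace imitation of the
error-pointer helpers. The sketch assumes error pointers are recognized only
in the top MAX_ERRNO values of the address space; err_ptr(), ptr_err(),
is_err(), alloc_page_sketch() and the FAULT_* codes are hypothetical
stand-ins, not the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Hypothetical small positive codes standing in for VM_FAULT_* values. */
#define FAULT_OOM    1
#define FAULT_SIGBUS 2

static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Only values in the top MAX_ERRNO addresses count as errors. */
static bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Two failure modes, each encoded as a negated fault code. */
static void *alloc_page_sketch(bool quota_ok, bool memory_ok, void *page)
{
	if (!quota_ok)
		return err_ptr(-FAULT_SIGBUS);
	if (!memory_ok)
		return err_ptr(-FAULT_OOM);
	return page;
}

/* Callers negate again to recover the positive fault code, or 0. */
static int fault_code(void *p)
{
	return is_err(p) ? (int)-ptr_err(p) : 0;
}
```

A positive code passed to err_ptr() without negation would look like an
ordinary low address and is_err() would not flag it.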

Does anyone take issue with the way I am using PTR_ERR? If so, what are your
thoughts on how to clean this up (without needing an if,else if,else block at
each alloc_huge_page() callsite)?

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800
c79fb75e5 hugetlb: fix quota management for private mappings

The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
mappings when that support was added. Currently, quota is debited at page
instantiation and credited at file truncation. This approach works correctly
for shared pages but is incomplete for private pages. In addition to
hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
this function does not respect quotas.

Private huge pages are treated very much like normal, anonymous pages. They
are not "backed" by the hugetlbfs file and are not stored in the mapping's
radix tree. This means that private pages are invisible to
truncate_hugepages() so that function will not credit the quota.

This patch (based on a prototype provided by Ken Chen) moves quota crediting
for all pages into free_huge_page(). page->private is used to store a pointer
to the mapping to which this page belongs. This is used to credit quota on
the appropriate hugetlbfs instance.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800