15 May, 2008

6 commits

  • Trying to online a new memory section that was added via memory hotplug
    sometimes results in crashes when the new pages are added via __free_page.
    The reason is that the pageblock bitmap isn't initialized and hence
    contains random stuff. That means that get_pageblock_migratetype()
    also returns random stuff, and therefore

    list_add(&page->lru,
             &zone->free_area[order].free_list[migratetype]);

    in __free_one_page() tries to do a list_add to something that isn't even
    necessarily a list.

    This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix
    usemap initialization"), which makes sure that the pageblock bitmap gets
    initialized only for pages present in a zone. Unfortunately, for hot-added
    memory the zones "grow" after the memmap and the pageblock memmap have
    been initialized, which means that the new pages have an uninitialized
    bitmap. To solve this, the calls to grow_zone_span() and grow_pgdat_span()
    are moved to __add_zone(), just before the initialization happens.

    The patch also moves the two functions since __add_zone() is the only
    caller and I didn't want to add a forward declaration.

    Signed-off-by: Heiko Carstens
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
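
    A rough sketch of the reordering (simplified from mm/memory_hotplug.c of
    that era; locking and error handling are abbreviated, so treat this as
    illustrative rather than the exact diff):

    static int __add_zone(struct zone *zone, unsigned long phys_start_pfn)
    {
            struct pglist_data *pgdat = zone->zone_pgdat;
            int nr_pages = PAGES_PER_SECTION;
            unsigned long flags;

            /* Grow the zone/node spans first ... */
            pgdat_resize_lock(pgdat, &flags);
            grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
            grow_pgdat_span(pgdat, phys_start_pfn, phys_start_pfn + nr_pages);
            pgdat_resize_unlock(pgdat, &flags);

            /* ... so that this initialization now covers the new section,
             * including its pageblock bitmap. */
            memmap_init_zone(nr_pages, pgdat->node_id, zone - pgdat->node_zones,
                             phys_start_pfn, MEMMAP_HOTPLUG);
            return 0;
    }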
     
  • There is a defect in mprotect, which lets the user change the page cache
    type bits, bypassing the kernel reserve_memtype and free_memtype
    wrappers. Fix the problem by not letting mprotect change the PAT bits.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venki Pallipadi
     
  • Add a check to online_pages() to test for failure of
    walk_memory_resource(). This fixes a condition where a failure
    of walk_memory_resource() can lead to online_pages() returning
    success without the requested pages being onlined.

    Signed-off-by: Geoff Levand
    Cc: Yasunori Goto
    Cc: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Cc: Keith Mannthey
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoff Levand
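
    In sketch form, the added check (exact message and cleanup in
    online_pages() abbreviated):

    ret = walk_memory_resource(pfn, nr_pages, &onlined_pages,
                               online_pages_range);
    if (ret) {
            printk(KERN_DEBUG "online_pages %lx at %lx failed\n",
                   nr_pages, pfn);
            /* cancel the online operation and unwind state */
            return ret;
    }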
     
  • __add_zone calls memmap_init_zone twice if memory gets attached to an empty
    zone: once via init_currently_empty_zone and once explicitly right after that
    call.

    Looks like this is currently not a bug, however the call is superfluous and
    might lead to subtle bugs if memmap_init_zone gets changed. So make sure it
    is called only once.

    Cc: Yasunori Goto
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Dave Hansen
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • filemap_fault will go into an infinite loop if ->readpage() fails
    asynchronously.

    AFAICS the bug was introduced by this commit, which removed the wait after the
    final readpage:

    commit d00806b183152af6d24f46f0c33f14162ca1262a
    Author: Nick Piggin
    Date: Thu Jul 19 01:46:57 2007 -0700

    mm: fix fault vs invalidate race for linear mappings

    Fix by reintroducing the wait_on_page_locked() after ->readpage() to make sure
    the page is up-to-date before jumping back to the beginning of the function.

    I've noticed this while testing nfs exporting on fuse. The patch
    fixes it.

    Signed-off-by: Miklos Szeredi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
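
    Roughly, the reintroduced wait in the ->readpage() error path of
    filemap_fault() looks like this (simplified sketch, not the exact diff):

    /* page_not_uptodate path, simplified */
    error = mapping->a_ops->readpage(file, page);
    if (!error) {
            /*
             * Wait for the (possibly asynchronous) read to finish before
             * retrying; otherwise we can loop forever on a page that never
             * becomes uptodate.
             */
            wait_on_page_locked(page);
            if (!PageUptodate(page))
                    error = -EIO;
    }
    page_cache_release(page);

    if (!error || error == AOP_TRUNCATED_PAGE)
            goto retry_find;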
     
  • There is a possible data race in the page table walking code. After the split
    ptlock patches, it actually seems to have been introduced to the core code, but
    even before that I think it would have impacted some architectures (powerpc
    and sparc64, at least, walk the page tables without taking locks, e.g. see
    find_linux_pte()).

    The race is as follows:
    The pte page is allocated, zeroed, and its struct page gets its spinlock
    initialized. The mm-wide ptl is then taken, and then the pte page is inserted
    into the pagetables.

    At this point, the spinlock is not guaranteed to have ordered the previous
    stores to initialize the pte page with the subsequent store to put it in the
    page tables. So another Linux page table walker might be walking down (without
    any locks, because we have split-leaf-ptls), and find that new pte we've
    inserted. It might try to take the spinlock before the store from the other
    CPU initializes it. And subsequently it might read a pte_t out before stores
    from the other CPU have cleared the memory.

    There are also similar races in higher levels of the page tables. They
    obviously don't involve the spinlock, but could see uninitialized memory.

    Arch code and hardware pagetable walkers that walk the pagetables without
    locks could see similar uninitialized memory problems, regardless of whether
    split ptes are enabled or not.

    I prefer to put the barriers in core code, because that's where the higher
    level logic happens, but the page table accessors are per-arch, and open-coding
    them everywhere is not really an option. I'll put the read-side barriers
    in alpha arch code for now (other architectures perform data-dependent loads
    in order).

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
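
    The shape of the fix, sketched (the real patch touches __pte_alloc() and
    friends in mm/memory.c plus the alpha read side; variable names here are
    schematic):

    /* Writer, __pte_alloc()-like path: */
    struct page *new = pte_alloc_one(mm, address);  /* zeroed, ptl initialized */

    /*
     * Order the stores that initialize the new pte page before the store
     * that links it into the page tables, where a lock-free walker can
     * find it.
     */
    smp_wmb();

    spin_lock(&mm->page_table_lock);
    if (pmd_none(*pmd))
            pmd_populate(mm, pmd, new);     /* publish the new pte page */
    spin_unlock(&mm->page_table_lock);

    /* Lock-free reader, pairs with the smp_wmb() above: */
    pmd_t pmdval = *pmd;
    smp_read_barrier_depends();             /* a no-op everywhere but alpha */
    if (!pmd_none(pmdval))
            pte = pte_offset_map(&pmdval, address);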
     

13 May, 2008

2 commits


09 May, 2008

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    Revert "relay: fix splice problem"
    docbook: fix bio missing parameter
    block: use uninitialized_var() in bio_alloc_bioset()
    block: avoid duplicate calls to get_part() in disk stat code
    cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
    block: optimize generic_unplug_device()
    block: get rid of likely/unlikely predictions in merge logic
    vfs: splice remove_suid() cleanup
    cfq-iosched: fix RCU race in the cfq io_context destructor handling
    block: adjust tagging function queue bit locking
    block: sysfs store function needs to grab queue_lock and use queue_flag_*()

    Linus Torvalds
     
  • any_slab_objects() does an atomic_read on an atomic_long_t; this
    fixes it to use atomic_long_read() instead.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
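
    The mismatch, in sketch form (field name as in the slub code of that
    tree):

    /* n->total_objects is an atomic_long_t, so it needs the long accessor */
    if (atomic_long_read(&n->total_objects))        /* was: atomic_read(...) */
            return 1;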
     

07 May, 2008

2 commits

  • generic_file_splice_write() duplicates remove_suid() just because it
    doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
    anyway, so this is rather pointless.

    Move locking to generic_file_splice_write() and call remove_suid() and
    __splice_from_pipe() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Jens Axboe

    Miklos Szeredi
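
    In sketch form, the restructured generic_file_splice_write() (error
    handling and the O_SYNC path omitted):

    struct inode *inode = out->f_mapping->host;

    mutex_lock(&inode->i_mutex);
    ret = remove_suid(out->f_path.dentry);
    if (!ret)
            ret = __splice_from_pipe(pipe, &sd, pipe_to_file);
    mutex_unlock(&inode->i_mutex);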
     
  • Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

    That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
    but we understand now that the typecasting was wrong for PAE in the previous
    version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

    And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
    and pud_bad to support huge pages. It was the wrong way round: we shouldn't
    weaken every pmd_bad and pud_bad check to let huge pages slip through - in
    part they check that we _don't_ have a huge page where it's not expected.

    Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
    been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
    junk in the upper word is good; and x86_64 should follow x86_32's stricter
    comparison, to stop thinking any subset of required bits is good); but that
    should be a later patch.

    Fix Hans' good observation that follow_page() will never find pmd_huge()
    because that would have already failed the pmd_bad test: test pmd_huge in
    between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check?
    No, once it's a hugepage entry, it can get quite far from a good pmd: for
    example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

    However... though follow_page() contains this and another test for huge
    pages, so it's nice to keep it working on them, where does it actually get
    called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to
    call alternative hugetlb processing, as do unmap_vmas() and others.

    Signed-off-by: Hugh Dickins
    Earlier-version-tested-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Jeff Chua
    Cc: Hans Rosenfeld
    Cc: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Hugh Dickins
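
    The reordered tests in follow_page(), sketched (mm/memory.c, simplified):
    pmd_huge() is checked after pmd_none() but before pmd_bad(), since a huge
    pmd legitimately fails the pmd_bad() check.

    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd))
            goto no_page_table;
    if (pmd_huge(*pmd)) {
            BUG_ON(flags & FOLL_GET);
            page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
            goto out;
    }
    if (unlikely(pmd_bad(*pmd)))
            goto no_page_table;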
     

02 May, 2008

2 commits


01 May, 2008

3 commits

  • Implement trivial statistics for the memory resource controller.

    Signed-off-by: Balaji Rao
    Acked-by: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balaji Rao
     
  • Fix vmalloc kernel-doc warning:

    Warning(linux-2.6.25-git14//mm/vmalloc.c:555): No description found for parameter 'caller'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • x86 is the only arch right now which provides an optimized div_long_long_rem,
    and it has the downside that one has to be very careful that the divide
    doesn't overflow.

    The API is a little awkward, as the arguments for the unsigned divide are
    signed. The signed version also doesn't handle a negative divisor and
    produces worse code on 64-bit archs.

    There is little incentive to keep this API alive, so this converts the few
    users to the new API.

    Signed-off-by: Roman Zippel
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
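
    For reference, a sketch of the replacement API from <linux/math64.h> that
    callers were converted to (operands here are placeholders):

    #include <linux/math64.h>

    u32 rem;
    u64 q;
    s32 srem;
    s64 sq;

    /* unsigned: 64-bit dividend, 32-bit divisor, 32-bit remainder */
    q = div_u64_rem(dividend, divisor, &rem);

    /* signed variant: handles a negative dividend and a negative divisor */
    sq = div_s64_rem(sdividend, sdivisor, &srem);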
     

30 Apr, 2008

12 commits

  • This:

    commit 86f6dae1377523689bd8468fed2f2dd180fc0560
    Author: Yasunori Goto
    Date: Mon Apr 28 02:13:33 2008 -0700

    memory hotplug: allocate usemap on the section with pgdat

    Usemaps are allocated on the section which has the pgdat by this.

    Because usemap size is very small, the usemaps of many other sections are
    allocated on only one page. If a section has a usemap, it can't be removed
    until the other sections are removed. This dependency is not desirable for
    memory removal.

    Pgdat has a similar feature. When a section has the pgdat area, it must be
    the last section to be removed on the node. So, if section A has the pgdat
    and section B has the usemap for section A, neither section can be removed
    due to the dependency on each other.

    To solve this issue, this patch collects the usemap on the same section as
    the pgdat. If the other sections don't have any dependency, this section
    will finally be able to be removed.

    Signed-off-by: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    broke davem's sparc64 bootup. Revert it while we work out what went wrong.

    Cc: Yasunori Goto
    Cc: Badari Pulavarty
    Cc: Yinghai Lu
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
    is coming from the page migration caller.

    WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
    Call Trace:
    [] show_stack+0x80/0xa0
    [] dump_stack+0x30/0x60
    [] warn_on_slowpath+0x90/0xe0
    [] __set_page_dirty+0x330/0x360
    [] __set_page_dirty_buffers+0xd0/0x280
    [] set_page_dirty+0xc0/0x260
    [] migrate_page_copy+0x5d0/0x5e0
    [] buffer_migrate_page+0x2e0/0x3c0
    [] migrate_pages+0x770/0xe00

    What was happening is that migrate_page_copy wants to transfer the PG_dirty
    bit from the old page to the new page, so what it would do is
    set_page_dirty(newpage). However, set_page_dirty() is used to set the entire
    page dirty, whereas in this case, only part of the page was dirty, and it
    also was not uptodate.

    Marking the whole page dirty with set_page_dirty would lead to corruption or
    unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
    buffers.

    Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
    however in the interests of keeping the change minimal...

    Signed-off-by: Nick Piggin
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • We can see an ever repeating problem pattern with objects of any kind in the
    kernel:

    1) freeing of active objects
    2) reinitialization of active objects

    Both problems can be hard to debug because the crash happens at a point where
    we have no chance to decode the root cause anymore. One problem spot are
    kernel timers, where the detection of the problem often happens in interrupt
    context and usually causes the machine to panic.

    While working on a timer related bug report I had to hack specialized code
    into the timer subsystem to get a reasonable hint for the root cause. This
    debug hack was fine for temporary use, but far from a mergeable solution due
    to the intrusiveness into the timer code.

    The code further lacked the ability to detect and report the root cause
    instantly and keep the system operational.

    Keeping the system operational is important to get hold of the debug
    information without special debugging aids like serial consoles and special
    knowledge of the bug reporter.

    The problems described above are not restricted to timers, but timers tend to
    expose it usually in a full system crash. Other objects are less explosive,
    but the symptoms caused by such mistakes can be even harder to debug.

    Instead of creating specialized debugging code for the timer subsystem a
    generic infrastructure is created which allows developers to verify their code
    and provides an easy to enable debug facility for users in case of trouble.

    The debugobjects core code keeps track of operations on static and dynamic
    objects by inserting them into a hashed list and sanity checking them on
    object operations, and it provides additional checks whenever kernel memory
    is freed.

    The tracked object operations are:
    - initializing an object
    - adding an object to a subsystem list
    - deleting an object from a subsystem list

    Each operation is sanity checked before the operation is executed, and the
    subsystem specific code can provide a fixup function which makes it possible
    to prevent damage from the operation. When the sanity check triggers, a
    warning message and a stack trace are printed.

    The list of operations can be extended if the need arises. For now it's
    limited to the requirements of the first user (timers).

    The core code enqueues the objects into hash buckets. The hash index is
    generated from the address of the object to simplify the lookup for the check
    on kfree/vfree. Each bucket has its own spinlock to avoid contention on a
    global lock.

    The debug code can be compiled in without being active. The runtime overhead
    is minimal and could be optimized by asm alternatives. A kernel command line
    option enables the debugging code.

    Thanks to Ingo Molnar for review, suggestions and cleanup patches.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Greg KH
    Cc: Randy Dunlap
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
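
    A sketch of how a subsystem (timers being the first user) hooks in: it
    describes its object type with a debug_obj_descr, optionally providing
    fixup callbacks, and calls the debug_object_* hooks around its own
    init/add/delete operations. Function names below are illustrative.

    #include <linux/debugobjects.h>

    static struct debug_obj_descr timer_debug_descr = {
            .name = "timer_list",
            /* optional: .fixup_init, .fixup_activate, .fixup_free callbacks
             * that try to repair the object when a check triggers */
    };

    static void my_init_timer(struct timer_list *timer)
    {
            debug_object_init(timer, &timer_debug_descr);
            /* ... normal initialization ... */
    }

    static void my_add_timer(struct timer_list *timer)
    {
            /* warns (and can fix up) if the object was never initialized
             * or is already active */
            debug_object_activate(timer, &timer_debug_descr);
            /* ... add to the subsystem list ... */
    }

    static void my_del_timer(struct timer_list *timer)
    {
            debug_object_deactivate(timer, &timer_debug_descr);
            /* ... remove from the subsystem list ... */
    }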
     
  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point of view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
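
    The accounting itself, sketched (the fuse-side user lands in a later
    patch of the series; "tmp_page" is a placeholder):

    /* temporary buffer page starts writeback */
    inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
    ...
    /* writeout to the backing store completed */
    dec_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);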
     
  • Fuse needs this for writable mmap support.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
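
    In sketch form, the effect on the writeback accounting path (simplified):

    /* test_set_page_writeback() / test_clear_page_writeback() */
    struct backing_dev_info *bdi = mapping->backing_dev_info;

    if (bdi_cap_account_writeback(bdi))    /* false if BDI_CAP_NO_ACCT_WB */
            __inc_bdi_stat(bdi, BDI_WRITEBACK);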
     
  • Move BDI statistics to debugfs:

    /sys/kernel/debug/bdi/<bdi>/stats

    Use postcore_initcall() to initialize the sysfs class and debugfs,
    because debugfs is initialized in core_initcall().

    Update descriptions in ABI documentation.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in max_ratio_store().
    - export bdi_set_max_ratio() to modules
    - limit bdi_dirty with bdi->max_ratio
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Under normal circumstances each device is given a part of the total write-back
    cache that relates to its current avg writeout speed in relation to the other
    devices.

    min_ratio - allows one to assign a minimum portion of the write-back cache to
    a particular device. This is useful in situations where you might want to
    provide a minimum QoS. (One request for this feature came from flash based
    storage people who wanted to avoid writing out at all costs - they of course
    needed some pdflush hacks as well)

    max_ratio - allows one to assign a maximum portion of the dirty limit to a
    particular device. This is useful in situations where you want to avoid one
    device taking all or most of the write-back cache, e.g. an NFS mount that is
    prone to get stuck, or a FUSE mount which you don't trust to play fair.

    Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in min_ratio_store()
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all relevant
    users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • On a memoryless node, /proc/pagetypeinfo displays slightly funny output.
    This patch fixes it.

    Output example (the header is printed, but no data is printed)
    --------------------------------------------------------------
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
    Page block order: 14
    Pages per block: 16384

    Free pages count per migrate type at order 0 1 2 3 4 5 \
    6 7 8 9 10 11 12 13 14 15 16

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

29 Apr, 2008

11 commits

  • Use proc_create() to make sure that ->proc_fops is set up before gluing the
    PDE to the main tree.

    Signed-off-by: Denis V. Lunev
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis V. Lunev
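
    The pattern, sketched (entry name and fops are illustrative):

    /* racy: the entry is visible in /proc before proc_fops is assigned */
    pde = create_proc_entry("vmstat", S_IRUGO, NULL);
    if (pde)
            pde->proc_fops = &proc_vmstat_file_operations;

    /* fixed: the fops are in place before the PDE is glued into the tree */
    proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);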
     
  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
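
    A sketch of the new lookup (helper names roughly as introduced by this
    patch; error handling simplified): the stored reference replaces the VMA
    walk when resolving /proc/pid/exe.

    /* proc_exe_link()-style code, simplified */
    struct file *exe_file = get_mm_exe_file(mm);

    if (!exe_file)
            return -ENOENT;
    *path = exe_file->f_path;
    path_get(&exe_file->f_path);
    fput(exe_file);
    return 0;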
     
  • This is a trivial patch that defines the priority of slab_memory_callback in
    the callback chain as a constant. This is to prepare for next patch in the
    series.

    Signed-off-by: Nadia Derbey
    Cc: Yasunori Goto
    Cc: Matt Helsley
    Cc: Mingming Cao
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadia Derbey
     
  • *mem has been zeroed, which means mem->info has already been filled with 0.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • On ia64, this kmalloc() requires order-4 pages. But it does not need to be
    physically contiguous. For a big mem_cgroup, vmalloc is better. For small
    ones, kmalloc is used.

    [akpm@linux-foundation.org: simplification]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelyanov
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
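
    The allocation pattern, sketched (the matching free path must check which
    allocator was used):

    static struct mem_cgroup *mem_cgroup_alloc(void)
    {
            struct mem_cgroup *mem;

            if (sizeof(*mem) < PAGE_SIZE)
                    mem = kmalloc(sizeof(*mem), GFP_KERNEL);
            else
                    mem = vmalloc(sizeof(*mem));

            if (mem)
                    memset(mem, 0, sizeof(*mem));
            return mem;
    }

    static void mem_cgroup_free(struct mem_cgroup *mem)
    {
            if (sizeof(*mem) < PAGE_SIZE)
                    kfree(mem);
            else
                    vfree(mem);
    }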
     
  • This patch makes the memory controller more responsive on my desktop.

    1. Set all cached pages as inactive. We were by default marking all pages
    as active, thus forcing us to go through two passes for reclaiming pages

    2. Remove congestion_wait(), since we already have that logic in
    do_try_to_free_pages()

    Signed-off-by: Balbir Singh
    Reviewed-by: KOSAKI Motohiro
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • remove_list/add_list use page_cgroup_zoneinfo() internally.

    So, it's called twice, before and after the lock:

    mz = page_cgroup_zoneinfo();
    lock();
    mz = page_cgroup_zoneinfo();
    ....
    unlock();

    And the address of mz never changes.

    This is not good. This patch fixes this behavior.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a very common requirement from people using the resource accounting
    facilities (not only memcgroup but also OpenVZ beancounters). They want to
    put the cgroup in an initial state without re-creating it.

    For example after re-configuring a group people want to observe how this new
    configuration fits the group needs without saving the previous failcnt value.

    Merge the two resets into one mem_cgroup_reset() function to demonstrate how
    multiplexing works.

    Besides, I have plans to move the files that correspond to res_counter into
    the res_counter.c file and somehow "import" them into the controller. I
    don't know how to do it gracefully yet, but merging the resets of max_usage
    and failcnt into one function will be there for sure.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Pavel Emelyanov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
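
    The multiplexing, sketched: a single trigger handler switches on which
    file was written, so max_usage and failcnt share the reset code (details
    approximate):

    static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
    {
            struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

            switch (event) {
            case RES_MAX_USAGE:
                    res_counter_reset_max(&mem->res);
                    break;
            case RES_FAILCNT:
                    res_counter_reset_failcnt(&mem->res);
                    break;
            }
            return 0;
    }

    /* both control files point at the same handler, distinguished
     * by .private */
    { .name = "max_usage_in_bytes", .private = RES_MAX_USAGE,
      .trigger = mem_cgroup_reset, .read_u64 = mem_cgroup_read, },
    { .name = "failcnt", .private = RES_FAILCNT,
      .trigger = mem_cgroup_reset, .read_u64 = mem_cgroup_read, },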
     
  • These two files are essentially event callbacks. They do not care about the
    contents of the string, but only about the fact of the write itself.

    Signed-off-by: Pavel Emelyanov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Move the memory controller data structure page_cgroup to its own slab cache.
    It saves space on the system, allocations are not necessarily pushed up to a
    power of 2, and it should provide performance benefits. Users who disable the
    memory controller can also double check that the memory controller is not
    allocating page_cgroup's.

    NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup
    as __GFP_MOVABLE or __GFP_RECLAIMABLE. I don't think there is an easy answer
    at the moment. page_cgroup's are associated with user pages, they can be
    reclaimed once the user page has been reclaimed, so it might make sense to
    mark them as __GFP_RECLAIMABLE. For now, I am leaving the marking to default
    values that the slab allocator uses.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
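
    Sketch of the dedicated cache (flags and the exact init site are
    approximate):

    static struct kmem_cache *page_cgroup_cache;

    /* at memory controller initialization */
    page_cgroup_cache = kmem_cache_create("page_cgroup",
                                          sizeof(struct page_cgroup), 0,
                                          SLAB_PANIC, NULL);

    /* per-page allocations now come from the dedicated cache */
    pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
    ...
    kmem_cache_free(page_cgroup_cache, pc);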
     
  • This field is the maximal value of the usage one since the counter creation
    (or since the latest reset).

    To reset this to the usage value simply write anything to the appropriate
    cgroup file.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov