24 Mar, 2011

18 commits

  • This set of patches eliminates the RapidIO subsystem's dependency on the
    PowerPC architecture and makes it available to other architectures (x86
    and MIPS). It also enables support for new platform-independent RapidIO
    controllers such as PCI-to-SRIO and PCI Express-to-SRIO.

    This patch:

    Extend the set of mport callback functions to eliminate direct linking
    of architecture-specific mport operations.
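
    A sketch of the direction (illustrative field names, not the exact
    rio_ops definition): the RapidIO core calls through a per-mport
    operations table supplied by the controller driver, instead of linking
    directly to architecture-specific functions.

    #include <linux/types.h>

    struct rio_mport;

    /* Illustrative per-mport operations table. */
    struct rio_ops {
            /* local config space access through this mport */
            int (*lcread)(struct rio_mport *mport, int index,
                          u32 offset, int len, u32 *data);
            int (*lcwrite)(struct rio_mport *mport, int index,
                           u32 offset, int len, u32 data);
            /* mailbox callbacks of the kind this series adds */
            int (*open_outb_mbox)(struct rio_mport *mport, void *dev_id,
                                  int mbox, int entries);
            void (*close_outb_mbox)(struct rio_mport *mport, int mbox);
    };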

    Signed-off-by: Alexandre Bounine
    Cc: Kumar Gala
    Cc: Matt Porter
    Cc: Li Yang
    Cc: Thomas Moll
    Cc: Micha Nelissen
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Bounine
     
  • 1. namelen is declared "unsigned short", which hints at possible space
    savings. Indeed, in 2.4 struct proc_dir_entry looked like:

    struct proc_dir_entry {
            unsigned short low_ino;
            unsigned short namelen;

    Now low_ino is "unsigned int", so those savings have been gone for a
    long time. struct proc_dir_entry is not allocated in such numbers that
    its size is worth worrying about, anyway.

    2. Converting from unsigned short to int/unsigned int can only create
    problems, so we had better play it safe.

    Space is not really conserved, because of the natural alignment of the
    next field: sizeof(struct proc_dir_entry) remains the same.
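
    A standalone illustration of the alignment point (a minimal userspace
    sketch, not the real proc_dir_entry): widening a leading unsigned short
    to int does not change the struct size, because padding before the
    naturally aligned next member absorbs the difference.

    #include <stdio.h>

    struct before { unsigned short namelen; void *next; }; /* 2 + 6 pad + 8 */
    struct after  { int namelen;            void *next; }; /* 4 + 4 pad + 8 */

    int main(void)
    {
            /* both print 16 on a typical LP64 system */
            printf("%zu %zu\n", sizeof(struct before), sizeof(struct after));
            return 0;
    }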

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • We never uncharge subpage quantities.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In struct page_cgroup, we have a full word for flags but only a few are
    reserved. Use the remaining upper bits to encode, depending on
    configuration, the node or the section, to enable page_cgroup-to-page
    lookups without a direct pointer.

    This saves a full word for every page in a system with memory cgroups
    enabled.
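
    Roughly the trick, as a standalone sketch; the bit width and shift here
    are assumptions (the kernel derives them from NODES_SHIFT or the
    sparsemem section bits, depending on configuration).

    #include <stdio.h>

    #define NODE_BITS   6                               /* assumed width */
    #define NODE_SHIFT  (sizeof(long) * 8 - NODE_BITS)  /* upper bits */

    static unsigned long flags_set_node(unsigned long flags, int nid)
    {
            return (flags & ~(~0UL << NODE_SHIFT)) |
                   ((unsigned long)nid << NODE_SHIFT);
    }

    static int flags_to_node(unsigned long flags)
    {
            return flags >> NODE_SHIFT;
    }

    int main(void)
    {
            unsigned long flags = 0x3;      /* low bits: the real flags */

            flags = flags_set_node(flags, 5);
            printf("node=%d low flags=%#lx\n",
                   flags_to_node(flags), flags & 0x3);
            return 0;
    }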

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It is one logical function, no need to have it split up.

    Also, get rid of some checks from the inner function that ensured the
    sanity of the outer function.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Instead of passing a whole struct page_cgroup to this function, let it
    take only what it really needs from it: the struct mem_cgroup and the
    page.

    This has the advantage that reading pc->mem_cgroup is now done at the same
    place where the ordering rules for this pointer are enforced and
    explained.

    It is also in preparation for removing the pc->page backpointer.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add checks at page allocation and freeing time for whether the page is
    used (i.e. charged) from the memcg point of view.

    This check can be useful when debugging a problem, and we did similar
    checks before commit 52d4b9ac ("memcg: allocate all page_cgroup at
    boot").

    This adds some overhead to allocating and freeing memory, so it's
    enabled only when CONFIG_DEBUG_VM is enabled.
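
    A hypothetical shape of such a check (a sketch only; PageCgroupUsed()
    and lookup_page_cgroup() are the memcg helpers of this era, but the
    exact hook points are the patch's business):

    #ifdef CONFIG_DEBUG_VM
            /* freeing a page that is still charged indicates a memcg bug */
            VM_BUG_ON(PageCgroupUsed(lookup_page_cgroup(page)));
    #endif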

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Since the introduction of transparent huge pages, checking whether
    memory cgroups are below their limits is no longer enough: the actual
    amount of chargeable space matters.

    To not have more than one limit-checking interface, replace
    memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
    single memory_cgroup_margin() that returns the chargeable space and leaves
    the comparison to the callsite.

    Soft limits are now checked the other way round, by using the already
    existing function that returns the amount by which soft limits are
    exceeded: res_counter_soft_limit_excess().

    Also remove all the corresponding functions on the res_counter side that
    are now no longer used.
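
    A sketch of the new shape of the interface (simplified; the real
    res_counter version takes the counter's lock):

    /* chargeable space left, in bytes, instead of a yes/no answer */
    static unsigned long long res_counter_margin(struct res_counter *cnt)
    {
            unsigned long long margin = 0;

            if (cnt->limit > cnt->usage)
                    margin = cnt->limit - cnt->usage;
            return margin;
    }

    /* the call site decides what "enough" means, e.g. for a huge page: */
    if (mem_cgroup_margin(mem) < HPAGE_SIZE)
            goto reclaim;   /* not enough chargeable space (sketch label) */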

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Soft limit reclaim continues until the usage is below the current soft
    limit, but the documented semantics are actually that soft limit reclaim
    will push usage back until the soft limits are met again.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • minix bit operations are only used by the minix filesystem and are
    useless to other modules, because the byte order of the inode and block
    bitmaps differs between architectures, as shown below:

    m68k:
    big-endian 16bit indexed bitmaps

    h8300, microblaze, s390, sparc, m68knommu:
    big-endian 32 or 64bit indexed bitmaps

    m32r, mips, sh, xtensa:
    big-endian 32 or 64bit indexed bitmaps for big-endian mode
    little-endian bitmaps for little-endian mode

    Others:
    little-endian bitmaps

    In order to move minix bit operations from asm/bitops.h to architecture
    independent code in minix filesystem, this provides two config options.

    CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k.
    CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use
    native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu,
    m32r, mips, sh, xtensa). The architectures which always use little-endian
    bitmaps do not select these options.

    Finally, we can remove minix bit operations from asm/bitops.h for all
    architectures.
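
    The m68k case illustrates why one generic implementation cannot work:
    its minix bitmaps are arrays of big-endian 16-bit words with bits
    counted from each word's LSB (our assumption of the counting order
    here), so bit n lives in a different byte than in a little-endian
    bitmap. A sketch, not the kernel's macros:

    #include <endian.h>     /* glibc's be16toh() */
    #include <stdint.h>

    /* big-endian 16-bit indexed, as on m68k */
    static int minix_test_bit_be16(int nr, const void *addr)
    {
            const uint16_t *p = addr;

            return (be16toh(p[nr >> 4]) >> (nr & 15)) & 1;
    }

    /* little-endian bitmap, as on most other architectures */
    static int test_bit_le(int nr, const void *addr)
    {
            const uint8_t *p = addr;

            return (p[nr >> 3] >> (nr & 7)) & 1;
    }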

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Acked-by: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Andreas Schwab
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Michal Simek
    Cc: "David S. Miller"
    Cc: Hirokazu Takata
    Acked-by: Ralf Baechle
    Acked-by: Paul Mundt
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • As the result of conversions, there are no users of ext2 non-atomic bit
    operations except for ext2 filesystem itself. Now we can put them into
    architecture independent code in ext2 filesystem, and remove from
    asm/bitops.h for all architectures.

    Signed-off-by: Akinobu Mita
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Jan Kara
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • As a preparation for removing ext2 non-atomic bit operations from
    asm/bitops.h, this converts ext2 non-atomic bit operations to
    little-endian bit operations.

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Introduce little-endian bit operations for the little-endian
    architectures and for the big-endian architectures that do not have
    native little-endian bit operations (alpha, avr32, blackfin, cris, frv,
    h8300, ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa).

    These architectures can just include generic implementation
    (asm-generic/bitops/le.h).
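
    On little-endian machines these map straight to the native bit
    operations; on big-endian machines the generic header flips the bit
    number within each word, roughly (simplified from
    asm-generic/bitops/le.h):

    #if defined(__LITTLE_ENDIAN)
    #define BITOP_LE_SWIZZLE        0
    #elif defined(__BIG_ENDIAN)
    #define BITOP_LE_SWIZZLE        ((BITS_PER_LONG - 1) & ~0x7)
    #endif

    #define test_bit_le(nr, addr) \
            test_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
    #define __set_bit_le(nr, addr) \
            __set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))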

    Signed-off-by: Akinobu Mita
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: Grant Grundler
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Hirokazu Takata
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: Hans-Christian Egtvedt
    Acked-by: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This makes the little-endian bitops take any pointer types by changing the
    prototypes and adding casts in the preprocessor macros.

    That would seem to at least make all the filesystem code happier, and they
    can continue to do just something like

    #define ext2_set_bit __test_and_set_bit_le

    (or whatever the exact sequence ends up being).
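
    Schematically, the any-pointer-type behavior comes from casting inside
    the macro rather than requiring callers to pass unsigned long * (a
    sketch of the pattern, not the final definitions):

    #define test_bit_le(nr, addr) \
            test_bit((nr) ^ BITOP_LE_SWIZZLE, (const unsigned long *)(addr))
    #define __test_and_set_bit_le(nr, addr) \
            __test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (unsigned long *)(addr))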

    Signed-off-by: Akinobu Mita
    Cc: Arnd Bergmann
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: Grant Grundler
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Hirokazu Takata
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Hans-Christian Egtvedt
    Cc: "H. Peter Anvin"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • As a preparation for providing little-endian bitops for all
    architectures, this renames the generic implementations of the
    little-endian bitops (dropping the "generic_" prefix and moving "_le"
    to the end):

    s/generic_find_next_le_bit/find_next_bit_le/
    s/generic_find_next_zero_le_bit/find_next_zero_bit_le/
    s/generic_find_first_zero_le_bit/find_first_zero_bit_le/
    s/generic___test_and_set_le_bit/__test_and_set_bit_le/
    s/generic___test_and_clear_le_bit/__test_and_clear_bit_le/
    s/generic_test_le_bit/test_bit_le/
    s/generic___set_le_bit/__set_bit_le/
    s/generic___clear_le_bit/__clear_bit_le/
    s/generic_test_and_set_le_bit/test_and_set_bit_le/
    s/generic_test_and_clear_le_bit/test_and_clear_bit_le/

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Acked-by: Hans-Christian Egtvedt
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Andreas Schwab
    Cc: Greg Ungerer
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patch series introduces little-endian bit operations in asm/bitops.h
    for all architectures and converts all ext2 non-atomic and minix bit
    operations to use little-endian bit operations. It enables us to remove
    ext2 non-atomic and minix bit operations from asm/bitops.h. The reason
    they should be removed from asm/bitops.h is as follows:

    For the ext2 non-atomic bit operations: they are used for little-endian
    byte order bitmap access by some filesystems and modules, but using
    ext2_*() functions in a module other than the ext2 filesystem strikes
    some as strange.

    For the minix bit operations: they are only used by the minix
    filesystem and are useless to other modules, because the byte order of
    the inode and block bitmaps differs between architectures.

    This patch:

    In order to make the forthcoming changes smaller, this merges the macro
    definitions in asm-generic/bitops/le.h for big-endian and little-endian
    as much as possible.

    This also removes the unused BITOP_WORD macro.

    Signed-off-by: Akinobu Mita
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

23 Mar, 2011

22 commits

  • Commit 34db18a054c6 ("smp: move smp setup functions to kernel/smp.c")
    causes this build error on s390 because of a missing init.h include:

    CC arch/s390/kernel/asm-offsets.s
    In file included from /home2/heicarst/linux-2.6/arch/s390/include/asm/spinlock.h:14:0,
    from include/linux/spinlock.h:87,
    from include/linux/seqlock.h:29,
    from include/linux/time.h:8,
    from include/linux/timex.h:56,
    from include/linux/sched.h:57,
    from arch/s390/kernel/asm-offsets.c:10:
    include/linux/smp.h:117:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'setup_nr_cpu_ids'
    include/linux/smp.h:118:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'smp_init'

    Fix it by adding the include statement.
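
    The declarations at the error lines use the __init marker, which is
    defined in linux/init.h; a sketch of the affected lines in
    include/linux/smp.h:

    #include <linux/init.h>         /* previously missing */

    void __init setup_nr_cpu_ids(void);
    void __init smp_init(void);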

    Signed-off-by: Heiko Carstens
    Acked-by: WANG Cong
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (66 commits)
    avr32: at32ap700x: fix typo in DMA master configuration
    dmaengine/dmatest: Pass timeout via module params
    dma: let IMX_DMA depend on IMX_HAVE_DMA_V1 instead of an explicit list of SoCs
    fsldma: make halt behave nicely on all supported controllers
    fsldma: reduce locking during descriptor cleanup
    fsldma: support async_tx dependencies and automatic unmapping
    fsldma: fix controller lockups
    fsldma: minor codingstyle and consistency fixes
    fsldma: improve link descriptor debugging
    fsldma: use channel name in printk output
    fsldma: move related helper functions near each other
    dmatest: fix automatic buffer unmap type
    drivers, pch_dma: Fix warning when CONFIG_PM=n.
    dmaengine/dw_dmac fix: use readl & writel instead of __raw_readl & __raw_writel
    avr32: at32ap700x: Specify DMA Flow Controller, Src and Dst msize
    dw_dmac: Setting Default Burst length for transfers as 16.
    dw_dmac: Allow src/dst msize & flow controller to be configured at runtime
    dw_dmac: Changing type of src_master and dest_master to u8.
    dw_dmac: Pass Channel Priority from platform_data
    dw_dmac: Pass Channel Allocation Order from platform_data
    ...

    Linus Torvalds
     
  • Instead of always creating a huge (268K) deflate_workspace with the
    maximum compression parameters (windowBits=15, memLevel=8), allow the
    caller to obtain a smaller workspace by specifying smaller parameter
    values.

    For example, when capturing oops and panic reports to a medium with
    limited capacity, such as NVRAM, compression may be the only way to
    capture the whole report. In this case, a small workspace (24K works
    fine) is a win, whether you allocate the workspace when you need it (i.e.,
    during an oops or panic) or at boot time.

    I've verified that this patch works with all accepted values of windowBits
    (positive and negative), memLevel, and compression level.
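
    zlib documents deflate's internal memory use as roughly
    (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes, so, for example,
    windowBits=11 and memLevel=5 lands near the 24K figure above. A
    runnable userspace sketch of the idea (the in-kernel interface differs
    in detail, passing the same parameters when sizing the workspace):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>               /* build with -lz */

    int main(void)
    {
            unsigned char in[] = "panic: example report to compress";
            unsigned char out[256];
            z_stream strm;

            memset(&strm, 0, sizeof(strm));
            /* windowBits=11, memLevel=5: far smaller state than the 15/8
             * defaults, at some cost in compression ratio */
            if (deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED,
                             11, 5, Z_DEFAULT_STRATEGY) != Z_OK)
                    return 1;
            strm.next_in = in;      strm.avail_in = sizeof(in);
            strm.next_out = out;    strm.avail_out = sizeof(out);
            deflate(&strm, Z_FINISH);
            printf("compressed %lu -> %lu bytes\n",
                   strm.total_in, strm.total_out);
            return deflateEnd(&strm) == Z_OK ? 0 : 1;
    }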

    Signed-off-by: Jim Keniston
    Cc: Herbert Xu
    Cc: David Miller
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Keniston
     
  • Add brackets around typecasted argument in crc32() macro.
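
    The failure mode being guarded against, as an illustration (not the
    kernel's crc32() macro): without parentheses, a cast in the expansion
    binds only to the first token of the argument expression.

    #include <stdio.h>

    #define FIRST_BYTE_BAD(p)   (*(const unsigned char *) p)    /* no () */
    #define FIRST_BYTE_OK(p)    (*(const unsigned char *)(p))

    int main(void)
    {
            int v[2] = { 0x10, 0x20 };

            /* BAD(v + 1) expands to *(const unsigned char *)v + 1:
             * the first byte of v[0], plus one.  OK(v + 1) reads the
             * first byte of v[1].  (Values assume little endian.) */
            printf("%d vs %d\n", FIRST_BYTE_BAD(v + 1), FIRST_BYTE_OK(v + 1));
            return 0;
    }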

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Analog Devices' SigmaStudio can produce firmware blobs for devices with
    these DSPs embedded (like some audio codecs). Allow these device drivers
    to easily parse and load them.

    Signed-off-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • 1. simple_strto*() do not contain overflow checks, and have a crufty,
    libc-style way of indicating failure.
    2. strict_strto*() do not have overflow checks either, but the name and
    comments pretend they do.
    3. Both families have only "long long" and "long" variants, but users
    want strtou8().
    4. Both the "simple" and "strict" prefixes are wrong: "simple" doesn't
    say what exactly is so simple, and "strict" should not exist because
    conversion should be strict by default.

    The solution is to use a "k" prefix and add converters for more types.
    Enter
    kstrtoull()
    kstrtoll()
    kstrtoul()
    kstrtol()
    kstrtouint()
    kstrtoint()

    kstrtou64()
    kstrtos64()
    kstrtou32()
    kstrtos32()
    kstrtou16()
    kstrtos16()
    kstrtou8()
    kstrtos8()

    A runtime test suite (somewhat incomplete) is included as well.

    strict_strto*() are now deprecated, stubbed to kstrto*(), and will
    eventually be removed altogether.

    Use kstrto*() in code today!

    Note: on some archs _kstrtoul() and _kstrtol() are left in the tree,
    even if they'll be unused at runtime. This is a temporary solution,
    because I don't want to hardcode the list of archs where these
    functions aren't needed. The current solution with sizeof() and
    __alignof__ at least always works.
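
    Typical calling style, sketched as a hypothetical sysfs store handler:

    static ssize_t threshold_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t count)
    {
            unsigned int val;
            int ret;

            ret = kstrtouint(buf, 10, &val);    /* base 10, overflow-checked */
            if (ret)
                    return ret;                 /* -EINVAL or -ERANGE */
            /* ... use val ... */
            return count;
    }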

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
    setup functions from init/main.c to kernel/smp.c, saving some #ifdef
    CONFIG_SMP.

    Signed-off-by: WANG Cong
    Cc: Rakib Mullick
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Arnd Bergmann
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • PTR_RET() can be used if you have an error-pointer and are only interested
    in the eventual error value, but not the pointer. Yields the usual 0 for
    no error, -ESOMETHING otherwise.
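
    A sketch of the intended use, replacing the open-coded pattern
    (some_create_call() is hypothetical):

    /* before */
    p = some_create_call();
    if (IS_ERR(p))
            return PTR_ERR(p);
    return 0;

    /* after */
    return PTR_RET(some_create_call());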

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Reorder struct block_device to remove 8 bytes of padding on 64-bit
    builds; this also shrinks bdev_inode by 8 bytes (776 -> 768), allowing
    it to fit into one fewer cache line.

    Signed-off-by: Richard Kennedy
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • Unify identical gcc3.x and gcc4.x macros.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • All architectures can use the common dma_addr_t typedef now. We can
    remove the arch specific dma_addr_t.

    Signed-off-by: FUJITA Tomonori
    Acked-by: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Ivan Kokshaysky
    Cc: Richard Henderson
    Cc: Matt Turner
    Cc: "Luck, Tony"
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    FUJITA Tomonori
     
  • Add a new __GFP_OTHER_NODE flag to tell the low-level NUMA statistics
    in zone_statistics() that an allocation is on behalf of another thread.
    This way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that
    right to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot
    of changes to pass the parameter down, including at least one addition
    of a 10th argument to a 9-argument function. Using the flag is a lot
    less intrusive.

    Open question: should this also be used for migration?
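
    Roughly the shape of the accounting change (a simplified sketch of the
    zone_statistics() logic, not verbatim kernel code):

    if (z->node == ((flags & __GFP_OTHER_NODE) ?
                    preferred_zone->node : numa_node_id()))
            __inc_zone_state(z, NUMA_LOCAL);  /* local to the real consumer */
    else
            __inc_zone_state(z, NUMA_OTHER);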

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • When reclaiming for order-0 pages, kswapd requires that all zones be
    balanced. Each cycle through balance_pgdat() does background ageing on
    all zones if necessary and applies equal pressure on the inactive zone
    unless a lot of pages are free already.

    A "lot of free pages" is defined as a "balance gap" above the high
    watermark which is currently 7*high_watermark. Historically this was
    reasonable as min_free_kbytes was small. However, on systems using huge
    pages, it is recommended that min_free_kbytes is higher and it is tuned
    with hugeadm --set-recommended-min_free_kbytes. With the introduction of
    transparent huge page support, this recommended value is also applied. On
    X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
    expect around 68M of memory to be free. The Normal zone is approximately
    35000 pages so under even normal memory pressure such as copying a large
    file, it gets exhausted quickly. As it is getting exhausted, kswapd
    applies pressure equally to all zones, including the DMA32 zone. DMA32 is
    approximately 700,000 pages with a high watermark of around 23,000 pages.
    In this situation, kswapd will reclaim around 23000*8 pages (where the
    factor 8 is the high watermark plus a balance gap of 7 * high
    watermark), or 718M of pages, before the zone is ignored. What the user
    sees is free memory far higher than it should be.

    To avoid an excessive number of pages being reclaimed from the larger
    zones, this patch explicitly defines the "balance gap" to be either 1%
    of the zone or the low watermark for the zone, whichever is smaller.
    While kswapd will check all zones to apply pressure, it will ignore
    zones that meet the (high_wmark + balance_gap) watermark.

    To test this, 80G were copied from a partition and the amount of memory
    being used was recorded. A comparison of patched and unpatched kernels
    can be seen at
    be seen at
    http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
    and shows that kswapd is not reclaiming as much memory with the patch
    applied.
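
    The gap is computed per zone, roughly as follows (a sketch; the ratio
    constant's name is illustrative):

    /* balance gap: the smaller of 1% of the zone and its low watermark */
    #define KSWAPD_ZONE_BALANCE_GAP_RATIO 100

    balance_gap = min(low_wmark_pages(zone),
            (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
            KSWAPD_ZONE_BALANCE_GAP_RATIO);

    /* kswapd leaves the zone alone once it clears high_wmark + gap */
    if (zone_watermark_ok_safe(zone, order,
                               high_wmark_pages(zone) + balance_gap,
                               end_zone, 0))
            continue;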

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Shaohua Li
    Cc: "Chen, Tim C"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
    unconditionally split any transparent huge pages it runs into. In
    practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.
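
    A handler that does not want to deal with huge pages can keep the old
    behavior by splitting explicitly, e.g. (a sketch):

    static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                            unsigned long end, struct mm_walk *walk)
    {
            /* opt out of THP handling: break the huge pmd down to ptes */
            split_huge_page_pmd(walk->mm, pmd);

            /* ... then walk the ptes under this pmd as before ... */
            return 0;
    }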

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The rotate_reclaimable_page() function moves just-written-out pages,
    which the VM wanted to reclaim, to the end of the inactive list. That
    way the VM will find those pages first the next time it needs to free
    memory.

    This patch applies the same rule to memcg. It can help prevent
    unnecessary eviction of memcg working-set pages.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently there have been reports of a thrashing problem
    (http://marc.info/?l=rsync&m=128885034930933&w=2). It happens with
    backup workloads (e.g. nightly rsync): the workload creates use-once
    pages but touches each page twice, which promotes the pages to the
    active list and results in working-set page eviction.

    Some application developers want POSIX_FADV_NOREUSE to be supported,
    but other OSes don't support it either.
    (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

    As another approach, application developers use POSIX_FADV_DONTNEED,
    but it has a problem: if the kernel encounters a page that is being
    written during invalidate_mapping_pages(), it can't drop it. This makes
    the interface hard for application programmers to use, since they
    always have to sync the data before calling fadvise(...,
    POSIX_FADV_DONTNEED) to make sure the pages are discardable. In the end
    they can't take advantage of the kernel's deferred writes, and so they
    see a performance loss.
    (http://insights.oetiker.ch/linux/fadvise.html)

    In fact, invalidation is a very strong hint to the reclaimer: it means
    we don't use the page any more. So let's move a page that is being
    written to the head of the inactive list if we can't truncate it right
    now.

    The reason for moving the page to the head of the LRU in this patch is
    that a dirty/writeback page will be flushed sooner or later anyway;
    this prevents writeout via pageout(), which is less efficient than the
    flusher's writeout.

    Originally I reused Peter's lru_demote patch with some changes, so his
    Signed-off-by is added.
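
    The mechanism, sketched against invalidate_mapping_pages() (simplified;
    deactivate_page() is the new helper that moves the page toward the
    inactive list):

    ret = invalidate_inode_page(page);
    /*
     * Invalidation is a strong hint that the page is no longer of
     * interest; if we couldn't drop it, speed up its reclaim.
     */
    if (!ret)
            deactivate_page(page);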

    Signed-off-by: Minchan Kim
    Reported-by: Ben Gamari
    Signed-off-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Cc: Wu Fengguang
    Acked-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit
    builds. On my config this shrinks mm_struct by enough to fit in one
    fewer cache lines and allows more objects per slab in mm_struct
    kmem_cache under SLUB.

    slabinfo before patch :-
    Sizes (bytes)             Slabs
    --------------------------------
    Object :   848            Total  :  9
    SlabObj:   896            Full   :  2
    SlabSiz: 16384            Partial:  5
    Loss   :    48            CpuSlab:  2
    Align  :    64            Objects: 18

    slabinfo after :-
    Sizes (bytes)             Slabs
    --------------------------------
    Object :   832            Total  :  7
    SlabObj:   832            Full   :  2
    SlabSiz: 16384            Partial:  3
    Loss   :     0            CpuSlab:  2
    Align  :    64            Objects: 19

    Signed-off-by: Richard Kennedy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • TestSetPageLocked() isn't being used anywhere. Also, using it would
    likely be an error, since the proper interface trylock_page() provides
    stronger ordering guarantees.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This patch changes the anon_vma refcount to be 0 when the object is
    free. It does this by adding one reference for the anon_vma being in
    use (i.e. the anon_vma->head list is not empty).

    This allows a simpler release scheme without having to check both the
    refcount and the list as well as avoids taking a ref for each entry on the
    list.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • We need the anon_vma refcount unconditionally to simplify the anon_vma
    lifetime rules.

    Signed-off-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Now that remove_from_page_cache() has been renamed to
    delete_from_page_cache(), change the name of the internal page cache
    handling function too, for consistency with __remove_from_swap_cache()
    and remove_from_swap_cache().

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim