Eric Lee / smarc-fsl-linux-kernel

27 May, 2011

40 commits

19de85ef5 bitops: add #ifndef for each of find bitops

The style that we normally use in asm-generic is to test the macro itself
for existence, so in asm-generic, do:

#ifndef find_next_zero_bit_le
extern unsigned long find_next_zero_bit_le(const void *addr,
unsigned long size, unsigned long offset);
#endif

and in the architectures, write

static inline unsigned long find_next_zero_bit_le(const void *addr,
unsigned long size, unsigned long offset)
#define find_next_zero_bit_le find_next_zero_bit_le

This adds the #ifndef for each of the find bitops in the generic header
and source files.
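
A minimal userspace sketch of this override convention follows. The names demo_find_first_bit, DEMO_BITS_PER_LONG and DEMO_USING_GENERIC are hypothetical and exist only for illustration; the real headers are split across asm-generic and the architecture directories.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical "architecture" header: provides its own version and
 * defines a macro of the same name so the generic header can detect it.
 */
static inline unsigned long demo_find_first_bit(const unsigned long *addr,
						unsigned long size)
{
	unsigned long i;

	assert(addr != NULL);
	for (i = 0; i < size; i++)
		if (addr[i / DEMO_BITS_PER_LONG] &
		    (1UL << (i % DEMO_BITS_PER_LONG)))
			return i;
	return size;	/* no bit set */
}
#define demo_find_first_bit demo_find_first_bit

/*
 * Hypothetical "generic" header: declares the out-of-line fallback
 * only when no architecture has claimed the name.
 */
#ifndef demo_find_first_bit
extern unsigned long demo_find_first_bit(const unsigned long *addr,
					 unsigned long size);
#define DEMO_USING_GENERIC 1
#else
#define DEMO_USING_GENERIC 0
#endif
```

Because the macro expands to its own name, callers are unaffected; the macro only serves as a marker that the preprocessor can test.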

Suggested-by: Arnd Bergmann
Signed-off-by: Akinobu Mita
Acked-by: Russell King
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Greg Ungerer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-05-27 08:12:38 +0800
a2812e178 arch: add #define for each of optimized find bitops

The style that we normally use in asm-generic is to test the macro itself
for existence, so in asm-generic, do:

#ifndef find_next_zero_bit_le
extern unsigned long find_next_zero_bit_le(const void *addr,
unsigned long size, unsigned long offset);
#endif

and in the architectures, write

static inline unsigned long find_next_zero_bit_le(const void *addr,
unsigned long size, unsigned long offset)
#define find_next_zero_bit_le find_next_zero_bit_le

This adds the #define for each of the optimized find bitops in the
architectures.

Suggested-by: Arnd Bergmann
Signed-off-by: Akinobu Mita
Acked-by: Hans-Christian Egtvedt
Acked-by: Russell King
Acked-by: Greg Ungerer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Acked-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-05-27 08:12:38 +0800
e0819410d m68knommu: fix build error due to the lack of find_next_bit_le()

m68knommu can't build ext4, udf, and ocfs2 due to the lack of
find_next_bit_le().

This implements find_next_bit_le() on m68knommu by duplicating the generic
find_next_bit_le() in lib/find_next_bit.c.

Signed-off-by: Akinobu Mita
Acked-by: Greg Ungerer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-05-27 08:12:38 +0800
275ac7462 w1: add Maxim/Dallas DS2780 Stand-Alone Fuel Gauge IC support

Add support for the Maxim/Dallas DS2780 Stand-Alone Fuel Gauge IC.

It was suggested to combine this functionality with the current ds2782
driver. Unfortunately, I'm unable to commit the time to refactoring this
driver to that extent and I don't have a platform with the ds2782 part to
validate that there are no regression issues by adding this functionality.

[akpm@linux-foundation.org: use min_t()]
Signed-off-by: Clifton Barnes
Tested-by: Haojian Zhuang
Cc: Evgeniy Polyakov
Cc: Ryan Mallon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Clifton Barnes
2011-05-27 08:12:38 +0800
963bb1010 w1: have netlink search update kernel list

Reorganize so the netlink connector one wire search command will update
the kernel list of detected slave devices. Otherwise, a newly detected
device is unusable: unless it is in the kernel list of known devices,
any command results in ENODEV status.

Signed-off-by: David Fries
Cc: Evgeniy Polyakov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Fries
2011-05-27 08:12:38 +0800
26a6afb91 w1: complete the 1-wire (w1) ds1wm driver search algorithm

This adds multi-slave support of the w1 bus for the ds1wm Synthesizable
1-Wire Bus Master. Also many fixes and tweaks based on the rev3 of the
datasheet http://datasheets.maxim-ic.com/en/ds/DS1WM.pdf

Signed-off-by: Jean-François Dagenais
Cc: Evgeniy Polyakov
Cc: Szabolcs Gyurko
Cc: Matt Reimer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jean-François Dagenais
2011-05-27 08:12:38 +0800
89610274b w1: add 1-wire (w1) DS2408 8-Channel Addressable Switch support

This DS2408 w1 slave driver is not complete for all the features of the
chip, but it's sufficient if you use it as a simple IO expander.

[randy.dunlap@oracle.com: fix w1_ds2408.c printk formats]
Signed-off-by: Jean-François Dagenais
Cc: Evgeniy Polyakov
Cc: Szabolcs Gyurko
Cc: Matt Reimer
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jean-François Dagenais
2011-05-27 08:12:38 +0800
67dfd54c2 w1: add 1-wire (w1) reset and resume command API support

The first patch adds generic functionality to w1_io for the Resume Command
[A5h], which lots of slaves support. I found it useful for multi-command/reset
workflows with the same slave on a multi-slave bus.

This DS2408 w1 slave driver is not complete for all the features of the
chip, but it's sufficient if you use it as a simple IO expander. Enjoy!

The ds1wm had Kconfig dependencies towards ARM && HAVE_CLK. I took them
out since I was using the ds1wm on an x86_64 platform (ds1wm in a FPGA
through pcie) and found them irrelevant.

The clock freq/divisors at the top of ds1wm.c did not have the MSB set to
1. This bit is CLK_EN, which turns the whole prescaler and dividers on.
The driver never mentioned this bit either, so I just included this bit
right in the table entries. I also took the liberty of adding a couple of
entries to the table. The spec doesn't explicitly mention these
possibilities, but the description and examination of the core show that the
prescalers & dividers can be used for more than the table explicitly
shows. The table I enlarged still doesn't cover all possibilities, but
it's a good start.

I also made a few tweaks to a couple of the read and write algorithms
which made sense while I had my head very deep in the ds1wm documentation.
We stressed it a lot with 10+ slaves on the bus, many ds2408, ds2431 and
ds2433 at the same time doing extensive interaction. It proved quite
stable in our production environment.

This patch:

Add generic functionality to w1_io for the Resume Command [A5h], which
lots of slaves support.

Signed-off-by: Jean-François Dagenais
Cc: Evgeniy Polyakov
Cc: Szabolcs Gyurko
Cc: Matt Reimer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jean-François Dagenais
2011-05-27 08:12:38 +0800
6f7bd76f0 kernel/profile.c: remove some duplicate code from profile_hits()

profile_hits() has a common check for prof_on and prof_buffer regardless
of SMP or !SMP. So, remove some duplicate code by splitting profile_hits
into two.

[akpm@linux-foundation.org: make do_profile_hits static]
Signed-off-by: Rakib Mullick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rakib Mullick
2011-05-27 08:12:37 +0800
d98808a25 drivers/char/ppdev.c: put gotten port value

parport_find_number() calls parport_get_port() on its result, so there
should be a corresponding call to parport_put_port() before dropping the
reference. Similar code is found in the function register_device() in the
same file.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@exists@
local idexpression struct parport * x;
expression ra,rr;
statement S1,S2;
@@

x = parport_find_number(...)
... when != x = rr
    when any
    when != parport_put_port(x,...)
    when != if (...) { ... parport_put_port(x,...) ...}
(
if(<+...x...+>) S1 else S2
|
if(...) { ... when != x = ra
     when forall
     when != parport_put_port(x,...)
*return...;
}
)
// </smpl>

Signed-off-by: Julia Lawall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Julia Lawall
2011-05-27 08:12:37 +0800
e2e770987 edac,rcu: use synchronize_rcu() instead of call_rcu()+rcu_barrier()

synchronize_rcu() already waits for a grace period, which is all that is
needed here.

Signed-off-by: Lai Jiangshan
Cc: Doug Thompson
Cc: "Paul E. McKenney"
Cc: Mauro Carvalho Chehab
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lai Jiangshan
2011-05-27 08:12:37 +0800
26498e89e pid: fix typo in function description

finds is misspelt as finr. No functional change.

Signed-off-by: Sisir Koppaka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sisir Koppaka
2011-05-27 08:12:37 +0800
3eb8e74ec fs/partitions/efi.c: corrupted GUID partition tables can cause kernel oops

The kernel automatically evaluates partition tables of storage devices.
The code for evaluating GUID partitions (in fs/partitions/efi.c) contains
a bug that causes a kernel oops on certain corrupted GUID partition
tables.

This bug has security impact because it allows an attacker, for example, to
prepare a storage device that crashes a kernel subsystem upon connecting
the device (e.g., a "USB Stick of (Partial) Death").

crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));

computes a CRC32 checksum over gpt covering (*gpt)->header_size bytes.
There is no validation of (*gpt)->header_size before the efi_crc32 call.

A corrupted partition table may have large values for (*gpt)->header_size.
In this case, the CRC32 computation accesses memory beyond the memory
allocated for gpt, which may cause a kernel heap overflow.

Validate value of GUID partition table header size.
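
The kind of check described above can be sketched in userspace as follows. The struct demo_gpt_header layout, the function name and the exact bounds are assumptions made for illustration; the bounds used by the actual patch may differ, and the field is assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down header: only the fields the check needs. */
struct demo_gpt_header {
	uint64_t signature;
	uint32_t revision;
	uint32_t header_size;	/* claimed size, read from untrusted media */
	uint32_t header_crc32;
};

/*
 * Return 1 if the claimed header size may safely be checksummed,
 * 0 otherwise. buf_len is the number of bytes actually read into
 * the buffer that holds hdr.
 */
static int demo_gpt_header_size_ok(const struct demo_gpt_header *hdr,
				   size_t buf_len)
{
	assert(hdr != NULL);

	/* Too small to contain even the fixed fields. */
	if (hdr->header_size < sizeof(*hdr))
		return 0;

	/* Larger than the buffer: a CRC over it would read out of bounds. */
	if (hdr->header_size > buf_len)
		return 0;

	return 1;
}
```

The caller would run this check first and compute the CRC32 only when it returns 1.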

[akpm@linux-foundation.org: fix layout and indenting]
Signed-off-by: Timo Warns
Cc: Matt Domsch
Cc: Eugene Teo
Cc: Dave Jones
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Timo Warns
2011-05-27 08:12:37 +0800
658c74cf3 drivers/char/mspec.c: use {k,v}zalloc to allocate memory

Let the memory allocator zero the allocated memory, and thus remove
the use of memset.

Signed-off-by: Rakib Mullick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rakib Mullick
2011-05-27 08:12:37 +0800
074127367 ipmi: convert to seq_file interface

The ->read_proc interface is going away, convert to seq_file.

Signed-off-by: Alexey Dobriyan
Cc: Corey Minyard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2011-05-27 08:12:37 +0800
997c136f5 fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages

The balloon driver in a Xen guest frees guest pages and marks them as
mmio. When the kernel crashes and the crash kernel attempts to read the
oldmem via /proc/vmcore a read from ballooned pages will generate 100%
load in dom0 because Xen asks qemu-dm for the page content. Since the
reads come in as 8-byte requests, each ballooned page is tried 512 times.

With this change a hook can be registered which checks whether the given
pfn is really ram. The hook has to return a value > 0 for ram pages, a
value < 0 on error (because the hypercall is not known) and 0 for non-ram
pages.
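
The hook contract can be sketched in userspace as follows. All demo_* names are hypothetical, and treating an error from the hook like "RAM" (so the page is still read) is an assumption of this sketch, not something the text above states.

```c
#include <assert.h>
#include <stddef.h>

/* Hook contract from the text: > 0 for RAM, 0 for non-RAM, < 0 on error. */
typedef int (*demo_pfn_is_ram_fn)(unsigned long pfn);

static demo_pfn_is_ram_fn demo_hook;	/* NULL: no hook registered */

/* Register a hook; only one hook at a time in this sketch. */
static int demo_register_pfn_is_ram(demo_pfn_is_ram_fn fn)
{
	assert(fn != NULL);
	if (demo_hook)
		return -1;
	demo_hook = fn;
	return 0;
}

/*
 * Return 0 only when a hook positively identifies the page as non-RAM.
 * Without a hook, or when the hook reports an error, the page is
 * treated as RAM so that it is still read.
 */
static int demo_pfn_is_ram(unsigned long pfn)
{
	if (!demo_hook)
		return 1;
	return demo_hook(pfn) != 0;
}

/* Example hook: pretend pfns 100..199 were ballooned out. */
static int demo_balloon_check(unsigned long pfn)
{
	return (pfn >= 100 && pfn < 200) ? 0 : 1;
}
```

A reader of the old memory would call demo_pfn_is_ram() per page and zero-fill the output for pages where it returns 0 instead of touching them.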

This will reduce the time to read /proc/vmcore. Without this change a
512M guest with 128M crashkernel region needs 200 seconds to read it, with
this change it takes just 2 seconds.

Signed-off-by: Olaf Hering
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Olaf Hering
2011-05-27 08:12:37 +0800
98bc93e50 proc: fix pagemap_read() error case

Currently, pagemap_read() mishandles three error and/or corner cases.

(1) If the ppos parameter is wrong, the mm refcount is leaked.
(2) If the count parameter is 0, the mm refcount is leaked too.
(3) If the current task is sleeping in kmalloc(), the system
is out of memory, and the oom-killer kills the task associated with the
proc file, the mm refcount prevents the task from freeing its memory.
The system may then hang.

Cc: Hugh Dickins
Cc: Jovi Zhang
Acked-by: Hugh Dickins
Cc: Stephen Wilson
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2011-05-27 08:12:37 +0800
30cd89039 proc: put check_mem_permission after __get_free_page in mem_write

It would be better to put check_mem_permission after __get_free_page in
mem_write, to match mem_read.

Hugh Dickins explained the reason.

check_mem_permission gets a reference to the mm. If we __get_free_page
after check_mem_permission, imagine what happens if the system is out
of memory, and the mm we're looking at is selected for killing by the
OOM killer: while we wait in __get_free_page for more memory, no memory
is freed from the selected mm because it cannot reach exit_mmap while
we hold that reference.

Reported-by: Jovi Zhang
Signed-off-by: KOSAKI Motohiro
Acked-by: Hugh Dickins
Reviewed-by: Stephen Wilson
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2011-05-27 08:12:37 +0800
a4dbf0ec2 proc/stat: use defined macro KMALLOC_MAX_SIZE

There is a macro for the max size kmalloc can allocate, so use it instead
of a hardcoded number.

Signed-off-by: Yuanhan Liu
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yuanhan Liu
2011-05-27 08:12:37 +0800
e130aa70f proc: constify status array

No need for this local array to be writable, so mark it const.

Signed-off-by: Mike Frysinger
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
0a8cb8e34 fs/proc: convert to kstrtoX()

Convert fs/proc/ from strict_strto*() to kstrto*() functions.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2011-05-27 08:12:36 +0800
57cc083ad coredump: add support for exe_file in core name

Now that exe_file is not proc FS dependent, we can use it to name the core
file. So we add a %E pattern for core file name creation, which extracts the
path from mm_struct->exe_file. It then converts slashes to exclamation marks
and pastes the result into the core file name itself.

This is useful for environments where binary names are longer than 16
characters (the current->comm limitation), where there are binaries
with the same name but in different paths, and where the binary itself
changes its current->comm after exec.

So by doing (s/$/#/ -- # is treated as git comment):

$ sysctl kernel.core_pattern='core.%p.%e.%E'
$ ln /bin/cat cat45678901234567890
$ ./cat45678901234567890
^Z
$ rm cat45678901234567890
$ fg
^\Quit (core dumped)
$ ls core*

we now get:

core.2434.cat456789012345.!root!cat45678901234567890 (deleted)
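
The slash-to-exclamation-mark conversion behind the !root!cat45678901234567890 part can be sketched in userspace as follows. The function name demo_path_to_core_name is hypothetical, and the " (deleted)" suffix handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy path into out (capacity out_len, always NUL-terminated),
 * replacing every '/' with '!'. Returns the number of characters
 * written, not counting the terminator. Output is truncated if the
 * buffer is too small.
 */
static size_t demo_path_to_core_name(const char *path, char *out,
				     size_t out_len)
{
	size_t i;

	assert(path != NULL && out != NULL && out_len > 0);
	for (i = 0; path[i] != '\0' && i + 1 < out_len; i++)
		out[i] = (path[i] == '/') ? '!' : path[i];
	out[i] = '\0';
	return i;
}
```

Replacing the separators keeps the whole path inside a single file name, so the core file still lands in the directory chosen by the rest of the pattern.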

Signed-off-by: Jiri Slaby
Cc: Al Viro
Cc: Alan Cox
Reviewed-by: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2011-05-27 08:12:36 +0800
386460138 mm: extract exe_file handling from procfs

Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe. Since we
will need the exe_file functionality also for core dumps (so the core name can
contain the full binary path), always build this functionality into the
kernel.

To achieve that, move it out of proc FS into kernel/, where it in fact
belongs. By doing that we can make dup_mm_exe_file static. Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

Signed-off-by: Jiri Slaby
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiri Slaby
2011-05-27 08:12:36 +0800
63ab25ebb kgdbts: unify/generalize gdb breakpoint adjustment

The Blackfin arch, like the x86 arch, needs to adjust the PC manually
after a breakpoint is hit as normally this is handled by the remote gdb.
However, rather than starting another arch ifdef mess, create a common
GDB_ADJUSTS_BREAK_OFFSET define for any arch to opt-in via their kgdb.h.

Signed-off-by: Mike Frysinger
Cc: Oleg Nesterov
Cc: Jason Wessel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Acked-by: Paul Mundt
Acked-by: Dongdong Deng
Cc: Sergei Shtylyov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
3cea45c6e sh: convert to asm-generic ptrace.h

Signed-off-by: Mike Frysinger
Cc: Oleg Nesterov
Cc: Jason Wessel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Paul Mundt
Cc: Sergei Shtylyov
Cc: Dongdong Deng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
c46dd6b48 x86: convert to asm-generic ptrace.h

Signed-off-by: Mike Frysinger
Cc: Oleg Nesterov
Cc: Jason Wessel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Paul Mundt
Cc: Sergei Shtylyov
Cc: Dongdong Deng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
82258c661 Blackfin: convert to asm-generic ptrace.h

Signed-off-by: Mike Frysinger
Cc: Oleg Nesterov
Cc: Jason Wessel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Paul Mundt
Cc: Sergei Shtylyov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
edeafa74e asm-generic/ptrace.h: start a common low level ptrace helper

This is a series of low level ptrace unification steps to make it easier
for common code (like KGDB) to poke at register state. This also avoids
having to duplicate higher level operations for most ports which don't
have special needs for accessing things.

This patch:

This implements a bunch of helper funcs for poking the registers of a
ptrace structure. Now common code should be able to portably update
specific registers (like kgdb updating the PC).
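
The layering can be sketched in userspace as follows: the architecture supplies only tiny accessor macros, and generic helpers are built on top of them. The struct layout and all demo_/DEMO_ names are hypothetical and are not the contents of the real header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register layout; each real architecture defines its own. */
struct demo_pt_regs {
	unsigned long pc;
	unsigned long sp;
};

/* Per-arch accessors: the only part an architecture would supply. */
#define DEMO_GET_IP(regs)	((regs)->pc)
#define DEMO_SET_IP(regs, val)	(DEMO_GET_IP(regs) = (val))

/* Generic helpers written purely in terms of the accessors above. */
static inline unsigned long
demo_instruction_pointer(const struct demo_pt_regs *regs)
{
	assert(regs != NULL);
	return DEMO_GET_IP(regs);
}

static inline void
demo_instruction_pointer_set(struct demo_pt_regs *regs, unsigned long val)
{
	assert(regs != NULL);
	DEMO_SET_IP(regs, val);
}
```

Common code such as a debugger stub could then rewind the PC after a breakpoint with demo_instruction_pointer_set() without knowing the register layout.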

Signed-off-by: Mike Frysinger
Cc: Oleg Nesterov
Cc: Jason Wessel
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Paul Mundt
Cc: Sergei Shtylyov
Cc: Dongdong Deng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mike Frysinger
2011-05-27 08:12:36 +0800
456f998ec memcg: add the pagefault count into memcg stats

Add two new stats to the per-memcg memory.stat file, which track the number
of page faults and the number of major page faults.

"pgfault"
"pgmajfault"

They are different from the "pgpgin"/"pgpgout" stats, which count the number
of pages charged/discharged to the cgroup and say nothing about reading or
writing pages to disk.

It is valuable to track the two stats both for measuring an application's
performance and for measuring the efficiency of the kernel page reclaim path.
Counting pagefaults per process is useful, but we also need the aggregated
value since processes are monitored and controlled on a per-cgroup basis in
memcg.

Functional test: check the total number of pgfault/pgmajfault of all
memcgs and compare with global vmstat value:

$ cat /proc/vmstat | grep fault
pgfault 1070751
pgmajfault 553

$ cat /dev/cgroup/memory.stat | grep fault
pgfault 1071138
pgmajfault 553
total_pgfault 1071142
total_pgmajfault 553

$ cat /dev/cgroup/A/memory.stat | grep fault
pgfault 199
pgmajfault 0
total_pgfault 199
total_pgmajfault 0

Performance test: run the page fault test (pft) with 16 threads faulting in
15G of anon pages in a 16G container. No regression was noticed in
"flt/cpu/s".

Sample output from pft:

TAG pft:anon-sys-default:
  Gb  Thr CLine    User    System      Wall  flt/cpu/s  fault/wsec
  15   16     1   0.67s   233.41s    14.76s  16798.546  266356.260

+-------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     16682.962     17344.027     16913.524     16928.812      166.5362
+  10     16695.568     16923.896     16820.604     16824.652     84.816568
No difference proven at 95.0% confidence

[akpm@linux-foundation.org: fix build]
[hughd@google.com: shmem fix]
Signed-off-by: Ying Han
Acked-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Reviewed-by: Minchan Kim
Cc: Daisuke Nishimura
Acked-by: Balbir Singh
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ying Han
2011-05-27 08:12:36 +0800
406eb0c9b memcg: add memory.numastat api for numa statistics

The new API exports numa_maps-style information on a per-memcg basis. This
is useful information: it shows the per-memcg page distribution across real
numa nodes.

One of the use cases is evaluating application performance by combining
this information with the cpu allocation to the application.

The output of memory.numa_stat follows a format similar to numa_maps:

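total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...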

And we have per-node:

total = file + anon + unevictable

$ cat /dev/cgroup/memory/memory.numa_stat
total=250020 N0=87620 N1=52367 N2=45298 N3=64735
file=225232 N0=83402 N1=46160 N2=40522 N3=55148
anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
unevictable=3735 N0=794 N1=0 N2=0 N3=2941
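
The per-node identity total = file + anon + unevictable can be checked against the sample output above with a small userspace sketch. The struct and function names are hypothetical; the numbers are copied from the sample.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NODES 4

/* One memory.numa_stat row: the leading total plus per-node counts. */
struct demo_numa_row {
	unsigned long total;
	unsigned long node[DEMO_NODES];
};

/* Values copied from the sample output. */
static const struct demo_numa_row demo_total = {
	250020, { 87620, 52367, 45298, 64735 } };
static const struct demo_numa_row demo_file = {
	225232, { 83402, 46160, 40522, 55148 } };
static const struct demo_numa_row demo_anon = {
	21053, { 3424, 6207, 4776, 6646 } };
static const struct demo_numa_row demo_unevictable = {
	3735, { 794, 0, 0, 2941 } };

/* Sum of the per-node counts of one row. */
static unsigned long demo_row_sum(const struct demo_numa_row *row)
{
	unsigned long sum = 0;
	size_t n;

	assert(row != NULL);
	for (n = 0; n < DEMO_NODES; n++)
		sum += row->node[n];
	return sum;
}

/* Return 1 if total == file + anon + unevictable holds on node n. */
static int demo_node_consistent(size_t n)
{
	assert(n < DEMO_NODES);
	return demo_total.node[n] == demo_file.node[n] + demo_anon.node[n] +
				     demo_unevictable.node[n];
}
```

For the sample, every node satisfies the identity and each row's per-node counts add up to its leading total.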

Signed-off-by: Ying Han
Cc: Balbir Singh
Cc: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Daisuke Nishimura
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ying Han
2011-05-27 08:12:36 +0800
1bac180bd memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages()

The caller of the function has been renamed to zone_nr_lru_pages(), and
this just fixes up the memcg code to match. The current name is easy to
misread as the zone's total number of pages.

Signed-off-by: Ying Han
Acked-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ying Han
2011-05-27 08:12:35 +0800
4fd14ebf6 memcg: remove unused retry signal from reclaim

If the memcg reclaim code detects the target memcg below its limit it
exits and returns a guaranteed non-zero value so that the charge is
retried.

Nowadays, the charge side checks the memcg limit itself and does not rely
on this non-zero return value trick.

This patch removes it. The reclaim code will now always return the true
number of pages it reclaimed on its own.

Signed-off-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Ying Han
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Michal Hocko
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-05-27 08:12:35 +0800
246e87a93 memcg: fix get_scan_count() for small targets

During memory reclaim we determine the number of pages to be scanned per
zone as

(anon + file) >> priority.
Assume
scan = (anon + file) >> priority.

If scan < SWAP_CLUSTER_MAX, the scan is skipped this time and the
priority gets higher. This has some problems.

1. This increases the priority by 1 without any scan.
To scan at this priority, the amount of pages must be larger than 512M.
If pages >> priority < SWAP_CLUSTER_MAX, the count is recorded and the scan
is batched later. (But we lose 1 priority.)
If memory size is below 16M, pages >> priority is 0 and nothing is ever
scanned at DEF_PRIORITY.

2. If zone->all_unreclaimable == true, the zone is scanned only when
priority == 0. So x86's ZONE_DMA will never be recovered until the user
of the pages frees memory by itself.

3. With memcg, the memory limit can be small. A small memcg reaches
priority < DEF_PRIORITY-2 very easily and then needs to call
wait_iff_congested().
To scan before priority=9, 64MB of memory must be in use.

So this patch forces a scan of SWAP_CLUSTER_MAX pages when

1. the target is small enough, and
2. it's kswapd or memcg reclaim.

Then we can avoid a rapid priority drop and may be able to recover
all_unreclaimable in small zones. This patch also removes nr_saved_scan.
This allows scanning at this priority even when pages >> priority is
very small.
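
The arithmetic behind the 512M, 16M and 64MB figures, and the forced-scan rule, can be sketched in userspace as follows. The values SWAP_CLUSTER_MAX = 32, DEF_PRIORITY = 12 and 4 KiB pages are assumptions chosen because they reproduce the figures in the text; the function name and the force parameter are hypothetical.

```c
#include <assert.h>

#define DEMO_SWAP_CLUSTER_MAX	32UL	/* assumed batch size, in pages */
#define DEMO_DEF_PRIORITY	12	/* assumed starting priority */

/*
 * Number of pages to scan at the given priority. When force is
 * non-zero and the shifted value rounds down to zero, scan one batch
 * of DEMO_SWAP_CLUSTER_MAX pages instead of skipping the scan.
 */
static unsigned long demo_scan_count(unsigned long lru_pages, int priority,
				     int force)
{
	unsigned long scan;

	assert(priority >= 0 && priority <= DEMO_DEF_PRIORITY);
	scan = lru_pages >> priority;
	if (scan == 0 && force)
		scan = DEMO_SWAP_CLUSTER_MAX;
	return scan;
}
```

With 4 KiB pages, 512M is 131072 pages, which is the smallest amount that yields a full batch of 32 at priority 12; just under 16M (4095 pages) yields 0 at priority 12; and 64MB (16384 pages) yields 32 at priority 9.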

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Ying Han
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Daisuke Nishimura
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-05-27 08:12:35 +0800
889976dbc memcg: reclaim memory from nodes in round-robin order

Presently, memory cgroup's direct reclaim frees memory from the current
node. But this causes some trouble. Usually when a set of threads works in
a cooperative way, they tend to operate on the same node. So if they hit
limits under memcg they will reclaim memory from themselves, damaging the
active working set.

For example, assume a 2-node system which has Node 0 and Node 1 and a memcg
which has a 1G limit. After some work, file cache remains and the usages
are

Node 0: 1M
Node 1: 998M.

If we then run an application on Node 0, it will eat its foot before freeing
unnecessary file caches.

This patch adds round-robin for NUMA and adds equal pressure to each node.
When using cpuset's spread memory feature, this will work very well.
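
Round-robin victim selection can be sketched in userspace as follows. The struct, the bitmask representation of eligible nodes and the function name are hypothetical; this is only an illustration of the rotation, not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NODES 4

/* Hypothetical per-memcg state for this sketch. */
struct demo_memcg {
	int last_scanned_node;	/* -1 before the first reclaim */
	unsigned int node_mask;	/* bit n set: node n may be reclaimed */
};

/*
 * Pick the next eligible node after the one used last time, wrapping
 * around, and remember it. Returns -1 if no node is eligible.
 */
static int demo_select_victim_node(struct demo_memcg *memcg)
{
	int i, node;

	assert(memcg != NULL);
	for (i = 1; i <= DEMO_MAX_NODES; i++) {
		node = (memcg->last_scanned_node + i) % DEMO_MAX_NODES;
		if (memcg->node_mask & (1u << node)) {
			memcg->last_scanned_node = node;
			return node;
		}
	}
	return -1;
}
```

In the two-node example above, successive reclaims would alternate between Node 0 and Node 1, so the 998M of file cache on Node 1 comes under pressure even when the application runs on Node 0.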

But yes, a better algorithm is needed.

[akpm@linux-foundation.org: comment editing]
[kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
Signed-off-by: Ying Han
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Daisuke Nishimura
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ying Han
2011-05-27 08:12:35 +0800
4e4c941c1 MAINTAINERS: add mm/page_cgroup.c into memcg subsystem

AFAICS mm/page_cgroup.c is for memcg subsystem, but it was directed only
to generic cgroup maintainers. Fix it.

Signed-off-by: Namhyung Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2011-05-27 08:12:35 +0800
6a5b18d2b memcg: move page-freeing code out of lock

Move page-freeing code out of swap_cgroup_mutex in the hope that it could
reduce some of the theoretical contention between swapons and/or swapoffs.

This is just a cleanup, no functional changes.

Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2011-05-27 08:12:35 +0800
33278f7f0 memcg: fix off-by-one when calculating swap cgroup map length

It allocated one more page than necessary if @max_pages was a multiple of
SC_PER_PAGE.
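
The off-by-one can be illustrated with a userspace sketch. The "divide, then add one" form is an assumption about how the extra page arose, DEMO_SC_PER_PAGE = 2048 is an assumed value, and the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_SC_PER_PAGE 2048UL	/* assumed entries per map page */

/* Assumed old form: always adds one page, even for exact multiples. */
static unsigned long demo_map_len_old(unsigned long max_pages)
{
	return max_pages / DEMO_SC_PER_PAGE + 1;
}

/* Round-up division: adds a page only when there is a remainder. */
static unsigned long demo_map_len_new(unsigned long max_pages)
{
	assert(max_pages <= ULONG_MAX - (DEMO_SC_PER_PAGE - 1));
	return (max_pages + DEMO_SC_PER_PAGE - 1) / DEMO_SC_PER_PAGE;
}
```

For max_pages = 4096 the old form yields 3 pages where 2 suffice; for 4097 both forms yield 3.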

Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Daisuke Nishimura
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2011-05-27 08:12:35 +0800
268433b8e memcg: mark init_section_page_cgroup() properly

Commit ca371c0d7e23 ("memcg: fix page_cgroup fatal error in FLATMEM")
removes call to alloc_bootmem() in the function so that it can be marked
as __meminit to reduce memory usage when MEMORY_HOTPLUG=n.

Also as the new helper function alloc_page_cgroup() is called only in the
function, it should be marked too.

Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Michal Hocko
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2011-05-27 08:12:35 +0800
39cc98f1f memcg: remove pointless next_mz nullification in mem_cgroup_soft_limit_reclaim()

next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node
selects the same mz. This doesn't make much sense as we assign to the
variable right in the next loop.

The compiler will probably optimize this out, but it is a little confusing
for anyone reading the code.

Signed-off-by: Michal Hocko
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-05-27 08:12:35 +0800
d149e3b25 memcg: add the soft_limit reclaim in global direct reclaim.

We recently added a change to global background reclaim which counts the
return value of soft_limit reclaim. This patch adds similar logic
to global direct reclaim.

We should skip scanning global LRU on shrink_zone if soft_limit reclaim
does enough work. This is the first step where we start with counting the
nr_scanned and nr_reclaimed from soft_limit reclaim into global
scan_control.

Signed-off-by: Ying Han
Cc: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ying Han
2011-05-27 08:12:35 +0800