27 May, 2011
40 commits
-
During memory reclaim we determine the number of pages to be scanned per
zone as(anon + file) >> priority.
Assume
scan = (anon + file) >> priority.If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
priority gets higher. This has some problems.1. This increases priority as 1 without any scan.
To do scan in this priority, amount of pages should be larger than 512M.
If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
batched, later. (But we lose 1 priority.)
If memory size is below 16M, pages >> priority is 0 and no scan in
DEF_PRIORITY forever.2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
So, x86's ZONE_DMA will never be recoverred until the user of pages
frees memory by itself.3. With memcg, the limit of memory can be small. When using small memcg,
it gets priority < DEF_PRIORITY-2 very easily and need to call
wait_iff_congested().
For doing scan before priorty=9, 64MB of memory should be used.Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when
1. the target is enough small.
2. it's kswapd or memcg reclaim.Then we can avoid rapid priority drop and may be able to recover
all_unreclaimable in a small zones. And this patch removes nr_saved_scan.
This will allow scanning in this priority even when pages >> priority is
very small.Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Ying Han
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Daisuke Nishimura
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Presently, memory cgroup's direct reclaim frees memory from the current
node. But this has some troubles. Usually when a set of threads works in
a cooperative way, they tend to operate on the same node. So if they hit
limits under memcg they will reclaim memory from themselves, damaging the
active working set.For example, assume 2 node system which has Node 0 and Node 1 and a memcg
which has 1G limit. After some work, file cache remains and the usages
areNode 0: 1M
Node 1: 998M.and run an application on Node 0, it will eat its foot before freeing
unnecessary file caches.This patch adds round-robin for NUMA and adds equal pressure to each node.
When using cpuset's spread memory feature, this will work very well.But yes, a better algorithm is needed.
[akpm@linux-foundation.org: comment editing]
[kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
Signed-off-by: Ying Han
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Daisuke Nishimura
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
AFAICS mm/page_cgroup.c is for memcg subsystem, but it was directed only
to generic cgroup maintainers. Fix it.Signed-off-by: Namhyung Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move page-freeing code out of swap_cgroup_mutex in the hope that it could
reduce few of theoretical contentions between swapons and/or swapoffs.This is just a cleanup, no functional changes.
Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It allocated one more page than necessary if @max_pages was a multiple of
SC_PER_PAGE.Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Daisuke Nishimura
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit ca371c0d7e23 ("memcg: fix page_cgroup fatal error in FLATMEM")
removes call to alloc_bootmem() in the function so that it can be marked
as __meminit to reduce memory usage when MEMORY_HOTPLUG=n.Also as the new helper function alloc_page_cgroup() is called only in the
function, it should be marked too.Signed-off-by: Namhyung Kim
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: Michal Hocko
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node
selects the same mz. This doesn't make much sense as we assign to the
variable right in the next loop.Compiler will probably optimize this out but it is little bit confusing
for the code reading.Signed-off-by: Michal Hocko
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We recently added the change in global background reclaim which counts the
return value of soft_limit reclaim. Now this patch adds the similar logic
on global direct reclaim.We should skip scanning global LRU on shrink_zone if soft_limit reclaim
does enough work. This is the first step where we start with counting the
nr_scanned and nr_reclaimed from soft_limit reclaim into global
scan_control.Signed-off-by: Ying Han
Cc: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The global kswapd scans per-zone LRU and reclaims pages regardless of the
cgroup. It breaks memory isolation since one cgroup can end up reclaiming
pages from another cgroup. Instead we should rely on memcg-aware target
reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
memory pressure.In the global background reclaim, we do soft reclaim before scanning the
per-zone LRU. However, the return value is ignored. This patch is the first
step to skip shrink_zone() if soft_limit reclaim does enough work.This is part of the effort which tries to reduce reclaiming pages in global
LRU in memcg. The per-memcg background reclaim patchset further enhances the
per-cgroup targetting reclaim, which I should have V4 posted shortly.Try running multiple memory intensive workloads within seperate memcgs. Watch
the counters of soft_steal in memory.stat.$ cat /dev/cgroup/A/memory.stat | grep 'soft'
soft_steal 240000
soft_scan 240000
total_soft_steal 240000
total_soft_scan 240000This patch:
In the global background reclaim, we do soft reclaim before scanning the
per-zone LRU. However, the return value is ignored.We would like to skip shrink_zone() if soft_limit reclaim does enough
work. Also, we need to make the memory pressure balanced across per-memcg
zones, like the logic vm-core. This patch is the first step where we
start with counting the nr_scanned and nr_reclaimed from soft_limit
reclaim into the global scan_control.Signed-off-by: Ying Han
Cc: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Acked-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
enums are problematic because they cannot be forward-declared:
akpm2:/home/akpm> cat t.c
enum foo;
static inline void bar(enum foo f)
{
}
akpm2:/home/akpm> gcc -c t.c
t.c:4: error: parameter 1 ('f') has incomplete typeSo move the enum's definition into a standalone header file which can be used
wherever its definition is needed.Cc: Ying Han
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling in a exponential creation time
* we may want to create a namespace without creating a cgroupThe ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.This patch removes the ns_cgroup as suggested in the following thread:
https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
The 'cgroup_clone' function is removed because it is no longer used.
This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.Signed-off-by: Daniel Lezcano
Signed-off-by: Serge E. Hallyn
Cc: Eric W. Biederman
Cc: Jamal Hadi Salim
Reviewed-by: Li Zefan
Acked-by: Paul Menage
Acked-by: Matt Helsley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Convert cgroup_attach_proc to use flex_array.
The cgroup_attach_proc implementation requires a pre-allocated array to
store task pointers to atomically move a thread-group, but asking for a
monolithic array with kmalloc() may be unreliable for very large groups.
Using flex_array provides the same functionality with less risk of
failure.This is a post-patch for cgroup-procs-write.patch.
Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make procs file writable to move all threads by tgid at once.
Add functionality that enables users to move all threads in a threadgroup
at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
current implementation makes use of a per-threadgroup rwsem that's taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.This is a pre-patch for cgroup-procs-writable.patch.
Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
Add an rwsem that lives in a threadgroup's signal_struct that's taken for
reading in the fork path, under CONFIG_CGROUPS. If another part of the
kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other
system would both depend on.This is a pre-patch for cgroup-procs-write.patch.
Signed-off-by: Ben Blum
Cc: "Eric W. Biederman"
Cc: Li Zefan
Cc: Matt Helsley
Reviewed-by: Paul Menage
Cc: Oleg Nesterov
Cc: David Rientjes
Cc: Miao Xie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When configfs_register_subsystem() fails, we unregister too many
subsystems in configfs_example_init. Decrement i by one to not unregister
non-registered subsystem.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jiri Slaby
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I find it very handy to show the average delays in milliseconds.
Example output (on 100 concurrent dd reading sparse files):
CPU count real total virtual total delay total delay average
986 3223509952 3207643301 38863410579 39.415ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
1059 5131834899 4ms
dd: read=0, write=0, cancelled_write=0Signed-off-by: Wu Fengguang
Cc: Mel Gorman
Cc: Balbir Singh
Reviewed-by: Satoru Moriya
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fixes
Documentation/accounting/getdelays.c: In function `get_family_id':
Documentation/accounting/getdelays.c:172:14: warning: variable `rc' set but not used [-Wunused-but-set-variable]Reported-by: "Justin P. Mattock"
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Fixes
Documentation/accounting/getdelays.c: In function `main':
Documentation/accounting/getdelays.c:436:7: warning: variable `i' set but not used [-Wunused-but-set-variable]Signed-off-by: Justin P. Mattock
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
As declaring counter as volatile is discouraged, it is best not to use it
in sample code as well.Signed-off-by: Nikanth Karthikesan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Originally i_lastfrag was 32 bits but then we added support for handling
64 bit metadata and it became a 64 bit variable. That was during 2007, in
54fb996ac15c "[PATCH] ufs2 write: block allocation update". Unfortunately
these casts got left behind so the value got truncated to 32 bit again.[akpm@linux-foundation.org: remove now-unneeded min_t/max_t casting]
Signed-off-by: Dan Carpenter
Cc: Evgeniy Dushistov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
[akpm@linux-foundation.org: retain the code comments]
Signed-off-by: Wolfram Sang
Cc: Vladimir Zapolskiy
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 51ba60c5 ("RTC: Cleanup rtc_class_ops->update_irq_enable()")
removed the only user of the update IRQ, so there is no need to manage it
any more.Signed-off-by: Lars-Peter Clausen
Cc: Thomas Gleixner
Cc: Alessandro Zummo
Cc: Marcelo Roberto Jimenez
Cc: John Stultz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Rajeev Kumar
Signed-off-by: Viresh Kumar
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The memory allocated using request_mem_region should be released using
release_mem_region, not release_region.The semantic patch that fixes part of this problem is as follows:
(http://coccinelle.lip6.fr/)//
@@
expression E1,E2,E3;
@@request_mem_region(E1,E2,E3)
...
?- release_region(E1,E2)
+ release_mem_region(E1,E2)
//[akpm@linux-foundation.org: use resource_size()]
Signed-off-by: Julia Lawall
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add basic support for ST m41t93 SPI RTCs. Tested with factory-new and
with "run-in" species with and without backup batteries.Signed-off-by: Nikolaus Voss
Cc: Alessandro Zummo
Cc: Grant Likely
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add support for the Micro Crystal RV3029-C2 RTC chips.
Signed-off-by: Heiko Schocher
Signed-off-by: Gregory Hermant
Cc: Wan ZongShun
Cc: Alessandro Zummo
Acked-by: Wolfram Sang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add support for EM Microelectronic EM3027 RTC chip.
Signed-off-by: Mike Rapoport
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This adds a driver for the RTC devices in VIA and WonderMedia
Systems-on-Chip. Alarm, 1Hz interrupts, reading and setting time are
supported.Signed-off-by: Alexey Charkov
Cc: Lars-Peter Clausen
Cc: Alexey Charkov
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On most architectures division is an expensive operation and accessing an
element currently requires four of them. This performance penalty
effectively precludes flex arrays from being used on any kind of fast
path. However, two of these divisions can be handled at creation time and
the others can be replaced by a reciprocal divide, completely avoiding
real divisions on access.[eparis@redhat.com: rebase on top of changes to support 0 len elements]
[eparis@redhat.com: initialize part_nr when array fits entirely in base]
Signed-off-by: Jesse Gross
Signed-off-by: Eric Paris
Cc: Dave Hansen
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
They have no meaning.
Signed-off-by: KOSAKI Motohiro
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
spin_lock_irqsave() requires unsigned long.
Signed-off-by: KOSAKI Motohiro
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We plan to remove cpus_xx() old cpumask APIs later. Also, we plan to
change mm_cpu_mask() implementation, allocate only nr_cpu_ids, thus
*mm_cpu_mask() is dangerous operation.Then, this patch convert them.
Signed-off-by: KOSAKI Motohiro
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
alpha allmodconfig:
drivers/bcma/host_pci.c: In function 'bcma_host_pci_probe':
drivers/bcma/host_pci.c:102: error: implicit declaration of function 'kzalloc'
drivers/bcma/host_pci.c:102: warning: assignment makes pointer from integer without a castCc:
Cc: John W. Linville
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
alpha allmodconfig:
drivers/video/mb862xx/mb862xxfbdrv.c: In function 'mb862xxfb_ioctl':
drivers/video/mb862xx/mb862xxfbdrv.c:323: error: implicit declaration of function 'copy_to_user'
drivers/video/mb862xx/mb862xxfbdrv.c:327: error: implicit declaration of function 'copy_from_user'Cc: Paul Mundt
Cc: Anatolij Gustschin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since this cred was not created with copy_creds(), it needs to get
initialized. Otherwise use of syscall(__NR_keyctl, KEYCTL_SESSION_TO_PARENT);
can lead to a NULL deref. Thanks to Robert for finding this.But introduced by commit 47a150edc2a ("Cache user_ns in struct cred").
Signed-off-by: Serge E. Hallyn
Reported-by: Robert Święcki
Cc: David Howells
Cc: stable@kernel.org (2.6.39)
Signed-off-by: Linus Torvalds -
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
gfs2: Drop __TIME__ usage
isdn/diva: Drop __TIME__ usage
atm: Drop __TIME__ usage
dlm: Drop __TIME__ usage
wan/pc300: Drop __TIME__ usage
parport: Drop __TIME__ usage
hdlcdrv: Drop __TIME__ usage
baycom: Drop __TIME__ usage
pmcraid: Drop __DATE__ usage
edac: Drop __DATE__ usage
rio: Drop __DATE__ usage
scsi/wd33c93: Drop __TIME__ usage
scsi/in2000: Drop __TIME__ usage
aacraid: Drop __TIME__ usage
media/cx231xx: Drop __TIME__ usage
media/radio-maxiradio: Drop __TIME__ usage
nozomi: Drop __TIME__ usage
cyclades: Drop __TIME__ usage -
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: vdso: Remove unused variable
x86-64: Optimize vDSO time()
x86-64: Add time to vDSO
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
x86-64: Move vread_tsc into a new file with sensible options
x86-64: Vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
x86-64: Don't generate cmov in vread_tsc
x86-64: Remove unnecessary barrier in vread_tsc
x86-64: Clean up vdso/kernel shared variables -
…nel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
seqlock: Get rid of SEQLOCK_UNLOCKED* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq: Remove smp_affinity_list when unregister irq proc -
* 'gpio/next' of git://git.secretlab.ca/git/linux-2.6:
gpio/via: rename VIA local config struct
basic_mmio_gpio: split into a gpio library and platform device
gpio: remove some legacy comments in build files
gpio: add trace events for setting direction and value
gpio/pca953x: Use handle_simple_irq instead of handle_edge_irq
gpiolib: export gpiochip_find
gpio: remove redundant Kconfig depends on GPIOLIB
basic_mmio_gpio: convert to non-__raw* accessors
basic_mmio_gpio: support direction registers
basic_mmio_gpio: support different input/output registers
basic_mmio_gpio: detect output method at probe time
basic_mmio_gpio: request register regions
basic_mmio_gpio: allow overriding number of gpio
basic_mmio_gpio: convert to platform_{get,set}_drvdata()
basic_mmio_gpio: remove runtime width/endianness evaluation